[Corpora-List] Corpus size and accuracy of frequency listings
Krishnamurthy, Ramesh
r.krishnamurthy at aston.ac.uk
Sun Apr 5 13:48:34 UTC 2009
Hi Mark
I have some data from the Birmingham Collection of English Text (BCET; 18m words, c. 1986)
and the Bank of English corpus (BoE; 418m words, c. 2000) which may be relevant to your
question.
Unfortunately, the comparison is very inexact. The two corpora were compiled 14 years apart,
using different design policies, data-collection strategies and procedures, and different
technologies; the corpora differ substantially in composition; and the frequencies were based
on different tokenization principles, and so on.
Also, I do not have lemmatized frequencies to offer, only type frequencies. And I have only
the examples given below; I cannot generate any new lists.
However, the fact that there were (albeit small) changes in rank even in the top 10 items
of the type frequency lists suggests that the effects of corpus size on lemmas lower down the
lists could be substantial:
TYPE       18m FREQ    418m FREQ
the       1,081,654   22,849,031
of          535,391   10,551,630
and         511,333    9,787,093
to          479,191   10,429,009
a           419,798    9,279,905
in          334,183    7,518,069
that        215,332    4,175,495
s                 -    4,072,762
is                -    3,900,784
it          198,578    3,771,509
for               -    3,690,466
i           197,055    3,216,005
was         194,286    3,092,967
(- = no 18m frequency available in my data)
An inspection of some random types at various levels in the lists seems to bear this out. By rank 5,000
in the 18m corpus, we see variations of 5,000+ ranks in the 418m corpus (i.e. from 'prey' downwards):
TYPE          18m RANK   18m FREQ   418m RANK   418m FREQ
been                48     48,068          47   1,019,904
people              75     26,057          72     610,679
how                 94     20,906         104     393,586
going              129     14,924         147     288,607
away               150     12,168         225     185,260
house              176      9,890         206     198,592
widely           2,500        660       2,486      17,804
prey             5,000        280       9,211       3,185
fulfilment      10,000        107      15,122       1,506
balloon         15,000         58       9,011       3,298
compromises     20,000         37      16,395       1,327
scenic          25,000         26      15,651       1,429
fungal          40,000         11      25,633         628
peyote          70,000          4      58,153         129
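For anyone who wants to run the same kind of spot check on their own data, here is a minimal
sketch (Python; the file names and the naive lowercased whitespace tokenization are my
assumptions, not the procedures actually used for BCET or BoE):

from collections import Counter

def freq_list(path):
    """Type-frequency Counter for a plain-text corpus file.
    Naive lowercased whitespace tokenization -- an assumption; the
    real BCET and BoE counts used different tokenizers."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts

def ranks(counts):
    """Map each type to its 1-based frequency rank."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: r for r, word in enumerate(ordered, start=1)}

# 'small.txt' and 'large.txt' are hypothetical corpus files.
small, large = freq_list("small.txt"), freq_list("large.txt")
rs, rl = ranks(small), ranks(large)
for w in ("been", "people", "prey", "peyote"):
    print(f"{w:<12} {rs.get(w, '-'):>7} {small[w]:>10,} "
          f"{rl.get(w, '-'):>7} {large[w]:>10,}")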
I do not know what would happen if you (for example) extracted a subset of complete texts from a
100m corpus to form a 10m corpus or 1m corpus. But perhaps this exercise has in effect been
conducted already with the BNC, when they produced the Sampler, World Edition, etc? This would
at least reduce many of the differences between BCET and BoE that I mentioned earlier. And perhaps
the relevant lemma lists already exist?
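In case anyone wants to try it, here is one naive way such a subset could be drawn (a sketch
only; the flat directory of .txt files, the whitespace token counts, and the 10m target are
my assumptions):

import random
from pathlib import Path

def sample_whole_texts(corpus_dir, target_tokens, seed=0):
    """Randomly pick complete texts until roughly target_tokens is
    reached; tokens are counted by naive whitespace splitting."""
    files = sorted(Path(corpus_dir).glob("*.txt"))
    random.Random(seed).shuffle(files)
    chosen, total = [], 0
    for f in files:
        n = len(f.read_text(encoding="utf-8").split())
        if total + n > target_tokens:
            continue  # whole texts only: skip any that would overshoot
        chosen.append(f)
        total += n
    return chosen, total

# Hypothetical layout: carve a ~10m-token sub-corpus from a 100m one.
subset, size = sample_whole_texts("corpus_100m", 10_000_000)
print(f"{len(subset)} texts, {size:,} tokens")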
Your proposal of selecting every 10th running word from the texts in a 100m corpus to create
a '10m corpus' would imply approximately even distribution of types across the 100m corpus?
You mention multiword items in your email, but wouldn't your proposed procedure deny any generic or systemic
effect of the collocational and phraseological tendencies of language on the frequency of individual types (which
would be further affected by lemmatization)?
Also, wouldn't it affect different types/lemmas differently? For example, the high frequency of the
content word/type 'time' in any general corpus of English must be greatly affected by its occurrence in many common
phrases; whereas the content word/type 'people' (usually also of similarly high frequency) might participate less in phrases,
be used more in isolated contexts, and hence be less affected?
Creating lemmatized frequency lists of a 10m corpus created in this way would imply that the members of each lemma were
also distributed roughly evenly across the 100m corpus?
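That assumption could at least be checked directly. A minimal dispersion sketch (Python again;
the file name and the naive tokenization are illustrative assumptions) splits the corpus into
ten equal slices and compares a word's count in each -- roughly flat counts would support the
even-distribution assumption, while bursty words would argue against it:

def slice_counts(tokens, word, n_slices=10):
    """Occurrences of `word` in each of n equal-sized corpus slices."""
    size = len(tokens) // n_slices
    return [tokens[i * size:(i + 1) * size].count(word)
            for i in range(n_slices)]

# 'corpus.txt' is a hypothetical plain-text corpus file.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()
for w in ("time", "people"):
    counts = slice_counts(tokens, w)
    print(w, counts, "max-min spread:", max(counts) - min(counts))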
I have of course until now bypassed a major linguistic issue: which definition of lemma you are using, and how that affects
any lemmatized frequency lists produced.
Although I feel neither mathematically nor linguistically competent to say much more without further
evidence and discussion, wouldn't it be relatively straightforward (computationally) to implement your proposal on existing corpora?
I would certainly be very interested to know the results!
Best
Ramesh
Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
Date: Fri, 3 Apr 2009 08:45:35 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] Corpus size and accuracy of frequency listings
To: "corpora at uib.no" <corpora at uib.no>
> Dear Mark,
> I don't think your question makes much sense -- possibly because you fail to explain what is the purpose of your frequency lists.
No, I didn't give all of the relevant details in the first message. The main issue is what an "adequate" corpus size is for creating a lemma list of X number of words in a given language. If it's a top 10,000 lemma list, is 10,000,000 words adequate? Is 100,000,000 much better? The main point -- is it worth the effort to create a corpus ten times the size for only a small increase in accuracy? And I'm not just asking for the sake of curiosity -- there's an upcoming project that needs some data on this.
>> The effect of picking every 5th or 50th running word on the ranked list...
It would be every 5th or 50th word of running text *in the corpus*, *not* the ranked list. In this way, even words that occur mainly in multiword expressions should be fine. Adjacent words X1 and X2 would each be counted as would any other word. Sometimes the first word would be retrieved as we take words 1, 11, 21, 31... etc, and sometimes it would be the second word. It would never take the whole multiword expression together, of course, but then we're just after 1-grams for the lemma list (unless we *want* to preserve multiword units in the list, as in earlier versions of the BNC, for example).
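A minimal sketch of this selection scheme (Python; the toy text, the stride of 3, and the
whitespace tokenization are illustrative assumptions -- over all starting offsets, every
running word is sampled exactly once, so each word of a multiword unit is counted in
expectation like any other):

from collections import Counter

def every_nth(tokens, stride=10, offset=0):
    """Words offset+1, offset+1+stride, ... of running text, i.e.
    words 1, 11, 21, 31 ... when offset is 0 and stride is 10."""
    return tokens[offset::stride]

# Toy text with the multiword unit 'time after time': a stride of 3
# never keeps the whole unit together, but each word still gets sampled.
tokens = "time after time the committee met , time after time".split()
for off in range(3):
    print(off, Counter(every_nth(tokens, stride=3, offset=off)))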
And again, I'm not proposing to actually reduce a 100 million word corpus down to a 10 million word corpus -- that wouldn't make any sense. The point is whether -- for a ranked lemma list of size X -- a 10 million word corpus, for example, might be nearly as adequate as a 100 million word corpus (all other things -- genres, etc -- being equal).
Mark D.
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================