[Corpora-List] Corpus size and accuracy of frequency listings

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Sun Apr 5 13:48:34 UTC 2009


Hi Mark

I have some data from the Birmingham Collection of English Text (18m; c 1986)
and the Bank of English corpus (418m; c 2000) which may be relevant to your
question.

Unfortunately this comparison is very inexact. The two corpora were compiled 14 years apart,
using different design policies, data-collection strategies and procedures, and different
technologies; the corpora differ substantially in composition; and the frequencies were based
on different tokenization principles, and so on.

Also, I do not have lemmatized frequencies to offer, only type frequencies. And I only
have the examples given below, and cannot generate any new lists.

However, the fact that there were (albeit small) changes in rank even in the top 10 items
of the type frequency lists suggests that effects of corpus size on lemmas lower down the
lists could be substantial:

TYPE            18m FREQ     418m FREQ

the            1,081,654    22,849,031
of               535,391    10,551,630
and              511,333     9,787,093
to               479,191    10,429,009
a                419,798     9,279,905
in               334,183     7,518,069
that             215,332     4,175,495
s                      -     4,072,762
is                     -     3,900,784
it               198,578     3,771,509
for                    -     3,690,466
i                197,055     3,216,005
was              194,286     3,092,967

( - : no 18m figure given )


An inspection of some random types at various levels in the lists seems to bear this out. By rank 5000
in the 18m corpus, we see variations of 5000+ ranks in the 418m corpus (i.e. from 'prey' downwards):

TYPE            18m RANK    18m FREQ    418m RANK    418m FREQ

been                  48      48,068           47    1,019,904
people                75      26,057           72      610,679
how                   94      20,906          104      393,586
going                129      14,924          147      288,607
away                 150      12,168          225      185,260
house                176       9,890          206      198,592
widely             2,500         660        2,486       17,804
prey               5,000         280        9,211        3,185
fulfilment        10,000         107       15,122        1,506
balloon           15,000          58        9,011        3,298
compromises       20,000          37       16,395        1,327
scenic            25,000          26       15,651        1,429
fungal            40,000          11       25,633          628
peyote            70,000           4       58,153          129

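(In case anyone wants to repeat this kind of spot-check on their own data, here is a rough Python
sketch; the two file names are only placeholders, and I assume one 'type<TAB>frequency' pair per line:)

def load_ranks(path):
    """Read a 'type<TAB>frequency' list and return {type: (rank, freq)}, rank 1 = most frequent."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = int(count.replace(",", ""))
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {w: (rank, freqs[w]) for rank, w in enumerate(ordered, start=1)}

small = load_ranks("bcet_18m_types.tsv")   # placeholder file names
large = load_ranks("boe_418m_types.tsv")

for probe in ("been", "people", "prey", "fulfilment", "peyote"):
    if probe in small and probe in large:
        (r1, f1), (r2, f2) = small[probe], large[probe]
        print(f"{probe:12s} rank {r1:>6} -> {r2:>6}   freq {f1:>9,} -> {f2:>10,}")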

I do not know what would happen if you (for example) extracted a subset of complete texts from a
100m corpus to form a 10m corpus or 1m corpus. But perhaps this exercise has in effect been
conducted already with the BNC, when they produced the Sampler, World Edition, etc? This would
at least reduce many of the differences between BCET and BoE that I mentioned earlier. And perhaps
the relevant lemma lists already exist?
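(For what it is worth, drawing such a subcorpus of complete texts would be easy enough to script;
a rough sketch, in which the directory layout and the 10m-word target are purely illustrative:)

import glob
import random

TARGET = 10_000_000                        # illustrative token budget for the subcorpus
paths = glob.glob("corpus_100m/*.txt")     # placeholder layout: one file per complete text
random.shuffle(paths)

chosen, total = [], 0
for path in paths:
    with open(path, encoding="utf-8") as f:
        n_tokens = len(f.read().split())   # crude whitespace tokenization, for illustration only
    if total + n_tokens > TARGET:
        continue                           # skip texts that would overshoot the budget
    chosen.append(path)
    total += n_tokens

print(f"{len(chosen)} complete texts selected, {total:,} running words")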

Your proposal of selecting every 10th running word from the texts in a 100m corpus to create
a '10m corpus' would imply an approximately even distribution of types across the 100m corpus?
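(If I have understood the proposal correctly, it amounts to something like the following; the
whitespace tokenization and the step size are simplifications:)

def every_nth(tokens, n=10, start=0):
    """Keep every n-th running word of the text, beginning at position 'start'."""
    return tokens[start::n]

tokens = "as a matter of fact it was only a matter of time".split()
print(every_nth(tokens, n=3))   # toy step of 3: keeps positions 0, 3, 6, 9 -> ['as', 'of', 'was', 'matter']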

You mention multiword items in your email, but wouldn't your proposed procedure deny any generic or systemic
effect of the collocational and phraseological tendencies of language on the frequency of individual types (which
would be further affected by lemmatization)?

Also, wouldn't it affect different types/lemmas differently? For example, the high frequency of the
content word/type 'time' in any general corpus of English must be greatly affected by its occurrence in many common
phrases, whereas the content word/type 'people' (usually also of similarly high frequency) might participate less in phrases,
be used more in isolated contexts, and hence be less affected?

Creating lemmatized frequency lists of a 10m corpus created in this way would imply that the members of each lemma were
also distributed roughly evenly across the 100m corpus?
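(Counting lemma frequencies over such a sample is itself trivial once you have a token-to-lemma
mapping; the toy mapping below is of course hypothetical, and already embodies one particular
answer to the lemma question I raise next:)

from collections import Counter

# Hypothetical token -> lemma mapping; a real one would come from a lemmatizer,
# and would embody a particular definition of "lemma".
LEMMAS = {"is": "be", "was": "be", "been": "be", "goes": "go", "going": "go", "went": "go"}

def lemma_frequencies(tokens):
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)

print(lemma_frequencies("She was going home when he went out , as is his habit".split()))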

I have of course until now by-passed a major linguistic issue: which definition of lemma you are using, and how that affects
any lemmatized frequency lists produced.

Although I feel neither mathematically nor linguistically competent to say much more without further
evidence and discussion, wouldn't it be relatively straightforward (computationally) to implement your proposal on existing corpora?
I would certainly be very interested to know the results!
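(The comparison itself would certainly be simple to script: take, say, the top 10,000 lemmas from the
full corpus and from the reduced one, and see how far the two lists overlap and how far the shared
lemmas move in rank. A rough sketch, with placeholder file names, assuming one 'lemma<TAB>frequency'
pair per line:)

def top_n(path, n=10_000):
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            lemma, count = line.rstrip("\n").split("\t")
            freqs[lemma] = int(count)
    ordered = sorted(freqs, key=freqs.get, reverse=True)[:n]
    return {lemma: rank for rank, lemma in enumerate(ordered, start=1)}

full    = top_n("lemmas_100m.tsv")   # placeholder file names
sampled = top_n("lemmas_10m.tsv")

shared = set(full) & set(sampled)
print(f"overlap: {len(shared)} of 10,000 lemmas")
mean_shift = sum(abs(full[l] - sampled[l]) for l in shared) / len(shared)
print(f"mean rank shift among shared lemmas: {mean_shift:.1f}")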

Best
Ramesh



Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
Date: Fri, 3 Apr 2009 08:45:35 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] Corpus size and accuracy of frequency listings
To: "corpora at uib.no" <corpora at uib.no>

> Dear Mark,
> I don't think your question makes much sense -- possibly because you fail to explain what is the purpose of your frequency lists.

No, I didn't give all of the relevant details in the first message. The main issue is what an "adequate" corpus size is to create a lemma list of X words in a given language. If it's a top 10,000 lemma list, is 10,000,000 words adequate? Is 100,000,000 much better? The main point -- is it worth the effort to create a corpus ten times the size for only a small increase in accuracy? And I'm not just asking for the sake of curiosity -- there's an upcoming project that needs some data on this.

>> The effect of picking every 5th or 50th running word on the ranked list...

It would be every 5th or 50th word of running text *in the corpus*, *not* the ranked list. In this way, even words that occur mainly in multiword expressions should be fine. Adjacent words X1 and X2 would each be counted as would any other word. Sometimes the first word would be retrieved as we take words 1, 11, 21, 31... etc, and sometimes it would be the second word. It would never take the whole multiword expression together, of course, but then we're just after 1-grams for the lemma list (unless we *want* to preserve multiword units in the list, as in earlier versions of the BNC, for example).
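In code terms the check is trivial (a toy sketch, with a placeholder text file and crude tokenization): count 1-grams over the thinned stream and compare them with the full counts divided by 10.

from collections import Counter

tokens = open("running_text.txt", encoding="utf-8").read().split()   # placeholder file

full    = Counter(tokens)
thinned = Counter(tokens[::10])        # words 1, 11, 21, ... of the running text

# Every word -- including words that occur mostly inside multiword expressions --
# should come out at roughly one tenth of its full-text frequency.
for word, count in full.most_common(20):
    print(f"{word:15s} full={count:8d}  thinned={thinned[word]:7d}  expected~{count/10:9.1f}")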

And again, I'm not proposing to actually reduce a 100 million word corpus down to a 10 million word corpus -- that wouldn't make any sense. The point is whether -- for a ranked lemma list of size X -- a 10 million word corpus, for example, might be nearly as adequate as a 100 million word corpus (all other things -- genres, etc -- being equal).

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
