Corpora: Number of distinct words

Granger Sylviane granger at lige.ucl.ac.be
Thu Oct 25 07:17:16 UTC 2001


Dear list members,

Could anyone help me answer the following message which I've just received 
from a colleague of mine in the Computer Science Department?

Many thanks.

Have a good day!
Sylviane Granger

>For about 1.5 years, a colleague and I have been writing a textbook
>on computer programming. I have kept numerous drafts of the book during
>this period. Today I was curious to see how these drafts evolved. I
>graphed the number of distinct 'words' (character sequences delimited
>by non-word characters) as a function of file size.  I found that a
>good fit is given by the square root function:
>
>   (number of distinct words) = 6 * sqrt(file size)
>
>Is this an example of a general law?  That is, in a fit of the form
>(number of distinct words) = c * (file size)^a, if the text just
>repeated the same thing over and over, the exponent a would be zero.
>If the text were a long catalogue of facts, the exponent would be one.
>The observed exponent is exactly halfway in between.  Is it because of
>the structure of the book (the effort to make it coherent)?  I don't
>know.  Any comments or reactions are welcome!
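
The measurement described above could be reproduced roughly as follows.
This is only a sketch under stated assumptions, not the procedure Peter
actually used: a 'word' is taken to be a maximal run of letters, digits
or underscores (the regular expression \w+), file size is counted in
characters, and the exponent is estimated by a least-squares line
through the log-log points. The function names are made up for
illustration.

```python
import math
import re


def vocabulary_growth(text, steps=20):
    """Return (size_in_chars, distinct_words) for growing prefixes of text."""
    # Prefix lengths at which the running vocabulary size is recorded.
    checkpoints = [len(text) * i // steps for i in range(1, steps + 1)]
    points = []
    seen = set()
    # Assumption: a 'word' is a maximal run of \w characters.
    matches = re.finditer(r"\w+", text)
    pending = next(matches, None)
    for limit in checkpoints:
        # Count every word that ends within the first `limit` characters.
        while pending is not None and pending.end() <= limit:
            seen.add(pending.group())
            pending = next(matches, None)
        points.append((limit, len(seen)))
    return points


def fit_power_law(points):
    """Fit words = c * size**a by least squares on logarithms; return (c, a)."""
    usable = [(s, w) for s, w in points if s > 0 and w > 0]
    xs = [math.log(s) for s, _ in usable]
    ys = [math.log(w) for _, w in usable]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                      # slope of the log-log line
    c = math.exp(mean_y - a * mean_x)  # intercept, back on the linear scale
    return c, a
```

Calling fit_power_law(vocabulary_growth(open("draft.txt").read())) on a
draft (the file name is hypothetical) would give an estimate of c and a
to compare with 6 and 0.5. A text that repeats one word over and over
yields an exponent of zero, as the quoted message predicts.
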
>
>I know of 'Zipf's Law': word frequency is (supposedly) inversely
>proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
>Is the square root a consequence of Zipf's Law?  Or is there more going
>on?
>
>Peter Van Roy
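
One way to probe the Zipf question empirically is to generate an
artificial 'text' whose words are drawn independently from a Zipf
distribution and watch how its vocabulary grows. The sketch below makes
several assumptions that are not in the message above: the vocabulary
is finite (vocab_size), the frequency of the word of rank r is
proportional to exactly 1/r, draws are independent, and text length is
counted in words (tokens) rather than file size. All names are
hypothetical.

```python
import random


def zipf_vocabulary_growth(n_tokens, vocab_size=50000, seed=0, steps=10):
    """Return (tokens_seen, distinct_words) for a simulated Zipfian text."""
    rng = random.Random(seed)           # fixed seed for repeatable runs
    ranks = range(1, vocab_size + 1)    # a word is identified by its rank
    # Assumption: probability of the word of rank r is proportional to 1/r.
    weights = [1.0 / r for r in ranks]
    tokens = rng.choices(ranks, weights=weights, k=n_tokens)

    # Token counts at which the running vocabulary size is recorded.
    checkpoints = {n_tokens * i // steps for i in range(1, steps + 1)}
    seen = set()
    points = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count in checkpoints:
            points.append((count, len(seen)))
    return points
```

Plotting the returned points on log-log axes, or fitting a straight
line to their logarithms, gives the growth exponent this simple model
produces for a chosen vocab_size. That exponent can then be set beside
the 0.5 observed for the book drafts. Any difference may simply reflect
the assumptions above rather than anything about the book.
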


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Professor Sylviane Granger
Université Catholique de Louvain
Centre for English Corpus Linguistics
Collège Erasme
Place Blaise Pascal 1
B-1348 Louvain-la-Neuve
Belgium
Fax: + 3210474942
http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html


