Corpora: Number of distinct words
Granger Sylviane
granger at lige.ucl.ac.be
Thu Oct 25 07:17:16 UTC 2001
Dear list members,
Could anyone help me answer the following message which I've just received
from a colleague of mine in the Computer Science Department?
Many thanks.
Have a good day!
Sylviane Granger
>For about a year and a half, a colleague and I have been writing a textbook
>on computer programming. I have kept numerous drafts of the book during
>this period. Today I was curious to see how these drafts evolved. I
>graphed the number of distinct 'words' (character sequences delimited
>by noncharacters) as a function of file size. I found that a good fit
>is given by the square root function:
>
> (number of distinct words) = 6 * sqrt(file size)
>
>Is this an example of a general law? For comparison, if the text just
>repeated the same thing over and over, the exponent would be zero. If
>the text were a long catalogue of facts, the exponent would be one. The
>observed exponent is exactly halfway in between. Is it because of the
>structure of the book (the effort to make it coherent)? I don't know.
>Any comments or reactions welcome!
>
>I know of 'Zipf's Law': word frequency is (supposedly) inversely
>proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
>Is the square root a consequence of Zipf's Law? Or is there more going
>on?
>
>Peter Van Roy
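
For anyone who wants to try the same measurement on their own texts, here is a minimal sketch of what the message above describes: count the distinct words in each draft, then fit (number of distinct words) = k * (file size)^b by least squares on the logarithms. The function names are hypothetical, and treating a 'word' as a maximal run of ASCII letters (case-sensitive) is an assumption; the message only says "character sequences delimited by noncharacters".

```python
import math
import re

def count_distinct_words(text):
    """Count distinct 'words' in text.

    Assumption: a word is a maximal run of ASCII letters, case-sensitive.
    The original message does not define the delimiter set precisely.
    """
    return len(set(re.findall(r"[A-Za-z]+", text)))

def fit_power_law(points):
    """Fit count = k * size**b to (size, count) pairs.

    Uses ordinary least squares on log(size) vs. log(count).
    Needs at least two points with different positive sizes.
    Returns the pair (k, b).
    """
    pairs = [(math.log(s), math.log(c)) for s, c in points if s > 0 and c > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx                       # slope = exponent
    k = math.exp(mean_y - b * mean_x)   # intercept = log of the constant
    return k, b

def fit_drafts(drafts):
    """Fit the power law to a list of draft texts (one string per draft)."""
    points = [(len(d), count_distinct_words(d)) for d in drafts]
    return fit_power_law(points)
```

Passing the contents of each saved draft as one string to fit_drafts returns the fitted constant and exponent. An exponent near 0.5 with a constant near 6 would reproduce the square-root fit reported above; how the file size is measured (characters here, via len) is likewise an assumption.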
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Professor Sylviane Granger
Université Catholique de Louvain
Centre for English Corpus Linguistics
Collège Erasme
Place Blaise Pascal 1
B-1348 Louvain-la-Neuve
Belgium
Fax: + 3210474942
http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html