Corpora: Number of distinct words

Giorgio Parisi Giorgio.Parisi at roma1.infn.it
Sun Oct 28 11:40:00 UTC 2001


On Thu, 25 Oct 2001, Granger Sylviane wrote:

> Dear list members,
> 
> Could anyone help me answer the following message which I've just received 
> from a colleague of mine in the Computer Science Department?
> 
> Many thanks.
> 
> Have a good day!
> Sylviane Granger
> 
> >For about 1.5 years, a colleague and I have been writing a textbook
> >on computer programming. I have kept numerous drafts of the book during
> >this period. Today I was curious to see how these drafts evolved. I
> >graphed the number of distinct 'words' (character sequences delimited
> >by noncharacters) as a function of file size.  I found that a good fit
> >is given by the square root function:
> >
> >   (number of distinct words) = 6 * sqrt(file size)
> >
> >Is this an example of a general law?  For comparison, if the text just
> >repeated the same words over and over, the exponent would be zero.  If
> >the text were a long catalogue of facts, the exponent would be one.  The
> >observed exponent is exactly halfway in between.  Is it because of the structure of the
> >book (the effort to make it coherent)?  I don't know.  Any comments or
> >reactions welcome!
> >
> >I know of 'Zipf's Law' : word frequency is (supposedly) inversely
> >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
> >Is the square root a consequence of Zipf's Law?  Or is there more going
> >on?
> >
> >Peter Van Roy
> 
> 
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Professor Sylviane Granger
> Université Catholique de Louvain
> Centre for English Corpus Linguistics
> Collège Erasme
> Place Blaise Pascal 1
> B-1348 Louvain-la-Neuve
> Belgium
> Fax: + 3210474942
> http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
> 
> 
A strict application of Zipf's Law implies that the number of
distinct words is proportional to the log of the file size.
My impression is that this is what happens with novels.
Technical books may behave differently.
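
For anyone who wants to check the growth law on their own drafts, here is a
minimal Python sketch. It is only an illustration, not the procedure used for
the original graph. Assumptions: a 'word' is taken to be a run of ASCII
letters, file size is measured in characters, and the names WORD_RE,
distinct_words_by_size, fit_power_law and step are hypothetical. It counts
distinct words in growing prefixes of a text and fits
count = a * size^b by least squares on the log-log values.

```python
import math
import re

# Assumption: a 'word' is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[A-Za-z]+")


def distinct_words_by_size(text, step=1000):
    """Return (sizes, counts): distinct words fully contained in each prefix."""
    sizes, counts = [], []
    seen = set()
    next_mark = step
    for match in WORD_RE.finditer(text):
        # Record every checkpoint that lies before the end of this word.
        while match.end() > next_mark:
            sizes.append(next_mark)
            counts.append(len(seen))
            next_mark += step
        seen.add(match.group().lower())
    # Checkpoints that fall in trailing non-word characters.
    while next_mark < len(text):
        sizes.append(next_mark)
        counts.append(len(seen))
        next_mark += step
    # Final point: the whole text.
    sizes.append(len(text))
    counts.append(len(seen))
    return sizes, counts


def fit_power_law(sizes, counts):
    """Fit log(count) = log(a) + b * log(size); return (a, b)."""
    pts = [(math.log(s), math.log(c))
           for s, c in zip(sizes, counts) if s > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points with positive values")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

Calling fit_power_law on the output of distinct_words_by_size for a draft
gives an estimated prefactor a and exponent b. An exponent near 0.5 would
match the square-root fit reported above, while logarithmic growth would show
up as an apparent exponent that keeps shrinking as the text gets longer.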
Best regards

Giorgio  
-------------------------------------------------------------------------
Dipartimento di Fisica                        Fax +39-06-4463158
Universita' di Roma "La Sapienza"             giorgio.parisi at roma1.infn.it
P.le A. Moro 2                                Tel +39-06-49913481
Roma, Italy, I-00185     http://chimera.roma1.infn.it/GIORGIO/giorgio.html 
------------------------------------------------------------------------


