Corpora: Number of distinct words
Alexander Gelbukh
gelbukh at cic.ipn.mx
Sat Oct 27 00:16:41 UTC 2001
Dear colleagues,
The following paper may be relevant:
http://www.cic.ipn.mx/~gelbukh/CV/Publications/2001/CICLing-2001-Zipf.htm
Thank you!
Alexander
> -----Original Message-----
> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no]
> On Behalf Of Granger Sylviane
> Sent: Thursday, October 25, 2001 1:17 AM
> To: CORPORA at HD.UIB.NO
> Subject: Corpora: Number of distinct words
>
>
> Dear list members,
>
> Could anyone help me answer the following message which I've just
> received from a colleague of mine in the Computer Science Department?
>
> Many thanks.
>
> Have a good day!
> Sylviane Granger
>
> >For about 1.5 years, a colleague and I have been writing a textbook
> >on computer programming. I have kept numerous drafts of the book
> >during this period. Today I was curious to see how these drafts
> >evolved. I graphed the number of distinct 'words' (character
> >sequences delimited by noncharacters) as a function of file size.
> >I found that a good fit is given by the square root function:
> >
> > (number of distinct words) = 6 * sqrt(file size)
> >
> >Is this an example of a general law? I.e., if the text just repeated
> >the same thing over and over, the exponent would be zero. If the text
> >were a long catalogue of facts, the exponent would be one. The
> >exponent is exactly halfway in between. Is it because of the
> >structure of the book (the effort to make it coherent)? I don't know.
> >Any comments or reactions welcome!
> >
> >I know of 'Zipf's Law': word frequency is (supposedly) inversely
> >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
> >Is the square root a consequence of Zipf's Law? Or is there more
> >going on?
> >
> >Peter Van Roy
>
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Professor Sylviane Granger
> Université Catholique de Louvain
> Centre for English Corpus Linguistics
> Collège Erasme
> Place Blaise Pascal 1
> B-1348 Louvain-la-Neuve
> Belgium
> Fax: + 3210474942
> http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
>
>
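
The fit reported in the quoted message, (number of distinct words) =
6 * sqrt(file size), can be checked on any text by fitting a power law
k * size^beta in log-log space. The following is only a minimal Python
sketch, not the procedure Peter Van Roy used. The function names
vocabulary_growth and fit_power_law, the \w+ tokenization, and the use of
growing prefixes of a single text (rather than successive drafts) are all
assumptions made for illustration. Sublinear vocabulary growth of this
kind is commonly discussed under the name Heaps' law.

```python
import math
import re


def vocabulary_growth(text, steps=20):
    """Return (size_in_chars, distinct_words) pairs for growing prefixes.

    Assumption: a 'word' is a maximal run of \\w characters, compared
    case-insensitively. The original message does not define it precisely.
    """
    matches = list(re.finditer(r"\w+", text))
    points = []
    if not matches:
        return points
    step = max(1, len(matches) // steps)
    seen = set()
    for i, m in enumerate(matches, start=1):
        seen.add(m.group().lower())
        # Record a sample every `step` words and at the very end.
        if i % step == 0 or i == len(matches):
            points.append((m.end(), len(seen)))
    return points


def fit_power_law(points):
    """Least-squares fit of log(V) = log(k) + beta * log(size).

    Returns (k, beta) so that V is approximately k * size ** beta.
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(vocab) for _, vocab in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


if __name__ == "__main__":
    # Two limiting cases from the message: pure repetition and all-new words.
    repeated = "the same " * 1000
    catalogue = " ".join(f"w{i:05d}" for i in range(2000))
    print(fit_power_law(vocabulary_growth(repeated)))
    print(fit_power_law(vocabulary_growth(catalogue)))
```

A fitted beta near 0.5 would match the square-root observation. A text
that only repeats itself gives beta near 0, and a list of all-new words
gives beta near 1, which are the two limiting cases named in the message.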