Corpora: Number of distinct words

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Mon Oct 29 01:42:40 UTC 2001


> > Dear list members,

There is a paper in the journal Quantitative Linguistics which
looks at the distribution of the lengths of the distinct words in a
number of Indo-European languages. Sorry, I don't remember the
exact volume and number (but it's near the initial ones).
You can then relate the word-length distribution to the file size as:

File Size = SUM_k [ #(k) * (k+1) ] = F          (1)
          ~ mean word length * N                (1.1)

where #(k) is the number of distinct words of length k (the +1
allows one byte for a delimiter after each word) and N is the
total number of distinct words.

If the given relation:

N = 6 * sqrt(F)   =>   F = N^2 / 36

is substituted into Eq. (1.1), then

mean word length = F / N = (N^2 / 36) / N = N / 36

which does not sound right.

Suppose we have 6,000 distinct words (i.e. N = 6,000). Then

F = 36,000,000 / 36 = 1 million bytes.

That seems too big compared with the file sizes of word lists I
know of. The average English word length is around 8 characters,
so a list of 6,000 distinct words should take roughly
8 * 6,000 ~ 48k bytes. Maybe I am missing something.
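
For what it is worth, the arithmetic above can be checked in a few
lines of Python (a back-of-the-envelope sketch; the 8-byte average
word length is only the rough figure mentioned above):

    import math

    N = 6000                          # assumed number of distinct words
    F_implied = (N / 6.0) ** 2        # N = 6*sqrt(F)  =>  F = N^2 / 36
    mean_len_implied = F_implied / N  # from Eq. (1.1): F ~ mean length * N
    F_word_list = 8 * N               # ~8 bytes per word in a plain word list

    print(F_implied)                  # 1000000.0 bytes (1 MB)
    print(mean_len_implied)           # ~166.7 characters per word -- implausible
    print(F_word_list)                # 48000 bytes (~48k)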

Best,

Robert Luk

> > Could anyone help me answer the following message which I've just received
> > from a colleague of mine in the Computer Science Department?
> >
> > Many thanks.
> >
> > Have a good day!
> > Sylviane Granger
> >
> > >For about 1.5 years, a colleague and I have been writing a textbook
> > >on computer programming. I have kept numerous drafts of the book during
> > >this period. Today I was curious to see how these drafts evolved. I
> > >graphed the number of distinct 'words' (character sequences delimited
> > >by non-word characters) as a function of file size.  I found that a good
> > >fit is given by the square root function:
> > >
> > >   (number of distinct words) = 6 * sqrt(file size)
> > >
> > >Is this an example of a general law?  I.e., if the text just repeated
> > >the same thing over and over, the exponent would be zero.  If the text
> > >were a long catalogue of facts, the exponent would be one.  The exponent
> > >here is exactly halfway in between.  Is it because of the structure of
> > >the book (the effort to make it coherent)?  I don't know.  Any comments
> > >or reactions welcome!
> > >
> > >I know of 'Zipf's Law' : word frequency is (supposedly) inversely
> > >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
> > >Is the square root a consequence of Zipf's Law?  Or is there more going
> > >on?
> > >
> > >Peter Van Roy
> >
> >
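
The measurement Peter describes is easy to reproduce. A minimal
sketch in Python, assuming the drafts are plain-text files (the
draft*.txt names below are just placeholders) and that a 'word' is
a maximal run of letter/digit characters:

    import glob, math, os, re

    points = []
    for path in glob.glob("draft*.txt"):      # hypothetical draft file names
        words = re.findall(r"\w+", open(path).read().lower())
        points.append((os.path.getsize(path), len(set(words))))

    # Fit (distinct words) = c * (file size)^b by least squares in log-log space.
    xs = [math.log(f) for f, n in points]
    ys = [math.log(n) for f, n in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - b * mx)
    print("distinct words ~ %.1f * (file size)^%.2f" % (c, b))

An exponent b near 0.5 with c near 6 would reproduce the reported
fit; b near 1 would correspond to the 'catalogue of facts' case.
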
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > Professor Sylviane Granger
> > Université Catholique de Louvain
> > Centre for English Corpus Linguistics
> > Collège Erasme
> > Place Blaise Pascal 1
> > B-1348 Louvain-la-Neuve
> > Belgium
> > Fax: + 3210474942
> > http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
> >
> >
> A strict application of Zipf's Law implies that the number of
> distinct words is proportional to the log of the file size.
> My impression is that this is what happens if you take novels.
> Technical books may behave in a different way.
> Best regards
>
> Giorgio
> -------------------------------------------------------------------------
> Dipartimento di Fisica                        Fax +39-06-4463158
> Universita' di Roma "La Sapienza"             giorgio.parisi at roma1.infn.it
> P.le A. Moro 2                                Tel +39-06-49913481
> Roma, Italy, I-00185     http://chimera.roma1.infn.it/GIORGIO/giorgio.html
> ------------------------------------------------------------------------
>
>
>
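
Whether strict Zipf behaviour gives logarithmic or square-root
vocabulary growth can also be checked by simulation. A rough
sketch (my own assumptions: a 50,000-word lexicon and token
counts standing in for file size, since file size is roughly
tokens * (mean word length + 1)):

    import math, random

    random.seed(0)
    LEXICON = 50000                               # assumed size of the underlying vocabulary
    ranks = range(1, LEXICON + 1)
    weights = [1.0 / r for r in ranks]            # Zipf: frequency proportional to 1/rank

    seen = set()
    drawn = 0
    for target in (10000, 40000, 160000, 640000): # growing token counts
        seen.update(random.choices(ranks, weights=weights, k=target - drawn))
        drawn = target
        print("%7d tokens: %6d distinct  (6*sqrt = %7.0f, log = %5.1f)"
              % (target, len(seen), 6 * math.sqrt(target), math.log(target)))

Whichever reference column the distinct-word count tracks shows
which growth law this simple model actually produces.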


