Corpora: Number of distinct words
Robert Luk (COMP staff)
csrluk at comp.polyu.edu.hk
Mon Oct 29 01:42:40 UTC 2001
> > Dear list members,
There is a paper in the journal, Quantitative Linguistics, which
looks at the distribution of unique word lengths of a number of
Indo-European languages. Sorry I don't remember the
exact volume and number (but it's among the early ones).
You can then relate the word length
distribution with the file size as:
    File Size = F = SUM_k [#(k) * (k+1)]        (1)
                  ~ mean word length * N        (1.1)
where #(k) is the number of distinct words of length k
and N is the total number of distinct words.
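
As a rough illustration of Eq 1 and Eq 1.1, here is a minimal Python sketch. It assumes the (k+1) term stands for one separator byte per word (e.g. a newline in a word list); that reading, the function name word_list_size and the sample words are only for illustration.

```python
from collections import Counter

def word_list_size(words):
    """Return (F, N, mean_len) for the distinct words in `words`.

    F follows Eq 1: sum over lengths k of #(k) * (k + 1), where the
    extra 1 is assumed to be a single separator byte per word.
    """
    distinct = set(words)
    n = len(distinct)
    if n == 0:
        return 0, 0, 0.0
    # #(k): number of distinct words of length k
    length_counts = Counter(len(w) for w in distinct)
    f = sum(count * (k + 1) for k, count in length_counts.items())
    mean_len = sum(k * count for k, count in length_counts.items()) / n
    return f, n, mean_len

words = ["the", "cat", "sat", "on", "the", "mat"]
f, n, mean_len = word_list_size(words)
print(f, n, mean_len)
```

Under that assumption F equals N * (mean_len + 1) exactly, so Eq 1.1 drops only the separator bytes.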
If the given relation:
N = 6 sqrt(F) => N^2 / 36 = F
is substituted into Eq 1.1, then
mean length of distinct word = N / 36
which does not sound right.
Suppose we have 6,000 distinct words (i.e. N = 6,000);
then
F = 36,000,000 / 36 = 1 million bytes.
This sounds too big compared with the file sizes of word
lists that I know of. The average word length of English is around 8,
so 8 * 6,000 ~ 48k bytes. Maybe I am missing something.
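
The arithmetic above can be checked with a short Python sketch. The constant 6 and the average length 8 are the figures quoted in this message; the function name implied_file_size is only for illustration.

```python
def implied_file_size(n, c=6.0):
    """Invert N = c * sqrt(F) to obtain F = (N / c) ** 2."""
    return (n / c) ** 2

n = 6000                          # number of distinct words
f = implied_file_size(n)          # file size implied by N = 6 sqrt(F)
implied_mean = f / n              # mean length from Eq 1.1, i.e. N / 36
word_list_estimate = 8 * n        # size if the mean word length is about 8

print(f, implied_mean, word_list_estimate)
```

The implied mean length grows linearly with N, which is why substituting the square-root relation into Eq 1.1 gives an implausible figure.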
Best,
Robert Luk
> > Could anyone help me answer the following message which I've just received
> > from a colleague of mine in the Computer Science Department?
> >
> > Many thanks.
> >
> > Have a good day!
> > Sylviane Granger
> >
> > >For about 1.5 years, a colleague and I have been writing a textbook
> > >on computer programming. I have kept numerous drafts of the book during
> > >this period. Today I was curious to see how these drafts evolved. I
> > >graphed the number of distinct 'words' (character sequences delimited
> > >by noncharacters) as a function of file size. I found that a good fit
> > >is given by the square root function:
> > >
> > > (number of distinct words) = 6 * sqrt(file size)
> > >
> > >Is this an example of a general law? I.e., if the text just repeated
> > >the same over and over the exponent would be zero. If the text was a
> > >long catalogue of facts the exponent would be one. The exponent is
> > >exactly half way in between. Is it because of the structure of the
> > >book (the effort to make it coherent)? I don't know. Any comments or
> > >reactions welcome!
> > >
> > >I know of 'Zipf's Law' : word frequency is (supposedly) inversely
> > >proportional to the word's rank (1st, 2nd, 3rd most frequent, etc.).
> > >Is the square root a consequence of Zipf's Law? Or is there more going
> > >on?
> > >
> > >Peter Van Roy
> >
> >
> > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > Professor Sylviane Granger
> > Université Catholique de Louvain
> > Centre for English Corpus Linguistics
> > Collège Erasme
> > Place Blaise Pascal 1
> > B-1348 Louvain-la-Neuve
> > Belgium
> > Fax: + 3210474942
> > http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.html
> >
> >
> A strict application of Zipf's Law implies that the number of
> words is proportional to the log of the file size.
> My impression is this is what happens if you take novels.
> Technical books may behave in a different way.
> Best regards
>
> Giorgio
> -------------------------------------------------------------------------
> Dipartimento di Fisica Fax +39-06-4463158
> Universita' di Roma "La Sapienza" giorgio.parisi at roma1.infn.it
> P.le A. Moro 2 Tel +39-06-49913481
> Roma, Italy, I-00185 http://chimera.roma1.infn.it/GIORGIO/giorgio.html
> ------------------------------------------------------------------------
>
>
>