Corpora: Number of distinct words

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Mon Oct 29 10:13:16 UTC 2001


Hi Peter,

> Thank you for your analysis.  I have just a few remarks.
>
> >You can then relate the word length
> >distribution with the file size as:
> >
> >File Size = SUM_k [#(k) * (k+1)] = F    (1)
> >           ~ mean word length * N        (1.1)
> >
> >where #(.) is the number of times the argument has appeared
> >and N is the total number of distinct words.
> >
> >If the given relation:
> >
> >N = 6 sqrt(F) => N^2 / 36 = F
>
> This relation holds between different drafts of the
> same file (study of a text during its composition).
> Another particularity is that the text measured is
> a textbook, which likely has a structure very
> different from a novel.  Does your formula take these
> two considerations into account?

Sorry for the misunderstanding. What follows should
still be relevant, though.

> FYI, the text has 8084 distinct words for a file size
> of 1835191 characters.

For naturally occurring text, Heaps' law takes the following
form:

N = A F^B

where N and F are as defined above, B is
between 0 and 1, and A is another constant. I am not
sure whether A has to be between 0 and 1 or can be larger.
If A can be larger than 1, then I guess what you have is basically
Heaps' law with A = 6 and B = 0.5.
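
As a rough check, the power-law form above can be evaluated against the
figures quoted in this thread (F = 1835191 characters, N = 8084 distinct
words). The sketch below is only illustrative: the helper name
heaps_predict is hypothetical, and it assumes that the relation
N = 6 sqrt(F) corresponds to A = 6 and B = 0.5.

```python
import math


def heaps_predict(size, a, b):
    """Predicted number of distinct words, N = a * size**b (hypothetical helper)."""
    return a * size ** b


# Figures reported in this thread.
F = 1835191        # file size in characters
N_OBSERVED = 8084  # number of distinct words

# Assumption: N = 6 sqrt(F) is the special case a = 6, b = 0.5.
n_predicted = heaps_predict(F, a=6.0, b=0.5)
relative_error = (n_predicted - N_OBSERVED) / N_OBSERVED

# With b held at 0.5, the value of a that reproduces the observation exactly.
a_implied = N_OBSERVED / math.sqrt(F)

print(f"predicted N = {n_predicted:.0f}, observed N = {N_OBSERVED}")
print(f"relative error = {relative_error:.2%}, implied a = {a_implied:.3f}")
```

A single data point cannot determine both A and B, so this only shows
whether the one reported measurement is consistent with the square-root
relation; fitting both constants would need counts from several drafts.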

Best,

Robert Luk

> Peter
>
> --
> Peter Van Roy
> Département d'Ingénierie Informatique
> (Department of Computing Science and Engineering)
> Université catholique de Louvain
> B-1348 Louvain-la-Neuve, Belgium
>
> Email: pvr at info.ucl.ac.be
> Tel: (+32) (10) 47.83.74
> Web: http://www.info.ucl.ac.be/people/cvvanroy.html
> Mozart: http://www.mozart-oz.org
>
>


