Corpora: Number of distinct words
Robert Luk (COMP staff)
csrluk at comp.polyu.edu.hk
Mon Oct 29 10:13:16 UTC 2001
Hi Peter,
> Thank you for your analysis. I have just a few remarks.
>
> >You can then relate the word length
> >distribution with the file size as:
> >
> >File Size = SUM_k [#(k) * (k+1)] = F (1)
> > ~ mean word length * N (1.1)
> >
> >where #(.) is the number of times the argument has appeared
> >and N is the total number of distinct words.
> >
> >If the given relation:
> >
> >N = 6 sqrt(F) => N^2 / 36 = F
>
> This relation holds between different drafts of the
> same file (study of a text during its composition).
> Another particularity is that the text measured is
> a textbook, which likely has a structure very
> different from a novel. Does your formula take these
> two considerations into account?
Sorry for the misunderstanding. What follows
should still be relevant, though.
> FYI, the text has 8084 distinct words for a file size
> of 1835191 characters.
For naturally occurring text, Heaps' law takes the following
form:
N = A F^B
where N and F are as defined above, B is
between 0 and 1, and A is another constant. I am not
sure whether A has to lie between 0 and 1 or may fall outside that range.
If A can be larger than 1, then I guess what you have is basically
Heaps' law with A = 6 and B = 0.5.
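
As a rough check, the relation N = 6 sqrt(F) can be compared with the
figures quoted above (8084 distinct words, 1835191 characters). The
sketch below is only illustrative: the function names are hypothetical,
and it assumes B is fixed at 0.5, so only A is estimated from the single
data point.

```python
# Illustrative sketch only: heaps_predict and fit_a are hypothetical names.
# Assumption: the exponent B is fixed at 0.5, as in N = 6 sqrt(F).

def heaps_predict(f, a, b):
    """Predicted number of distinct words N = A * F**B for file size F."""
    return a * f ** b


def fit_a(n, f, b):
    """Solve N = A * F**B for A, given one observed (N, F) pair and a fixed B."""
    return n / f ** b


if __name__ == "__main__":
    observed_n = 8084      # distinct words reported by Peter
    file_size = 1835191    # file size in characters reported by Peter

    predicted_n = heaps_predict(file_size, a=6, b=0.5)
    implied_a = fit_a(observed_n, file_size, b=0.5)

    print(f"predicted N with A=6, B=0.5: {predicted_n:.0f}")
    print(f"A implied by the observed N: {implied_a:.2f}")
```

With these numbers the prediction comes out at about 8128 against the
observed 8084, and the implied A is just under 6. One data point cannot
determine B, so this says nothing about whether 0.5 is the right
exponent.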
Best,
Robert Luk
> Peter
>
> --
> Peter Van Roy
> Département d'Ingénierie Informatique
> (Department of Computing Science and Engineering)
> Université catholique de Louvain
> B-1348 Louvain-la-Neuve, Belgium
>
> Email: pvr at info.ucl.ac.be
> Tel: (+32) (10) 47.83.74
> Web: http://www.info.ucl.ac.be/people/cvvanroy.html
> Mozart: http://www.mozart-oz.org
>
>