[Corpora-List] entropy of text

Dinoj Surendran dinoj at cs.uchicago.edu
Wed Feb 19 00:06:11 UTC 2003


Hello everyone,

Suppose you have a text with C character types and N character tokens
(so for a large book C would be under 50 and N in the thousands or millions),
and you want to compute the entropy of the text. Suppose further that you're
doing this by finding the limit of H_k/k for large k, where H_k is the
entropy of the distribution of k-grams in the text. Naturally you can't take
k very large if N is small.
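For concreteness, here is a minimal sketch (my own, not from any particular reference) of the maximum-likelihood version of this estimate: count the k-grams, compute the entropy H_k of their empirical distribution, and divide by k. The undersampling problem appears as soon as C^k approaches N.

```python
from collections import Counter
from math import log2

def kgram_entropy_rate(text, k):
    """Maximum-likelihood estimate of H_k / k, where H_k is the
    entropy (in bits) of the empirical distribution of k-grams
    in `text`. Overlapping k-grams are used."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    n = sum(counts.values())
    h_k = -sum((c / n) * log2(c / n) for c in counts.values())
    return h_k / k

# A highly repetitive text has a low entropy rate: "abab..." with k=2
# yields two k-gram types ("ab", "ba") in roughly equal proportion,
# so H_2 is about 1 bit and H_2/2 is about 0.5 bits per character.
print(kgram_entropy_rate("abab" * 100, 2))
```

For small k and large N this estimate is fine, but for large k most k-grams occur zero or one times, the empirical distribution looks nearly uniform over the observed types, and H_k/k is badly biased downward from the true rate; this is exactly why I am asking how large k can safely be taken.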

Can anyone point me to some good references on how large one can take k to
be for a given C and N (and possibly other factors)? I'm looking at C=40
and N=80 000.

Thanks,

Dinoj Surendran
Graduate Student
Computer Science Dept
University of Chicago

PS - while I'm here, does anyone know of any online, freely available,
large (>50 000 phonemes) corpora of phoneme-transcribed spontaneous
conversation?

I've got the switchboard one for American English.
http://www.isip.msstate.edu/projects/switchboard/
which has 80 000 phonemes syllabified into about 30 000 syllables.

Similar corpora for any language would be useful.
