[Corpora-List] entropy of text
Dinoj Surendran
dinoj at cs.uchicago.edu
Wed Feb 19 00:06:11 UTC 2003
Hello everyone,
Suppose you have a text with C character types and N character tokens
(so for a large book C would be under 50 and N in the thousands or millions),
and you want to compute the entropy of the text. Suppose further that you're
doing this by finding the limit of H_k/k for large k, where H_k is the
entropy of the k-grams of the text. Naturally, you can't take k very large if
N is small.
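For concreteness, here is a minimal sketch of the estimator described above: the plug-in (maximum-likelihood) entropy of the empirical k-gram distribution, divided by k. The function names are my own; this is just an illustration of the H_k/k computation, not a recommendation about how large k may safely be taken.

```python
from collections import Counter
from math import log2

def kgram_entropy(text, k):
    """Plug-in entropy (in bits) of the empirical k-gram distribution."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    counts = Counter(grams)
    n = len(grams)
    # H_k = -sum p log2 p over observed k-grams
    return -sum((c / n) * log2(c / n) for c in counts.values())

def entropy_rate_estimates(text, kmax):
    """H_k / k for k = 1..kmax; the entropy rate is the limit as k grows."""
    return [kgram_entropy(text, k) / k for k in range(1, kmax + 1)]
```

Note that the plug-in estimate is biased downward once C^k approaches N, since most k-grams are then unseen, which is exactly the regime the question above is about.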
Can anyone point me to some good references on how large one can take k to
be for a given C and N (and possibly other factors)? I'm looking at C=40
and N=80 000.
Thanks,
Dinoj Surendran
Graduate Student
Computer Science Dept
University of Chicago
PS - while I'm here, does anyone know of any online, freely available,
large (>50 000 phonemes) corpora of phoneme-transcribed spontaneous
conversation? I've got the Switchboard one for American English,
http://www.isip.msstate.edu/projects/switchboard/
which has 80 000 phonemes syllabified into about 30 000 syllables.
Similar corpora for any language would be useful.