[Corpora-List] entropy of text

J R Elliott jre at comp.leeds.ac.uk
Wed Feb 19 17:51:51 UTC 2003


Dinoj,

I have published a number of papers that include entropy values and
characteristics for many languages, representing most of the world's
language families.

I assume you are referring to the binomial equation for the maximum
sample size appropriate for a given entropic order.

My web site lists recent publications, but the most recent relevant
paper, which includes this excerpt, is:
Elliott, John. Detecting Languageness in: Proceedings of 6th World
Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2002),
IX, pp. 323-328. 2002, Orlando, Florida, USA.

To calculate the sample size required for a given entropic order, the
binomial equation is: N(r) = n! / (r!(n - r)!),
i.e. the binomial coefficient C(n, r),
where n = the number of symbols or patterns
and r = the entropic order.
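As a quick illustration, the formula above is the binomial coefficient, so it can be computed directly; this is a minimal Python sketch (the function name and the example values are illustrative, not from the paper):

```python
from math import comb

def sample_size(n: int, r: int) -> int:
    """N(r) = n! / (r! * (n - r)!): distinct r-combinations of n symbols."""
    return comb(n, r)

# e.g. 40 symbol types at entropic order 3
print(sample_size(40, 3))  # -> 9880
```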

Hope this helps,

John
*********************************************************
John Elliott
Centre for Computer Analysis of Language and Speech
University of Leeds.  http://www.comp.leeds.ac.uk/jre/
and Computational Intelligence Group, School of Computing
Leeds Metropolitan University
email:  jre at comp.leeds.ac.uk  or J.Elliott at lmu.ac.uk
Home: 0113 286 6517 john.elliott at leedsalumni.org.uk
*********************************************************



On Tue, 18 Feb 2003, Dinoj Surendran wrote:

> Hello everyone,
>
> Suppose you have a text involving C character types and N character tokens
> (so for a large book C would be under 50 and N several thousands/millions)
> and you want to compute the entropy of the text. Suppose further that you're
> doing this by finding the limit of H_k/k for large k, where H_k is the
> entropy of k-grams of the text. Naturally you can't take k very large if N
> is small.
>
> Can anyone point me to some good references on how large one can take k to
> be for a given C and N (and possibly other factors)? I'm looking at C=40
> and N=80 000.
>
> Thanks,
>
> Dinoj Surendran
> Graduate Student
> Computer Science Dept
> University of Chicago
>
> PS - while I'm here, does anyone know of any online, freely available,
> large (>50 000) corpora of phoneme-transcribed spontaneous conversation?
>
> I've got the switchboard one for American English.
> http://www.isip.msstate.edu/projects/switchboard/
> which has 80 000 phonemes syllabified into about 30 000 syllables.
>
> Similar corpora for any language would be useful.
>
>
>
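The H_k/k estimate described in the quoted question can be sketched as follows; this is an illustrative Python fragment (not code from either correspondent), which computes the Shannon entropy of the k-gram distribution of a string and divides by k:

```python
from collections import Counter
from math import log2

def kgram_entropy_rate(text: str, k: int) -> float:
    """H_k / k, where H_k is the Shannon entropy of the k-gram distribution."""
    # Count all overlapping k-grams in the text.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(counts.values())
    h_k = -sum((c / total) * log2(c / total) for c in counts.values())
    return h_k / k
```

For a strictly periodic text like "abab...", the rate at k = 2 comes out close to 0.5 bits per character, since only two distinct bigrams occur. In practice the estimate is only trustworthy while the number of observed k-grams is small relative to N, which is exactly the sample-size question the binomial formula above addresses.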
