[Corpora-List] Re: enquiry

Timothy Baldwin tbaldwin at csli.stanford.edu
Fri Nov 14 20:50:39 UTC 2003


Hi,


> My research area is information theory.
> I have some questions within the scope of your profession and would
> appreciate it if you could help me.
> I know that the frequency of the <<<WORDS>>> in natural languages can be
> modeled by Zipf's law and there is a lot of work in this regard.
> But I am looking for a counterpart for <<<LETTER FREQUENCY>>> in natural
> languages.
>
> Does Zipf law hold for letter frequency as well?
>
> Is there any universal model for letter frequency in natural
> languages(Something like Zipf law)?
>
> If so, what are the basic references for this matter?
>
> How can I find the letter frequency for natural languages?

I'm not familiar with any work on letter frequencies, but think that you would
be likely to observe Zipfian effects in ideogram-based languages such as
Chinese and Japanese, where the boundary between characters and words is pretty
fuzzy to begin with. Certainly in looking briefly at English character
distributions in the WSJ and Brown corpora, the letter distribution is pretty
linear, but if you then go on to look at N-grams of different order, Zipfian
effects become more and more pronounced for higher values of N
(unsurprisingly). I can send on the graphs if you are interested in having a
look.
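
For anyone who wants to try this on their own data, here is a minimal Python
sketch of the kind of check described above: count character N-grams, rank
them by frequency, and fit a slope to the log-log rank/frequency curve (a
slope near -1 is what Zipf's law would predict). This is not the script used
for the WSJ/Brown figures mentioned above. The function names are made up for
illustration, and dropping non-letters (so that N-grams run across word
boundaries) is an assumption you may well want to change.

```python
import math
from collections import Counter


def char_ngram_counts(text, n):
    """Count overlapping character n-grams over lowercased letters only.

    Assumption: non-letters are dropped, so n-grams span word boundaries.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )


def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent n-gram first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks; Zipf's law predicts roughly -1.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical stand-in text; substitute the contents of a real corpus.
    sample = "the quick brown fox jumps over the lazy dog " * 50
    for n in (1, 2, 3):
        pairs = rank_frequency(char_ngram_counts(sample, n))
        print(n, len(pairs), round(loglog_slope(pairs), 3))
```

Comparing the fitted slope across N = 1, 2, 3, ... on a real corpus is one way
to see whether the curve moves toward a Zipfian shape as N grows.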

I have taken the liberty of forwarding this message to the CORPORA mailing
list to see if anyone in the wider community has anything to say on the
subject. I recommend that you subscribe to the list
(http://helmer.aksis.uib.no/corpora/welcome.txt) and have discussion of the
matter take place via the list.


Tim


