[Corpora-List] What proportion of letter ngrams occur in English?

Bruce L. Lambert, Ph.D. lambertb at uic.edu
Fri Jan 23 21:15:12 UTC 2004


I am revisiting an issue I raised on this list several years ago: how many
legal/pronounceable strings of a given length can be generated from a fixed
alphabet? For example, in the U.S., the
average drug name is 8 characters long. Given an alphabet of 26 letters and
8 sequential positions in the string, there are 26^8 possible strings. What
proportion of these would actually be legal, pronounceable strings in
English? It strikes me that, because of the strong sequential constraints
on English orthography (and phonology), the pronounceable set is much,
much, much smaller than the entire set of possible strings. But can we
quantify this?
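
One rough way to put a number on this, sketched below in Python, is to
call a string "legal" when every adjacent letter pair in it is an attested
bigram, and then count such strings by dynamic programming rather than by
enumerating all 26^8 candidates. The bigram criterion, the function name
count_strings, and the five-word sample list are all assumptions made for
illustration; a real estimate would need a large word list and probably a
stricter definition of pronounceability.

```python
from string import ascii_lowercase


def count_strings(length, allowed_bigrams, alphabet=ascii_lowercase):
    """Count strings of `length` letters in which every adjacent
    letter pair belongs to `allowed_bigrams` (hypothetical helper)."""
    if length <= 0:
        return 0
    # ways[c] = number of admissible strings of the current length ending in c
    ways = {c: 1 for c in alphabet}
    for _ in range(length - 1):
        ways = {
            c: sum(ways[p] for p in alphabet if p + c in allowed_bigrams)
            for c in alphabet
        }
    return sum(ways.values())


# Size of the unconstrained space: 26^8 = 208,827,064,576 strings.
total = 26 ** 8

# Toy stand-in for a real lexicon (assumption: replace with a large word list).
sample_words = ["drug", "name", "tablet", "capsule", "aspirin"]
allowed = {w[i:i + 2] for w in sample_words for i in range(len(w) - 1)}

legal = count_strings(8, allowed)
print(f"total strings:       {total}")
print(f"bigram-legal strings: {legal}")
print(f"proportion:           {legal / total:.3e}")
```

Because this checks only adjacent pairs, it gives an upper bound on any
stricter notion of legality built on the same word list; using trigrams
as the state would tighten it.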

A related question: Of the 676 letter bigrams that can be constructed from
a 26-letter alphabet, how many actually occur in English? Of the 17,576
letter trigrams that can be constructed from the English alphabet, how many
actually occur?
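
For a given word list, the bigram and trigram counts can be tallied
directly. The following is only a minimal sketch: the function name
attested_ngrams and the sample words are made up for illustration, ngrams
are counted within words only (not across word boundaries), and the
answer will depend heavily on which lexicon or corpus is used.

```python
def attested_ngrams(words, n):
    """Return the set of distinct letter n-grams (a-z only) that occur
    inside the given words (hypothetical helper)."""
    found = set()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            # Skip n-grams containing hyphens, apostrophes, digits,
            # or non-ASCII letters.
            if gram.isascii() and gram.isalpha():
                found.add(gram)
    return found


# Toy stand-in for a real lexicon (assumption: replace with a large word list).
sample_words = ["drug", "name", "tablet", "capsule", "aspirin"]

for n in (2, 3):
    attested = attested_ngrams(sample_words, n)
    possible = 26 ** n  # 676 bigrams, 17,576 trigrams
    print(f"{n}-grams: {len(attested)} of {possible} "
          f"({len(attested) / possible:.2%})")
```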

Is there a list of "legal" letter ngrams and/or phoneme ngrams? How can I
learn more about these sequential constraints?

-bruce


