[Corpora-List] What proportion of letter ngrams occur in English?

Mon Feb 2 13:55:21 UTC 2004

To answer this question I used an unreleased version of kfNgram to find
all 2- and 3-chargrams in "words" occurring 15 or more times in the BNC,
where a word is defined as a "sequence of alphabetic characters". There
are:
  648 2-chargrams (of 676 possible, about 96%)
  7,781 3-chargrams (of 17,576 possible, about 44%)

Many of the combinations probably do not reflect "legal" English
sequences, as there are abbreviations and foreign words in the corpus.
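
For readers who want to reproduce this kind of count on their own word
list, here is a minimal Python sketch. It is not kfNgram's code; the
function name chargram_counts and the toy frequency list are made up for
illustration, and it assumes input in the form of a word-to-frequency
mapping. It applies the same two filters described above (alphabetic
words only, frequency of 15 or more) and reports counts in types and in
tokens.

```python
import re
from collections import Counter

def chargram_counts(word_freqs, n, min_freq=15):
    """Return (type_counts, token_counts) for letter n-grams.

    word_freqs maps each word to its corpus frequency. Only purely
    alphabetic words with frequency >= min_freq are considered.
    """
    type_counts = Counter()    # occurrences across distinct word forms
    token_counts = Counter()   # occurrences weighted by word frequency
    for word, freq in word_freqs.items():
        word = word.lower()
        if freq < min_freq or not re.fullmatch(r"[a-z]+", word):
            continue
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            type_counts[gram] += 1
            token_counts[gram] += freq
    return type_counts, token_counts

# Toy data (hypothetical, not BNC figures).
toy = {"the": 100, "then": 20, "xyzzy": 3, "co-op": 50}
types, tokens = chargram_counts(toy, 2)
print(sorted(types.items()))   # [('en', 1), ('he', 2), ('th', 2)]
print(sorted(tokens.items()))  # [('en', 20), ('he', 120), ('th', 120)]

# Share of all possible n-grams that are attested, using the counts above.
for n, attested in ((2, 648), (3, 7781)):
    print(f"{n}-chargrams: {attested / 26**n:.1%}")
# 2-chargrams: 95.9%
# 3-chargrams: 44.3%
```

In the toy run, "xyzzy" is dropped by the frequency threshold and
"co-op" by the alphabetic filter, so only "the" and "then" contribute.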

To help determine which sequences are most common (and thus most
English) I have made lists sorted by descending frequency as well as
alphabetically, with data on frequency in types and in tokens. Put a
cutoff point where you wish.  The lists are available in a zip archive
on both my sites:
Phrases in English
   http://pie.usna.edu/BNCCharGrams.zip

KWiCFinder
   http://kwicfinder.com/BNCCharGrams.zip

Sorry you had to wait years for an answer to the easy part of your
query, Bruce!  I look forward to your analysis of the data.

Bill Fletcher

AssocProf William H. Fletcher
Language Studies Department
United States Naval Academy
Annapolis MD 21402 5030

410-293-6362 [voice]
410-293-2729 [fax]
Department
   http://usna.edu/LangStudy/
Phrases in English
   http://pie.usna.edu/
KWiCFinder
   http://kwicfinder.com/

>>> "Bruce L. Lambert, Ph.D." <lambertb at uic.edu> 1/23/2004 4:15:12 PM
>>>
I am revisiting an issue I brought up to this list several years ago,
that is, how many legal/pronounceable strings can be generated from a
fixed alphabet for a string of a given length. For example, in the U.S.,
the average drug name is 8 characters long. Given an alphabet of 26
letters and 8 sequential positions in the string, there are 26^8
possible strings. What proportion of these would actually be legal,
pronounceable strings in English? It strikes me that, because of the
strong sequential constraints on English orthography (and phonology),
the pronounceable set is much, much, much smaller than the entire set
of possible strings. But can we quantify this?

A related question: Of the 676 letter bigrams that can be constructed
from a 26 letter alphabet, how many actually occur in English? Of the
17576 letter trigrams that can be constructed from the English alphabet,
how many actually occur?

Is there a list of "legal" letter ngrams and/or phoneme ngrams? How can
I learn more about these sequential constraints?

-bruce
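
One rough way to put a number on the question of how many 8-letter
strings are "legal" is to count the strings in which every consecutive
trigram comes from a list of attested trigrams. The Python sketch below
does this with a simple dynamic program. It is only an illustration:
the function name count_strings is invented here, "every trigram is
attested" is a crude proxy for "pronounceable" rather than a
definition of it, and the allowed set would have to be loaded from a
trigram list (with whatever frequency cutoff one chooses).

```python
from collections import defaultdict
from itertools import product
from string import ascii_lowercase

def count_strings(allowed_trigrams, length):
    """Count strings of `length` letters whose every trigram is allowed."""
    if length < 3:
        raise ValueError("length must be at least 3")
    allowed = set(allowed_trigrams)
    # ways[pair] = number of valid strings so far that end in that pair
    ways = defaultdict(int)
    for gram in allowed:
        ways[gram[1:]] += 1
    # Extend by one letter at a time until the target length is reached.
    for _ in range(length - 3):
        next_ways = defaultdict(int)
        for gram in allowed:
            prev = ways.get(gram[:2], 0)
            if prev:
                next_ways[gram[1:]] += prev
        ways = next_ways
    return sum(ways.values())

# Size of the unconstrained space for an 8-letter name.
print(26**8)  # 208827064576

# Sanity check: if all 17576 trigrams are allowed, nothing is excluded.
all_trigrams = {"".join(p) for p in product(ascii_lowercase, repeat=3)}
print(count_strings(all_trigrams, 8) == 26**8)  # True

# Tiny hypothetical set: only "ababa" and "babab" survive at length 5.
print(count_strings({"aba", "bab"}, 5))  # 2
```

Dividing the result for a real trigram list by 26^8 gives an upper-bound
style estimate of the proportion, since strings built entirely from
attested trigrams can still be unpronounceable.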