[Corpora-List] What proportion of letter ngrams occur in English?
William Fletcher
fletcher at usna.edu
Mon Feb 2 13:55:21 UTC 2004
To answer this question I used an unreleased version of kfNgram to find
all 2- and 3-chargrams in "words" occurring 15 or more times in the BNC,
where word is defined as "sequence of alphabetic characters". There
are:
648 2-chargrams
7,781 3-chargrams
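As a rough check on what those counts mean, they can be compared with
the number of possible n-grams. This is only a sketch: it assumes a
26-letter alphabet with case folded, which the figures suggest but which
I have not stated above, and the function name coverage is made up for
illustration.

```python
ALPHABET_SIZE = 26  # assumption: plain a-z, case folded


def coverage(n, observed_count, alphabet_size=ALPHABET_SIZE):
    """Share of all possible letter n-grams that were actually observed."""
    possible = alphabet_size ** n
    return observed_count / possible


# Counts reported above for BNC words occurring 15 or more times.
for n, observed in [(2, 648), (3, 7781)]:
    print(f"{n}-chargrams: {observed} of {ALPHABET_SIZE ** n} "
          f"({coverage(n, observed):.1%})")
# prints:
# 2-chargrams: 648 of 676 (95.9%)
# 3-chargrams: 7781 of 17576 (44.3%)
```

These are upper bounds on the "legal" share, since they still include
the abbreviations and foreign words.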
Many of the combinations probably do not reflect "legal" English
sequences, as there are abbreviations and foreign words in the corpus.
To help determine which sequences are most common (and thus most
English) I have made lists sorted by descending frequency as well as
alphabetically, with data on frequency in types and in tokens. Put a
cutoff point where you wish. The lists are available in a zip archive
on both my sites:
Phrases in English
http://pie.usna.edu/BNCCharGrams.zip
KWiCFinder
http://kwicfinder.com/BNCCharGrams.zip
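For anyone who wants to build comparable lists from their own word
frequency data, here is a minimal sketch of the general idea. It is not
kfNgram's code; the function names, the toy word list, and the exact
definitions of type and token frequency (one count per occurrence in
each distinct word, and that count weighted by the word's corpus
frequency) are assumptions made for illustration.

```python
from collections import Counter


def chargram_counts(word_freqs, n, min_freq=15):
    """Count letter n-grams in words that occur at least min_freq times.

    word_freqs maps each word to its corpus frequency.
    Returns (type_counts, token_counts) as Counters.
    """
    type_counts = Counter()
    token_counts = Counter()
    for word, freq in word_freqs.items():
        # Keep only purely alphabetic "words" above the frequency threshold.
        if freq < min_freq or not word.isalpha():
            continue
        w = word.lower()  # assumption: case is folded
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            type_counts[gram] += 1      # once per occurrence in a word type
            token_counts[gram] += freq  # weighted by the word's frequency
    return type_counts, token_counts


def above_cutoff(counts, cutoff):
    """N-grams with count >= cutoff, sorted by descending frequency."""
    return [gram for gram, c in counts.most_common() if c >= cutoff]


# Toy data, not BNC figures.
toy = {"the": 100, "then": 20, "xyzzy": 3}
types, tokens = chargram_counts(toy, 2)
print(above_cutoff(tokens, 50))
```

Raising or lowering the cutoff argument is the programmatic equivalent
of choosing your own cutoff point in the sorted lists.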
Sorry you had to wait years for an answer to the easy part of your
query, Bruce! I look forward to your analysis of the data.
Bill Fletcher
AssocProf William H. Fletcher
Language Studies Department
United States Naval Academy
Annapolis MD 21402 5030
410-293-6362 [voice]
410-293-2729 [fax]
Department
http://usna.edu/LangStudy/
Phrases in English
http://pie.usna.edu/
KWiCFinder
http://kwicfinder.com/
>>> "Bruce L. Lambert, Ph.D." <lambertb at uic.edu> 1/23/2004 4:15:12 PM >>>
I am revisiting an issue I brought up to this list several years ago,
that is, how many legal/pronounceable strings can be generated from a
fixed alphabet for a string of a given length. For example, in the U.S.,
the average drug name is 8 characters long. Given an alphabet of 26
letters and 8 sequential positions in the string, there are 26^8
possible strings. What proportion of these would actually be legal,
pronounceable strings in English? It strikes me that, because of the
strong sequential constraints on English orthography (and phonology),
the pronounceable set is much, much, much smaller than the entire set of
possible strings. But can we quantify this?

A related question: Of the 676 letter bigrams that can be constructed
from a 26-letter alphabet, how many actually occur in English? Of the
17576 letter trigrams that can be constructed from the English alphabet,
how many actually occur?

Is there a list of "legal" letter ngrams and/or phoneme ngrams? How can
I learn more about these sequential constraints?

-bruce