[Corpora-List] What proportion of letter ngrams occur in English?

Simon King Simon.King at ed.ac.uk
Mon Jan 26 09:30:34 UTC 2004


Bruce L. Lambert, Ph.D. wrote:
> I am revisiting an issue I brought up to this list several years ago,
> that is, how many legal/pronounceable strings can be generated from a
> fixed alphabet for a string of a given length.

One approach to this might be to consider legal syllables; there are
strong phonotactic constraints on valid onsets and codas, both on
allowed sequences and on total number of segments, which mean there are
only a few thousand allowable syllables in English out of hundreds of
thousands of possible phoneme sequences.
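
As a rough illustration of that gap, here is a minimal Python sketch that
compares the number of unconstrained phoneme strings with the number of
syllables licensed by an onset + nucleus + coda template. The inventories
below (PHONEMES, ONSETS, NUCLEI, CODAS) are hypothetical toy sets invented
for the example and are not a real description of English phonotactics;
the function names are likewise just illustrative.

```python
from itertools import product

# Hypothetical toy inventories, for illustration only.
# They are NOT an accurate account of English phonotactics.
PHONEMES = ["p", "t", "k", "s", "r", "l", "a", "i"]
ONSETS = ["", "p", "t", "k", "s", "r", "l",
          "pr", "tr", "kr", "pl", "kl",
          "sp", "st", "sk", "spr", "str"]
NUCLEI = ["a", "i"]
CODAS = ["", "p", "t", "k", "s", "l",
         "ps", "ts", "ks", "lp", "lt", "lk",
         "sp", "st", "sk"]


def count_unconstrained(n_phonemes, max_len):
    """Count all phoneme strings of length 1..max_len with no constraints."""
    return sum(n_phonemes ** k for k in range(1, max_len + 1))


def legal_syllables(onsets, nuclei, codas):
    """Build every onset + nucleus + coda combination as a string."""
    return {o + n + c for o, n, c in product(onsets, nuclei, codas)}


# Longest syllable the template can produce, measured in phonemes
# (each toy phoneme is written as a single character).
max_len = (max(len(o) for o in ONSETS)
           + max(len(n) for n in NUCLEI)
           + max(len(c) for c in CODAS))

legal = legal_syllables(ONSETS, NUCLEI, CODAS)
unconstrained = count_unconstrained(len(PHONEMES), max_len)

print(f"legal syllables:      {len(legal)}")
print(f"unconstrained strings: {unconstrained}")
print(f"proportion legal:      {len(legal) / unconstrained:.4%}")
```

With these toy sets the template licenses 510 syllables against 299592
unconstrained strings of up to six phonemes. Real inventories would give
different figures, but the same kind of gap.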

Of course, this is not in terms of character strings. But, for made-up
words like drug names I would guess the letter-to-sound correspondence
would be much more regular than for real words, so it would still work.

Simon

--
Dr. Simon King                               Simon.King at ed.ac.uk
Centre for Speech Technology Research          www.cstr.ed.ac.uk
For MSc/PhD info, visit  www.hcrc.ed.ac.uk/language-at-edinburgh
