[Corpora-List] What proportion of letter ngrams occur in English?

Bruce L. Lambert, Ph.D. lambertb at uic.edu
Tue Jan 27 17:42:27 UTC 2004


Not at all disappointed by the responses. I know this is a difficult and
unanswered question. Perhaps I should have supplied more context when I
initially asked my question. So here goes.

It is unfortunately rather common for drug names with similar spellings or
pronunciations (e.g., Zantac/Xanax, Celebrex/Celexa/Cerebyx) to be confused
by doctors, nurses, pharmacists and patients. Often these confusions are
harmless, but sometimes they are fatal. By our best estimates, these "wrong
drug" errors occur several million times per year in the U.S.

One response is to ask drug companies to come up with less confusing names.
They claim that is nearly impossible because "there are only 26 letters"
and the space for distinct (non-confusing) new names is "running out." So
this is the crux of the issue. Is the space for new names running out? The
only way to say is to calculate something like the "capacity" of the name
space (given some assumptions, e.g., 8 letters or three syllables). There
are many ways to approach this, several of which have been alluded to in
responses to my initial query. I'd still like to hear more. I will
summarize them in a week or so.
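
As a starting point, here is a rough sketch of the loosest possible upper
bound on that "capacity": every string over 26 letters up to a given length,
with no pronounceability or confusability constraint at all. The function
name and the choice of lengths are my own assumptions for illustration; the
real capacity is certainly far smaller.

```python
# Crude upper bound on the name space: all letter strings, no constraints.
# raw_capacity is a hypothetical helper, not an established measure.
ALPHABET_SIZE = 26


def raw_capacity(max_len, min_len=1, alphabet_size=ALPHABET_SIZE):
    """Count all strings with lengths from min_len to max_len inclusive."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


if __name__ == "__main__":
    # Strings of exactly 8 letters, then all strings of 1 to 8 letters.
    print(raw_capacity(8, min_len=8))
    print(raw_capacity(8))
```

Any realistic estimate would then shrink this figure by whatever fraction of
strings is pronounceable and sufficiently distinct from existing names.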

Also, by "legal strings" I really only mean pronounceable strings. Since
many drug names are neologisms, we don't have to worry about violating any
other rules. As long as the name can be readily pronounced, it is a
candidate to be a drug name. (There are other constraints, of course, that I
am not going into.)
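
To make "pronounceable" concrete enough to count, one very crude stand-in
(my assumption, not a real phonotactic model) is to forbid runs of more than
two vowel letters or more than two consonant letters in a row, treating
a, e, i, o, u as the only vowels. The sketch below counts such strings by
dynamic programming rather than enumeration. It ignores the point that
legality is a matter of degree, so treat the result only as an illustration
of the method.

```python
# Count strings with no run of more than max_run vowels or consonants.
# The rule and the 5/21 vowel/consonant split are illustrative assumptions.
def count_pronounceable(length, vowels=5, consonants=21, max_run=2):
    if length == 0:
        return 1  # only the empty string
    # v[r] / c[r]: number of strings ending in a run of r + 1 vowels / consonants
    v = [0] * max_run
    c = [0] * max_run
    v[0], c[0] = vowels, consonants
    for _ in range(length - 1):
        total_v, total_c = sum(v), sum(c)
        # A new vowel either starts a run (after any consonant run)
        # or extends an existing vowel run that is still below max_run.
        new_v = [total_c * vowels] + [v[r] * vowels for r in range(max_run - 1)]
        new_c = [total_v * consonants] + [c[r] * consonants for r in range(max_run - 1)]
        v, c = new_v, new_c
    return sum(v) + sum(c)


if __name__ == "__main__":
    for n in range(1, 9):
        allowed = count_pronounceable(n)
        print(n, allowed, allowed / 26 ** n)
```

The ratio printed in the last column is the proportion of all letter strings
of that length that survive this particular toy constraint.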

-bruce

At 10:12 AM 1/27/2004 +0000, Geoff Sampson wrote:
>If you feel disappointed by what you have managed to find out to date, I think
>this is probably because you are seeing it as a question with a sharply defined
>(though unknown) answer:  a given sequence is either legal or illegal; whereas
>in fact it is a question of more or less natural, not black and white.  "Q
>must be followed by U" looks like a 100% English rule, but people interested
>in aromatherapy and allied trades these days are frequently using the word
>"qi" borrowed from Chinese -- they don't always or even usually italicize it
>as a foreign word, and if we said any words borrowed from other languages
>don't count we wouldn't have much English left.  Furthermore, the constraints
>are not just "local" but longer-range.  The sequence "io" is common enough
>in English, for instance in the suffix "-ation", but I think I'm right in
>saying that "io" will only occur in words based on Latin or other non-native
>roots; whereas the letter "w" will never occur in roots from Latin or Greek.
>So is "walition" a legal English word?  Each syllable looks normal enough, but
>as a linguist I would wonder "what could the etymology of that possibly be?"
>
>This doesn't make your question a meaningless one -- far from it.  But it is
>one to which the answer can only be a broad order of magnitude rather than
>an exact number, and it is much more complicated to estimate that figure than
>it might seem to be.  I don't know any place where someone has tried to do it;
>it is not obvious why an academic linguist would want to.
>
>
>Geoffrey Sampson  MA  PhD  MBCS  ILTM
>Professor of Natural Language Computing
>
>Department of Informatics
>University of Sussex
>Falmer, Brighton BN1 9QH, England
>
>t  +44 1273 678525
>f  +44 1273 671320
>w  www.grsampson.net
>
>e-mail address no longer shown to avoid spam flood


