Corpora: grammar of English letter-sequences

Thu May 4 13:14:23 UTC 2000

Geoffrey Sampson wrote:

> Does anyone know of anything like a grammar of English letter-sequences --
> a system which generates the range of character-sequences which could
> plausibly occur as words of English, and a subset of which actually do?

 About a dozen years ago when I was working at TI I did some testing of
a regular grammar discovery procedure by using words from a dictionary as
sentences (letters=words).  I don't remember anything very great
coming of it; the hard problem of how to make the right
generalizations to unseen data remained.
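
 To make the generalization problem concrete, here is a minimal sketch. It
is not the discovery procedure I tested back then; it is a much simpler
stand-in, a letter-bigram automaton that accepts any string whose adjacent
letter pairs (including word boundaries) were all seen in training. The
names learn_bigram_grammar and accepts, and the four training words, are
made up for illustration.

```python
from typing import Iterable, Set, Tuple

START, END = "^", "$"  # hypothetical word-boundary markers


def learn_bigram_grammar(words: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect every adjacent symbol pair seen in the training words."""
    transitions: Set[Tuple[str, str]] = set()
    for word in words:
        symbols = [START, *word, END]
        transitions.update(zip(symbols, symbols[1:]))
    return transitions


def accepts(transitions: Set[Tuple[str, str]], word: str) -> bool:
    """A word is accepted if all of its adjacent pairs were seen in training."""
    symbols = [START, *word, END]
    return all(pair in transitions for pair in zip(symbols, symbols[1:]))


# Toy training data, assumed for illustration only.
training = ["cat", "can", "tan", "ant"]
grammar = learn_bigram_grammar(training)

for candidate in ["cat", "tat", "antan", "kkk"]:
    print(candidate, accepts(grammar, candidate))
```

 With this toy data the automaton accepts the unseen but plausible "tat",
rejects "kkk", and also accepts "antan", which shows how easily such a
grammar generalizes in the wrong direction.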

 A year or so ago I did some experiments similar but not identical to
what you're interested in: I generated all valid word spellings (up to
a certain number of letters) by generating all letter sequences, running
each through the best set of letter-to-phone rules I had, then testing each
resulting phone sequence for pronounceability by seeing if my
syllabification software could syllabify it with nothing left over.
If your definition of a grammar is any device that generates valid
sentences, I guess I was doing what you asked about.  But
the results were not great, probably because I'd trained the TTP
(letter-to-phone) rules on positive examples only.  So I trained another set, this time
also including a large number of cases like "kkk => /k k k/", and the
results were better, but still not the kind I'd be proud to publish.
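
 The generate-and-filter idea above can be sketched roughly as follows. This
is only an illustration under loud assumptions: the six-letter alphabet, the
one-letter-to-one-phone table, and the onset and coda inventories are toy
stand-ins invented here, not the actual letter-to-phone rules or
syllabification software. Every name in the code is hypothetical.

```python
from itertools import product
from typing import List

# Toy assumptions: a tiny alphabet and a one-to-one letter-to-phone table.
ALPHABET = "aiknst"
LETTER_TO_PHONE = {"a": "ae", "i": "ih", "k": "k", "n": "n", "s": "s", "t": "t"}
VOWELS = {"ae", "ih"}

# Toy inventories of consonant clusters allowed before and after a vowel.
ONSETS = {(), ("k",), ("n",), ("s",), ("t",), ("s", "k"), ("s", "t"), ("s", "n")}
CODAS = {(), ("k",), ("n",), ("s",), ("t",), ("n", "t"), ("n", "k"),
         ("s", "t"), ("s", "k"), ("k", "s"), ("n", "s"), ("t", "s")}


def to_phones(spelling: str) -> List[str]:
    """Stand-in for letter-to-phone rules: map each letter to one phone."""
    return [LETTER_TO_PHONE[letter] for letter in spelling]


def can_syllabify(phones: List[str]) -> bool:
    """True if phones split into (onset, vowel, coda) syllables, nothing left over."""
    n = len(phones)
    # reachable[i] means phones[:i] is a complete sequence of syllables.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for start in range(n):
        if not reachable[start]:
            continue
        for onset_len in range(3):
            v = start + onset_len  # position of the candidate vowel
            if v >= n or phones[v] not in VOWELS:
                continue
            if tuple(phones[start:v]) not in ONSETS:
                continue
            for coda_len in range(3):
                end = v + 1 + coda_len
                if end > n:
                    break
                if tuple(phones[v + 1:end]) in CODAS:
                    reachable[end] = True
    return n > 0 and reachable[n]


def plausible_spellings(max_len: int) -> List[str]:
    """Enumerate all letter sequences up to max_len and keep the pronounceable ones."""
    kept = []
    for length in range(1, max_len + 1):
        for letters in product(ALPHABET, repeat=length):
            spelling = "".join(letters)
            if can_syllabify(to_phones(spelling)):
                kept.append(spelling)
    return kept


words = plausible_spellings(3)
print(len(words), "of", sum(len(ALPHABET) ** k for k in range(1, 4)), "candidates kept")
```

 In this toy version "kkk" maps to /k k k/, which contains no vowel, so the
syllabifier rejects it, while "skin" and "stink" parse as single syllables.
The filter is only as good as the two components feeding it, which is where
the real difficulty lies.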

 - Bill Fisher