Corpora: grammar of English letter-sequences
Bill Fisher
william.fisher at nist.gov
Thu May 4 13:14:23 UTC 2000
Geoffrey Sampson wrote:
> Does anyone know of anything like a grammar of English letter-sequences --
> a system which generates the range of character-sequences which could
> plausibly occur as words of English, and a subset of which actually do?
About a dozen years ago, when I was working at TI, I did some testing of
a regular grammar discovery procedure, using words from a dictionary as
sentences (with letters playing the role of words). I don't remember
that anything very great came of it; the hard problem remained of how to
make the right generalizations to unseen data.
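To illustrate the generalization problem, here is a minimal sketch that
treats each dictionary word as a "sentence" of letters and builds a
prefix-tree acceptor, a common starting point for regular grammar
induction. It is not the procedure used at TI, which the post does not
describe; the word list and the function names are made up for the
example.

```python
def build_prefix_tree(words):
    """Build a trie over letters; the key "$" marks an accepting state."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True  # end-of-word marker (assumes "$" is not a letter)
    return root


def accepts(tree, word):
    """Return True if the acceptor recognizes `word` exactly."""
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


# Hypothetical toy "dictionary": each word is one training sentence.
WORDS = ["cat", "cab", "can", "bat", "ban"]
tree = build_prefix_tree(WORDS)

seen = accepts(tree, "cat")      # a training word
unseen = accepts(tree, "bab")    # plausible, but never observed
prefix = accepts(tree, "ca")     # a prefix only, not a word
```

Without some state-merging step, the acceptor recognizes the training
words and nothing else, so a plausible but unseen string such as "bab"
is rejected. Deciding which states to merge is where the generalization
problem lives.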
A year or so ago I did some experiments similar but not identical to
what you're interested in: I generated all valid word spellings (up to
a certain number of letters) by generating all letter sequences, running
each through the best set of letter-to-phone rules I had, then testing
each resulting phone sequence for pronounceability by seeing whether my
syllabification software could syllabify it with nothing left over.
If your definition of a grammar is any device that generates valid
sentences, I guess I was doing what you asked about. But
the results were not great, probably because I'd trained up the
letter-to-phone (TTP) rules on only positive examples. So I trained up
another set, this time
also including a large number of cases like "kkk => /k k k/", and the
results were better, but still not the kind I'd be proud to publish.
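The generate-and-test experiment described above can be sketched
roughly as follows. Everything here is a toy stand-in: the one-phone-
per-letter table, the onset and coda inventories, and the function
names are assumptions for illustration, not the actual letter-to-phone
rules or syllabification software.

```python
from itertools import product

# Hypothetical letter-to-phone table: one phone per letter, no context.
LETTER_TO_PHONE = {"a": "ae", "i": "ih", "k": "k", "p": "p", "s": "s", "t": "t"}
VOWELS = {"ae", "ih"}
# Toy inventories of permitted syllable onsets and codas (assumed).
ONSETS = {(), ("k",), ("p",), ("s",), ("t",),
          ("s", "k"), ("s", "p"), ("s", "t")}
CODAS = {(), ("k",), ("p",), ("s",), ("t",),
         ("s", "k"), ("s", "p"), ("s", "t"),
         ("k", "s"), ("p", "s"), ("t", "s")}


def letters_to_phones(spelling):
    """Map a spelling to a tuple of phones using the toy table."""
    return tuple(LETTER_TO_PHONE[ch] for ch in spelling)


def can_syllabify(phones):
    """True if `phones` splits into onset+vowel+coda syllables with
    nothing left over."""
    n = len(phones)
    # reachable[i]: phones[:i] can be divided into whole syllables.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for start in range(n):
        if not reachable[start]:
            continue
        for v in range(start, n):
            if phones[v] not in VOWELS or phones[start:v] not in ONSETS:
                continue
            for end in range(v + 1, n + 1):
                if phones[v + 1:end] in CODAS:
                    reachable[end] = True
    return n > 0 and reachable[n]


def plausible_spellings(max_len):
    """Yield every letter sequence up to `max_len` that passes the test."""
    alphabet = sorted(LETTER_TO_PHONE)
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            spelling = "".join(letters)
            if can_syllabify(letters_to_phones(spelling)):
                yield spelling


accepted = set(plausible_spellings(3))
```

With these toy inventories "kit" and "ski" pass while "kkk" and "tk"
are filtered out, because no vowel is available to anchor a syllable.
A real system stands or falls on the quality of the letter-to-phone
rules, which is the weak point the message describes.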
- Bill Fisher