Corpora: grammar of English letter combinations
Bill Fisher
william.fisher at nist.gov
Wed May 24 17:08:24 UTC 2000
Doing some clean-up today, I ran across some
work that I did in February of last year that
looks very similar to Geoffrey's; sorry I didn't
remember it when his first query went out.
I used a large union pronlex (~400k entries) that
we have put together here as the source of word
spellings, writing each one out to a corpus file
with a space between the letters. This was then fed
into the standard utilities in the CMU-Cambridge
statistical language model toolkit to produce a
backed-off trigram language model. A program of
mine then generated "sentences" at random, choosing
each successive "word" (here, a letter) according
to the model's probabilities. Here are some of the
better non-English (afaik) results, preceded by their
estimated probabilities per letter:
# PR/NWORDS SENTENCE ...
0.0134141155 [ <s> p e d </s> ] ("ped")
0.0044879812 [ <s> p o n </s> ] ("pon")
0.0038318857 [ <s> a c k </s> ] ("ack")
0.0020768188 [ <s> z o </s> ] ("zo")
0.0017731462 [ <s> a s t </s> ] ("ast")
0.0011737253 [ <s> p r i n g </s> ] ("pring")
0.0005763704 [ <s> c o m y </s> ] ("comy")
0.0003478832 [ <s> w e l l y </s> ] ("welly")
0.0002926426 [ <s> g l i n g </s> ] ("gling")
0.0002905756 [ <s> w o o n </s> ] ("woon")
0.0001774358 [ <s> c o r t s </s> ] ("corts")
0.0001257791 [ <s> t r a n d </s> ] ("trand")
0.0000691862 [ <s> f l a d </s> ] ("flad")
0.0000521828 [ <s> d e c t i o n </s> ] ("dection")
0.0000517939 [ <s> u n k i n g </s> ] ("unking")
0.0000360357 [ <s> m i s l y </s> ] ("misly")
0.0000355770 [ <s> d e n t i o n </s> ] ("dention")
0.0000339071 [ <s> s a r i c </s> ] ("saric")
0.0000275131 [ <s> h a n c h </s> ] ("hanch")
0.0000201202 [ <s> h a i s m </s> ] ("haism")
0.0000125679 [ <s> p a r g e ' s </s> ] ("parge's")
0.0000069366 [ <s> t u t i c </s> ] ("tutic")
0.0000055470 [ <s> p e n i s m </s> ] ("penism")
0.0000054817 [ <s> h o r t l y </s> ] ("hortly")
0.0000050649 [ <s> r e - o f f </s> ] ("re-off")
0.0000030477 [ <s> p y r o f f </s> ] ("pyroff")
0.0000021522 [ <s> m a b s t </s> ] ("mabst")
0.0000010393 [ <s> w h a r c h </s> ] ("wharch")
0.0000009782 [ <s> c h e m i s m ' s </s> ] ("chemism's")
0.0000006504 [ <s> f a l l i d </s> ] ("fallid")
0.0000006480 [ <s> d e l u c k </s> ] ("deluck")
0.0000001139 [ <s> f r i b i o n s </s> ] ("fribions")
0.0000000390 [ <s> e x p a g e l </s> ] ("expagel")
0.0000000183 [ <s> p s i o l e s ' </s> ] ("psioles'")
0.0000000168 [ <s> v a x i n a </s> ] ("vaxina")
0.0000000038 [ <s> c a t t r o m e d </s> ] ("cattromed")
0.0000000021 [ <s> n a t i c i v i n g </s> ] ("naticiving")
0.0000000011 [ <s> h a f t - k o </s> ] ("haft-ko")
0.0000000004 [ <s> d i t h e = o u t </s> ] ("dithe=out")
0.0000000001 [ <s> b o y a t o r g e d </s> ] ("boyatorged")
0.0000000000 [ <s> m e e b r a i r w a r s t s </s> ] ("meebrairwarsts")
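
The generation procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program that produced the list: it uses raw trigram counts with no back-off or smoothing instead of the CMU-Cambridge toolkit, pads each word with two <s> symbols, and trains on a tiny made-up lexicon. All function and variable names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def to_corpus_line(word):
    """Write a spelling with a space between the letters: 'ped' -> 'p e d'."""
    return " ".join(word)

def train_trigrams(words):
    """Count which symbol follows each two-symbol history (plain counts, no back-off)."""
    counts = defaultdict(Counter)
    for word in words:
        symbols = ["<s>", "<s>"] + to_corpus_line(word).split() + ["</s>"]
        for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
            counts[(a, b)][c] += 1
    return counts

def generate(counts, rng, max_len=30):
    """Sample one letter 'sentence', picking each letter in proportion to its count."""
    history = ("<s>", "<s>")
    letters = []
    while len(letters) < max_len:
        options = counts[history]
        symbol = rng.choices(list(options), weights=list(options.values()))[0]
        if symbol == "</s>":
            break
        letters.append(symbol)
        history = (history[1], symbol)
    return "".join(letters)

# Toy stand-in for the ~400k-entry lexicon (assumption: any word list will do).
toy_lexicon = ["spring", "print", "sprint", "string", "strand", "trend"]
model = train_trigrams(toy_lexicon)
rng = random.Random(0)
samples = [generate(model, rng) for _ in range(5)]
print(samples)
```

With a lexicon this small most samples will simply be training words or close splices of them; the more interesting non-words only show up with a large source list.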
And some of the bad ones were more interesting, such as:
0.0007833181 [ <s> m c k </s> ] ("mck")
This is probably due to the ngram model's limited memory
for context. "#mc", "mck", and "ck#" all seem fairly common;
but put them together, and you get something that's impossible.
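
To make the limited-context point concrete, here is a small sketch that breaks "mck" into the trigram steps a letter model would score. It is an illustration only: the five-word lexicon is invented, the estimates are plain relative frequencies with no back-off, and the two-<s> padding is a convention chosen here, so the numbers have nothing to do with the figure above.

```python
from collections import Counter, defaultdict

def train_trigrams(words):
    """Count which symbol follows each two-symbol history."""
    counts = defaultdict(Counter)
    for word in words:
        symbols = ["<s>", "<s>"] + list(word) + ["</s>"]
        for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
            counts[(a, b)][c] += 1
    return counts

def trigram_factors(counts, word):
    """Return (trigram, relative frequency) for each step in spelling out `word`."""
    symbols = ["<s>", "<s>"] + list(word) + ["</s>"]
    factors = []
    for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
        seen = counts.get((a, b), Counter())
        total = sum(seen.values())
        factors.append(((a, b, c), seen[c] / total if total else 0.0))
    return factors

# Invented lexicon: "mck" itself never occurs as a whole word.
toy_lexicon = ["mcdonald", "mckay", "back", "duck", "mild"]
model = train_trigrams(toy_lexicon)

factors = trigram_factors(model, "mck")
probability = 1.0
for trigram, p in factors:
    print(" ".join(trigram), round(p, 3))
    probability *= p
print("product:", round(probability, 4))
```

Every step is supported by some word in the lexicon ("mcdonald" and "mckay" for the start, "mckay" for m-c-k, "back" and "duck" for the ending), so the product is nonzero even though no single word licenses the whole string. The model never sees more than two letters of history at once, so it cannot notice that the word has no vowel.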
But we may be delucked by the fribions re-offing and going hortly fallid.
- Bill F.