Corpora: grammar of English letter combinations

Bill Fisher william.fisher at nist.gov
Wed May 24 17:08:24 UTC 2000


  Doing some clean-up today, I ran across some
work that I did in February of last year that
looks very similar to Geoffrey's; sorry I didn't
remember it when his first query went out.

  I used a large union pronlex (a merged pronunciation
lexicon, ~400k entries) that we have put together here
as the source of word spellings, which were written
out to a corpus file with a space between the letters,
so that each letter counts as a "word".  This was then
fed into the standard utilities of the CMU-Cambridge
Statistical Language Modeling Toolkit to produce a
backed-off trigram language model.  A program of mine
then generated "sentences" at random, choosing each
successive "word" (letter) according to the model's
probabilities.  Here are some of the better non-English
(afaik) results, preceded by their estimated
probabilities per letter:

# PR/NWORDS   SENTENCE ...
0.0134141155  [ <s> p e d </s> ] ("ped")
0.0044879812  [ <s> p o n </s> ] ("pon")
0.0038318857  [ <s> a c k </s> ] ("ack")
0.0020768188  [ <s> z o </s> ] ("zo")
0.0017731462  [ <s> a s t </s> ] ("ast")
0.0011737253  [ <s> p r i n g </s> ] ("pring")
0.0005763704  [ <s> c o m y </s> ] ("comy")
0.0003478832  [ <s> w e l l y </s> ] ("welly")
0.0002926426  [ <s> g l i n g </s> ] ("gling")
0.0002905756  [ <s> w o o n </s> ] ("woon")
0.0001774358  [ <s> c o r t s </s> ] ("corts")
0.0001257791  [ <s> t r a n d </s> ] ("trand")
0.0000691862  [ <s> f l a d </s> ] ("flad")
0.0000521828  [ <s> d e c t i o n </s> ] ("dection")
0.0000517939  [ <s> u n k i n g </s> ] ("unking")
0.0000360357  [ <s> m i s l y </s> ] ("misly")
0.0000355770  [ <s> d e n t i o n </s> ] ("dention")
0.0000339071  [ <s> s a r i c </s> ] ("saric")
0.0000275131  [ <s> h a n c h </s> ] ("hanch")
0.0000201202  [ <s> h a i s m </s> ] ("haism")
0.0000125679  [ <s> p a r g e ' s </s> ] ("parge's")
0.0000069366  [ <s> t u t i c </s> ] ("tutic")
0.0000055470  [ <s> p e n i s m </s> ] ("penism")
0.0000054817  [ <s> h o r t l y </s> ] ("hortly")
0.0000050649  [ <s> r e - o f f </s> ] ("re-off")
0.0000030477  [ <s> p y r o f f </s> ] ("pyroff")
0.0000021522  [ <s> m a b s t </s> ] ("mabst")
0.0000010393  [ <s> w h a r c h </s> ] ("wharch")
0.0000009782  [ <s> c h e m i s m ' s </s> ] ("chemism's")
0.0000006504  [ <s> f a l l i d </s> ] ("fallid")
0.0000006480  [ <s> d e l u c k </s> ] ("deluck")
0.0000001139  [ <s> f r i b i o n s </s> ] ("fribions")
0.0000000390  [ <s> e x p a g e l </s> ] ("expagel")
0.0000000183  [ <s> p s i o l e s ' </s> ] ("psioles'")
0.0000000168  [ <s> v a x i n a </s> ] ("vaxina")
0.0000000038  [ <s> c a t t r o m e d </s> ] ("cattromed")
0.0000000021  [ <s> n a t i c i v i n g </s> ] ("naticiving")
0.0000000011  [ <s> h a f t - k o </s> ] ("haft-ko")
0.0000000004  [ <s> d i t h e = o u t </s> ] ("dithe=out")
0.0000000001  [ <s> b o y a t o r g e d </s> ] ("boyatorged")
0.0000000000  [ <s> m e e b r a i r w a r s t s </s> ] ("meebrairwarsts")
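
  For readers who want to try something similar, the steps
described above (spellings written as spaced letters, trigram
counts, random generation weighted by those counts) can be
sketched roughly as follows.  This is not the program used for
the list above: it uses raw counts with no back-off or
smoothing, it pads the start with two <s> symbols for
simplicity, and the function names and the tiny word list are
made up for illustration.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"

def to_corpus_line(word):
    """Write a spelling as space-separated letters, e.g. 'ped' -> 'p e d'."""
    return " ".join(word)

def train_trigrams(words):
    """Count letter trigrams: counts[(a, b)][c] is how often c followed a, b."""
    counts = defaultdict(Counter)
    for word in words:
        # Two start symbols (an assumption of this sketch) so that the
        # first letter also has a two-symbol context.
        symbols = [BOS, BOS] + to_corpus_line(word).split() + [EOS]
        for a, b, c in zip(symbols, symbols[1:], symbols[2:]):
            counts[(a, b)][c] += 1
    return counts

def generate(counts, rng, max_len=30):
    """Sample one letter string, picking each symbol in proportion to its count."""
    context = (BOS, BOS)
    letters = []
    while len(letters) < max_len:
        options = counts[context]
        symbol = rng.choices(list(options), weights=list(options.values()))[0]
        if symbol == EOS:
            break
        letters.append(symbol)
        context = (context[1], symbol)
    return "".join(letters)

if __name__ == "__main__":
    # Hypothetical toy word list; the original used a ~400k-entry lexicon.
    toy_words = ["spring", "print", "pond", "pack", "stack", "trend", "strand"]
    model = train_trigrams(toy_words)
    rng = random.Random(0)
    for _ in range(5):
        print(generate(model, rng))
```

  With so few training words most of the output will simply be
words from the list or recombinations of their pieces; a large
lexicon is what makes the novel forms interesting.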

  And some of the bad ones were more interesting, such as:

0.0007833181  [ <s> m c k </s> ] ("mck")

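  This is probably due to the n-gram model's limited memory
for context: a trigram model conditions each letter only on
the two before it.  "#mc", "mck", and "ck#" (with "#" marking
a word boundary) all seem fairly common; but put them
together, and you get something that's impossible.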
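
  A small sketch of that effect, assuming "#" as the boundary
marker and a made-up four-word list (neither is taken from the
real lexicon): every three-letter window of "mck" is attested
somewhere, so a model that only ever looks at such windows has
no way to reject the whole string.

```python
from collections import Counter

def letter_trigrams(word, pad="#"):
    """Return the three-letter windows of a word padded with boundary marks."""
    s = pad + word + pad
    return [s[i:i + 3] for i in range(len(s) - 2)]

# Hypothetical toy lexicon, chosen only to illustrate the point.
toy_words = ["mckay", "mcgee", "back", "pack"]
seen = Counter(t for w in toy_words for t in letter_trigrams(w))

candidate = "mck"
windows = letter_trigrams(candidate)

print(windows)                             # ['#mc', 'mck', 'ck#']
print(candidate in toy_words)              # False
print(all(seen[t] > 0 for t in windows))   # True
```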

 But we may be delucked by the fribions re-offing and going hortly fallid.

 - Bill F.


