[Corpora-List] ANC, FROWN, Fuzzy Logic

John Goldsmith goldsmith at uchicago.edu
Mon Jul 24 19:05:49 UTC 2006


Daoud Clarke wrote:
>It would be extremely interesting however to see whether the use of
>linguistic theories can help provide better text compression. To my
>awareness this has not been looked into.

Several researchers have used the reduction in total description length that
results from morphological analysis to justify the existence of morphology
(including me: see my 2001 paper in Computational Linguistics, and our
website at linguistica.uchicago.edu). At a crude level, it is clear that the
redundancy in lists of words -- for example, treating jumps, jumped,
jumping, laughs, laughed, laughing all as separate and unrelated words in
the lexicon of English -- leads to a longer description of an English corpus
than one in which there is a list of stems and affixes, and some machinery
that explicitly indicates how they may be composed in the language in
question. The devil is in the details, and there has been a lot of work in
this area over the last half dozen years. 
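
To make the crude-level comparison concrete, here is a minimal sketch in
Python. It is only an illustration under simplifying assumptions, not the
method of the 2001 paper or of Linguistica: every letter is charged a flat
log2(26) bits, and the cost of delimiters, of the pointers from stems to
suffix sets, and of encoding the corpus itself is ignored. The names
(letter_count, lexicon_bits, generate) are hypothetical.

```python
import math

# The six word forms from the example above.
WORDS = ["jumps", "jumped", "jumping", "laughs", "laughed", "laughing"]

# A hand-made analysis into stems and suffixes (assumed, not learned).
STEMS = ["jump", "laugh"]
SUFFIXES = ["s", "ed", "ing"]

# Assumption: each letter costs the same, log2(26) bits.
BITS_PER_LETTER = math.log2(26)


def letter_count(items):
    """Total number of letters needed to spell out every item."""
    return sum(len(item) for item in items)


def lexicon_bits(*lists):
    """Crude description length: letters in all lists times a flat rate."""
    return sum(letter_count(lst) for lst in lists) * BITS_PER_LETTER


def generate(stems, suffixes):
    """The 'machinery': every stem combines with every suffix."""
    return sorted(stem + suffix for stem in stems for suffix in suffixes)


flat_bits = lexicon_bits(WORDS)                # words as unrelated items
analyzed_bits = lexicon_bits(STEMS, SUFFIXES)  # stems plus affixes

print(f"flat word list:   {letter_count(WORDS)} letters, {flat_bits:.1f} bits")
print(f"stems + suffixes: {letter_count(STEMS) + letter_count(SUFFIXES)} "
      f"letters, {analyzed_bits:.1f} bits")
print("same vocabulary:", generate(STEMS, SUFFIXES) == sorted(WORDS))
```

Under these assumptions the flat list spells out 39 letters and the
stem-and-suffix lexicon 15, while both describe the same six words. A real
comparison would also have to pay for the machinery that says which stems
take which suffixes, which is where the details start to matter.
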
John Goldsmith 


