[Corpora-List] Reducing n-gram output
J Washtell
lec3jrw at leeds.ac.uk
Tue Oct 28 13:20:36 UTC 2008
Irina,
There are compression algorithms, such as LZW, that work on this
principle. In theory you can "optimally" compress a text by
recursively storing all strings as combinations of the most common
substrings and so on, and using a minimal number of bits for the most
common substrings (Huffman encoding). In practice it's a bit
problematic because you don't know what the most common substrings are
until you've examined the entire text, so efficient compression
algorithms based on this principle tend to choose the substrings by
which to represent a string based only on what they've seen so far,
and settle for being sub-optimal in exchange for working in linear
time and with limited working memory requirements.
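For what it's worth, here is a minimal Python sketch of the LZW idea
(illustrative only: it seeds the dictionary from the characters actually
present in the text rather than the full byte alphabet, and it returns the
codes as plain integers rather than packing them into bits):

    def lzw_compress(text):
        # Seed the dictionary with every single character that occurs in
        # the text (a real LZW starts from the full byte alphabet instead).
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        codes = []
        current = ""
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                # Keep growing the match while it is still a known substring.
                current = candidate
            else:
                # Emit the code for the longest known substring seen so far,
                # and remember the new, longer substring for future use.
                codes.append(dictionary[current])
                dictionary[candidate] = len(dictionary)
                current = ch
        if current:
            codes.append(dictionary[current])
        return codes, dictionary

    text = "at the end of the day at the end of the week"
    codes, table = lzw_compress(text)
    print(len(text), "characters encoded as", len(codes), "codes")

Note how the dictionary is built greedily from what has already been seen,
which is exactly the sub-optimal-but-linear-time trade-off described above.
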
A benefit of this general approach is that the method's predictive
power, and hence the compression ratio, increases as you store more
text, since language tends to be re-used. In the extreme case, for
example, you can encode multiple revisions of the same document
incredibly efficiently, as an edited document is made up mostly of
large chunks of the previous version which you have already stored.
There is code out there for LZW compression and its relatives, and you
could quite easily implement your own, adapted to the
language-oriented tasks I imagine you have in mind.
From a language modelling perspective, it would be preferable to go
the whole hog and ascertain the frequencies of every n-gram from n=1
upwards, prior to compression. If you start at the character level,
rather than the word level, then you get morphological analysis for
free! You would end up with a nifty language model from which you
could estimate the probability of a given (unseen) string, or identify
morphologically remarkable words or unusual idiomatic phrases, by
comparing their frequency of occurrence with the combined
probabilities of their most probable substrings. It might be even more
pertinent to sum the probabilities of *all* possible substring
interpretations of a string, but that would probably be a huge
computational task.
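To give a rough idea, here is an illustrative Python sketch along those
lines. It is only a toy: it uses a crude greedy longest-match in place of a
proper search for the most probable substrings, and raw relative frequencies
over all counted n-grams rather than a properly normalised model.

    from collections import Counter

    def char_ngram_counts(text, max_n=5):
        # Count every character n-gram from n=1 up to max_n.
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        return counts

    def estimate_probability(s, counts, max_n=5):
        # Greedy approximation: repeatedly take the longest known substring
        # and multiply in its relative frequency. Summing over *all*
        # possible segmentations, as mentioned above, would be far more
        # expensive.
        total = sum(counts.values())
        prob, i = 1.0, 0
        while i < len(s):
            for n in range(min(max_n, len(s) - i), 0, -1):
                piece = s[i:i + n]
                if piece in counts:
                    prob *= counts[piece] / total
                    i += n
                    break
            else:
                return 0.0  # contains a character never seen in the corpus
        return prob

    corpus = "at the end of the day we left, at the end of the week we returned"
    counts = char_ngram_counts(corpus)
    print(estimate_probability("at the end", counts))
    print(estimate_probability("the end of xz", counts))

Comparing a string's observed frequency against an estimate like this is one
way of flagging the "remarkable" words or phrases mentioned above.
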
Regards,
Justin Washtell
University of Leeds
Quoting Dahlmann Irina <aexid at nottingham.ac.uk>:
> Dear all,
>
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>
> Many thanks,
>
> Irina Dahlmann
>
> PhD student
> School of English Studies
> University of Nottingham
> aexid at nottingham.ac.uk
>
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora