[Corpora-List] Reducing n-gram output
J Washtell
lec3jrw at leeds.ac.uk
Tue Oct 28 13:20:36 UTC 2008
Irina,
There are compression algorithms, such as LZW, that work on this
principle. In theory you can "optimally" compress a text by
recursively storing all strings as combinations of the most common
substrings and so on, and using a minimal number of bits for the most
common substrings (Huffman encoding). In practice it's a bit
problematic because you don't know what the most common substrings are
until you've examined the entire text, so efficient compression
algorithms based on this principle tend to choose the substrings by
which to represent a string based only on what they've seen so far,
and settle for being sub-optimal in exchange for working in linear
time and with limited working memory requirements.
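For what it's worth, here is a minimal Python sketch of the LZW idea
(illustrative only: it seeds the dictionary from the characters actually
present in the text rather than the full byte alphabet, and it returns the
codes as plain integers rather than packing them into bits):

    def lzw_compress(text):
        # Seed the dictionary with every single character that occurs in
        # the text (a real LZW starts from the full byte alphabet instead).
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        codes = []
        current = ""
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                # Keep growing the match while it is still a known substring.
                current = candidate
            else:
                # Emit the code for the longest known substring seen so far,
                # and remember the new, longer substring for future use.
                codes.append(dictionary[current])
                dictionary[candidate] = len(dictionary)
                current = ch
        if current:
            codes.append(dictionary[current])
        return codes, dictionary

    text = "at the end of the day at the end of the week"
    codes, table = lzw_compress(text)
    print(len(text), "characters encoded as", len(codes), "codes")

Note how the dictionary is built greedily from what has already been seen,
which is exactly the sub-optimal-but-linear-time trade-off described above.
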
A benefit of this general approach is that the method's predictive
power, and hence the compression ratio, increases as you store more
text, since language tends to be re-used. In the extreme case, for
example, you can encode multiple revisions of the same document
incredibly efficiently, as an edited document is made up mostly of
large chunks of the previous version which you have already stored.
There is code out there for LZW compression and its relatives, and you
could quite easily implement your own, adapted to the
language-oriented tasks I imagine you have in mind.
From a language modelling perspective, it would be preferable to go
the whole hog and ascertain the frequencies of every n-gram from n=1
upwards, prior to compression. If you start at the character level,
rather than the word level, then you get morphological analysis for
free! You would end up with a nifty language model from which you
could estimate the probability of a given (unseen) string, or identify
morphologically remarkable words or unusual idiomatic phrases, by
comparing their frequency of occurrence with the combined
probabilities of their most probable substrings. It might be even more
pertinent to sum the probabilities of *all* possible substring
interpretations of a string, but that would probably be a huge
computational task.
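To give a rough idea, here is an illustrative Python sketch along those
lines. It is only a toy: it uses a crude greedy longest-match in place of a
proper search for the most probable substrings, and raw relative frequencies
over all counted n-grams rather than a properly normalised model.

    from collections import Counter

    def char_ngram_counts(text, max_n=5):
        # Count every character n-gram from n=1 up to max_n.
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        return counts

    def estimate_probability(s, counts, max_n=5):
        # Greedy approximation: repeatedly take the longest known substring
        # and multiply in its relative frequency. Summing over *all*
        # possible segmentations, as mentioned above, would be far more
        # expensive.
        total = sum(counts.values())
        prob, i = 1.0, 0
        while i < len(s):
            for n in range(min(max_n, len(s) - i), 0, -1):
                piece = s[i:i + n]
                if piece in counts:
                    prob *= counts[piece] / total
                    i += n
                    break
            else:
                return 0.0  # contains a character never seen in the corpus
        return prob

    corpus = "at the end of the day we left, at the end of the week we returned"
    counts = char_ngram_counts(corpus)
    print(estimate_probability("at the end", counts))
    print(estimate_probability("the end of xz", counts))

Comparing a string's observed frequency against an estimate like this is one
way of flagging the "remarkable" words or phrases mentioned above.
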
Regards,
Justin Washtell
University of Leeds
Quoting Dahlmann Irina <aexid at nottingham.ac.uk>:
> Dear all,
>
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>
> Many thanks,
>
> Irina Dahlmann
>
> PhD student
> School of English Studies
> University of Nottingham
> aexid at nottingham.ac.uk
>
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora