[Corpora-List] Reducing n-gram output
Adam Lopez
alopez at inf.ed.ac.uk
Tue Oct 28 15:21:52 UTC 2008
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-
> gram 'at the end of the' as part of the 6-gram 'at the end of the
> day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
Suffix trees, suffix arrays, and their relatives are compact data
structures following exactly the intuition that smaller strings are
substrings of larger ones. They represent all possible n-grams
(without limit on n) of a text in space proportional to the length of
the text and support efficient retrieval, counting, and other queries
on substrings of the text; there is a vast literature on their various
applications (and theory linking them to compressibility, etc.).
Whether they are useful for you depends on: 1) your application, and
2) whether your corpus fits in memory (the development of efficient
external data structures with these properties is an active research
area).
Adam
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
More information about the Corpora
mailing list