[Corpora-List] Reducing n-gram output

Adam Lopez alopez at inf.ed.ac.uk
Tue Oct 28 15:21:52 UTC 2008


> I was wondering whether anybody is aware of ideas and/or automated  
> processes to reduce n-gram output by solving the common problem that  
> shorter n-grams can be fragments of larger structures (e.g. the 5- 
> gram 'at the end of the' as part of the 6-gram 'at the end of the  
> day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).

Suffix trees, suffix arrays, and their relatives are compact data  
structures following exactly the intuition that smaller strings are  
substrings of larger ones. They represent all possible n-grams  
(without limit on n) of a text in space proportional to the length of  
the text and support efficient retrieval, counting, and other queries  
on substrings of the text; there is a vast literature on their various  
applications (and theory linking them to compressibility, etc.).

Whether they are useful for you depends on: 1) your application, and  
2) whether your corpus fits in memory (the development of efficient  
external data structures with these properties is an active research  
area).


Adam

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora



More information about the Corpora mailing list