[Corpora-List] Reducing n-gram output

Yannick Versley versley at sfs.uni-tuebingen.de
Tue Oct 28 14:19:20 UTC 2008


> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
The technical problem here (checking whether there are n-grams with larger n that 
contain a given n-gram as a substring) is not really complicated - so the essential 
question is what you want to achieve with it, and that would give you an idea of 
the criteria to use for discarding smaller-n n-grams.
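
Concretely, an (n+1)-gram contains exactly two n-grams (its prefix and its suffix), 
so the containment check amounts to a couple of dictionary lookups. A minimal Python 
sketch, assuming you already have the counts as dicts keyed by token tuples (the 
function name is just illustrative, not from any particular toolkit):

from collections import defaultdict

def containing_ngrams(ngram_counts, ngram_plus1_counts):
    """Map each n-gram to the (n+1)-grams in which it occurs as a substring."""
    contained_in = defaultdict(list)
    for bigger in ngram_plus1_counts:
        # the only n-grams inside an (n+1)-gram are its prefix and its suffix
        for smaller in (bigger[:-1], bigger[1:]):
            if smaller in ngram_counts:
                contained_in[smaller].append(bigger)
    return contained_in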

Based on statistics such as frequency, mutual information, or the distribution of the 
n-grams, you could discard the smaller-n n-gram if (see the sketch below for the 
first two criteria):
* its frequency is equal to that of the larger-n n-gram (i.e., all occurrences 
of the smaller n-gram are actually part of the larger n-gram in the corpus)
* the frequency of the larger n-gram is at least (some threshold) times its 
frequency (e.g., at least 80% of the smaller n-gram's occurrences are part of the 
larger n-gram)
* the mutual information of the larger n-gram is greater than that of the 
smaller n-gram plus some adjustment
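
As a rough sketch of the first two criteria (not the mutual-information one), again 
assuming frequency dicts keyed by token tuples; threshold=1.0 gives the strict 
"equal frequency" criterion, threshold=0.8 the relaxed 80% version:

def redundant_ngrams(ngram_counts, ngram_plus1_counts, threshold=0.8):
    """Return the n-grams whose occurrences are (mostly) accounted for
    by some containing (n+1)-gram and can therefore be discarded."""
    redundant = set()
    for bigger, big_freq in ngram_plus1_counts.items():
        for smaller in (bigger[:-1], bigger[1:]):
            small_freq = ngram_counts.get(smaller, 0)
            if small_freq and big_freq >= threshold * small_freq:
                redundant.add(smaller)
    return redundant

# Toy example: 'at the end of the' occurs 100 times, 85 of them inside
# 'at the end of the day', so it is discarded at the 0.8 threshold.
five_grams = {('at', 'the', 'end', 'of', 'the'): 100}
six_grams = {('at', 'the', 'end', 'of', 'the', 'day'): 85}
print(redundant_ngrams(five_grams, six_grams))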
I think it really makes sense to (a) go the whole way and approximate what you 
want as well as you reasonably can and (b) explicitly reason about what you 
are approximating with it, since data-driven approaches like this can easily 
lead onto the slippery slope to cargo-cult science, where people blindly use a 
nontrivial tool X to solve a simple problem Y that actually has good 
solutions elsewhere (e.g., X=compression programs, Y=language modeling, 
where the speech community has been working for decades on n-gram- and 
syntax-based language models that also do a much better job of it).

You might want to look at the research of Douglas Biber, who uses n-grams with 
some additional information and calls them "lexical bundles".
e.g.: http://applij.oxfordjournals.org/cgi/content/abstract/25/3/371
Biber/Conrad/Cortes "If you look at ...: Lexical Bundles in University 
Teaching and Textbooks"

Best wishes,
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352
