[Corpora-List] Reducing n-gram output
Yannick Versley
versley at sfs.uni-tuebingen.de
Tue Oct 28 14:19:20 UTC 2008
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
The technical problem here (checking whether there are n-grams with larger n
that contain this substring) is not really complicated - so the essential
question is what you want to achieve with it, and this would give you an idea
of the criteria to use for discarding smaller-n n-grams.
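To illustrate how simple the containment check is, here is a minimal sketch in
Python. It assumes n-grams are represented as tuples of tokens; the function
names contains and superstrings are made up for this example.

```python
def contains(longer, shorter):
    """True if the token tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def superstrings(ngram, candidates):
    """All strictly longer candidate n-grams that contain `ngram`."""
    return [c for c in candidates
            if len(c) > len(ngram) and contains(c, ngram)]


five = tuple("at the end of the".split())
six = tuple("at the end of the day".split())
other = tuple("in the middle of the night".split())

# Only `six` contains the 5-gram as a contiguous token sequence.
matches = superstrings(five, [six, other])
```

For a large n-gram list, this pairwise scan would be slow; indexing the longer
n-grams by their sub-sequences would be one way around that.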
Based on statistics like frequency, mutual information, or distribution of the
n-grams, you could discard the smaller-n n-gram if:
* its frequency is equal to that of the larger-n n-gram (i.e., all occurrences
of the smaller n-gram are actually part of the larger n-gram in the corpus)
* the frequency of the larger n-gram is at least (some value)*the frequency of
the smaller n-gram (e.g., at least 80% of the smaller n-gram occurrences are
part of the larger n-gram)
* the mutual information for the larger n-gram is greater than that for the
smaller n-gram plus some adjustment
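The two frequency-based criteria could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the corpus is
a flat list of tokens, the names count_ngrams and is_subsumed are invented
here, and the 0.8 threshold is just the example value from above.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Frequency of every contiguous n-gram (as a token tuple) in `tokens`."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def is_subsumed(short_freq, long_freq, threshold=1.0):
    """Discard test: the larger n-gram accounts for at least `threshold`
    (as a fraction) of the smaller n-gram's occurrences.
    threshold=1.0 corresponds to the equal-frequency criterion."""
    return long_freq >= threshold * short_freq


tokens = "at the end of the day we met at the end of the day".split()
five = tuple("at the end of the".split())
six = five + ("day",)

f5 = count_ngrams(tokens, 5)[five]
f6 = count_ngrams(tokens, 6)[six]

# Every occurrence of the 5-gram is part of the 6-gram in this toy corpus,
# so the 5-gram would be discarded under the equal-frequency criterion.
discard = is_subsumed(f5, f6)
```

With real data you would call is_subsumed(f_short, f_long, threshold=0.8) for
each smaller n-gram and each larger n-gram that contains it.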
I think it really makes sense to (a) go the whole way and approximate what you
want as well as you reasonably can and (b) explicitly reason about what you
are approximating with it, since data-driven approaches like this can easily
lead onto the slippery slope to cargo-cult science where people blindly use
nontrivial tool X to solve a simple problem Y that actually has good
solutions somewhere else (e.g., X=compression programs, Y=language modeling,
where the speech community has been working for decades on n-gram- and
syntax-based language models which also do a much better job at it).
You might want to look at the research of Douglas Biber, who uses n-grams with
some additional information and calls them "lexical bundles".
e.g.: http://applij.oxfordjournals.org/cgi/content/abstract/25/3/371
Biber/Conrad/Cortes "If you look at ...: Lexical Bundles in University
Teaching and Textbooks"
Best wishes,
--
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352