[Corpora-List] Reducing n-gram output
Detmar Meurers
dm at ling.ohio-state.edu
Tue Oct 28 20:17:33 UTC 2008
Dear Irina,
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day').
On http://decca.osu.edu you can find the Python code Markus Dickinson,
Adriane Boyd and I used for detecting errors in corpus annotation,
which implements a version of the Apriori algorithm to efficiently
compute the longest recurring n-grams in a corpus. There are also
some papers there discussing the algorithm (the EACL'03 paper is
probably the best one to start with, since the annotations are
irrelevant for your purposes).
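
To give a rough idea of the approach, here is a minimal sketch of an
Apriori-style search for the longest recurring n-grams. It is not the
code from the DECCA site: the function name longest_recurring_ngrams
and the min_count parameter are made up for illustration, and the
input is assumed to be a plain list of tokens without annotation. It
relies on the observation that an (n+1)-gram can only recur if its
n-gram prefix recurs, so only recurring n-grams are ever extended.

```python
from collections import defaultdict


def longest_recurring_ngrams(tokens, min_count=2):
    """Map each recurring n-gram that is not a fragment of a longer
    recurring n-gram to its number of occurrences.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Level 1: start positions of every token that recurs.
    level = defaultdict(list)
    for i, tok in enumerate(tokens):
        level[(tok,)].append(i)
    level = {g: s for g, s in level.items() if len(s) >= min_count}

    result = {}
    while level:
        n = len(next(iter(level)))  # all n-grams in a level share n
        # Extend each recurring n-gram by one token to the right.
        longer = defaultdict(list)
        for gram, starts in level.items():
            for s in starts:
                if s + n < len(tokens):
                    longer[gram + (tokens[s + n],)].append(s)
        longer = {g: s for g, s in longer.items() if len(s) >= min_count}

        # The prefix and suffix of a recurring (n+1)-gram are fragments
        # of it, so they are left out of the result.
        covered = {g[:-1] for g in longer} | {g[1:] for g in longer}
        for gram, starts in level.items():
            if gram not in covered:
                result[gram] = len(starts)
        level = longer
    return result


if __name__ == "__main__":
    text = "at the end of the day we met at the end of the day"
    for gram, count in longest_recurring_ngrams(text.split()).items():
        print(count, " ".join(gram))
```

Note that this sketch drops a shorter n-gram whenever any longer
recurring n-gram contains it, even if the shorter one occurs more
often on its own. If you want to keep such cases, you could compare
the counts and drop the shorter n-gram only when they are equal.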
Best,
Detmar
--
Prof. Dr. Detmar Meurers, Universität Tübingen http://purl.org/dm
Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany