[Corpora-List] Reducing n-gram output

Detmar Meurers dm at ling.ohio-state.edu
Tue Oct 28 20:17:33 UTC 2008


Dear Irina,
    
    I was wondering whether anybody is aware of ideas and/or automated
    processes to reduce n-gram output by solving the common problem that
    shorter n-grams can be fragments of larger structures (e.g. the 5-gram
    'at the end of the' as part of the 6-gram 'at the end of the day')
    
on http://decca.osu.edu you can find the Python code that Markus
Dickinson, Adriane Boyd and I used for detecting errors in corpus
annotation. It implements a version of the Apriori algorithm to
efficiently compute the longest recurring n-grams in a corpus.  There
are also some papers there discussing the algorithm (the EACL'03 paper
is probably best since the annotations are irrelevant for your purposes).
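
To give a rough idea of the approach, here is a minimal sketch. It is
not the code from the site above; the function name
longest_recurring_ngrams, the min_count parameter and the rule "drop an
n-gram if a recurring (n+1)-gram containing it occurs equally often"
are assumptions made for illustration. It relies on the Apriori
observation that an (n+1)-gram can only recur if the n-gram it starts
with recurs, so only recurring n-grams are ever extended.

```python
from collections import defaultdict


def longest_recurring_ngrams(tokens, min_count=2):
    """Return {ngram: count} for recurring n-grams that are not mere
    fragments of a longer recurring n-gram with the same count."""
    # Level 1: start positions of every token type.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[(tok,)].append(i)
    current = {g: p for g, p in positions.items() if len(p) >= min_count}

    result = {}
    while current:
        # Apriori step: extend only the n-grams that recur.
        longer = defaultdict(list)
        for gram, starts in current.items():
            n = len(gram)
            for s in starts:
                if s + n < len(tokens):
                    longer[gram + (tokens[s + n],)].append(s)
        longer = {g: p for g, p in longer.items() if len(p) >= min_count}

        # An n-gram is a fragment if a recurring (n+1)-gram that
        # starts or ends with it has exactly the same count.
        covered = set()
        for gram, starts in longer.items():
            for sub in (gram[:-1], gram[1:]):
                if sub in current and len(current[sub]) == len(starts):
                    covered.add(sub)

        for gram, starts in current.items():
            if gram not in covered:
                result[gram] = len(starts)
        current = longer
    return result


if __name__ == "__main__":
    text = "at the end of the day we left and at the end of the day we slept"
    for gram, count in longest_recurring_ngrams(text.split()).items():
        print(count, " ".join(gram))
```

On this toy input, 'at the end of the' is dropped because it only ever
occurs inside the longer 'at the end of the day we', while 'the' is
kept because it also occurs in other positions. If you would rather
drop every n-gram contained in a longer recurring one regardless of
frequency, remove the count comparison.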

Best,
Detmar

--
Prof. Dr. Detmar Meurers, Universität Tübingen       http://purl.org/dm
Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


