[Corpora-List] Reducing n-gram output

Tue Oct 28 20:17:33 UTC 2008

Dear Irina,

    I was wondering whether anybody is aware of ideas and/or automated
    processes to reduce n-gram output by solving the common problem that
    shorter n-grams can be fragments of larger structures (e.g. the 5-gram
    'at the end of the' as part of the 6-gram 'at the end of the day')

At http://decca.osu.edu you can find the Python code that Markus
Dickinson, Adriane Boyd and I used for detecting errors in corpus
annotation. It implements a version of the a priori algorithm to
efficiently compute the longest recurring n-grams in a corpus. There
are also some papers there discussing the algorithm (the EACL'03 paper
is probably best since the annotations are irrelevant for your purposes).
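
To make the idea concrete, here is a minimal sketch in Python of an
a priori style search for the longest recurring n-grams. It is not the
DECCA code and makes no claim to match it: the function name
longest_recurring_ngrams, the min_count parameter, and the rule that an
n-gram counts as a fragment whenever it is a prefix or suffix of a
recurring (n+1)-gram are all assumptions made for illustration. The
sketch relies on the fact that every part of a recurring n-gram must
itself recur, so only recurring n-grams need to be extended.

```python
from collections import defaultdict


def longest_recurring_ngrams(tokens, min_count=2):
    """Return {ngram: count} for recurring n-grams that are not part of
    a longer recurring n-gram (hypothetical helper, illustration only)."""
    # Level 1: collect the start positions of every unigram.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[(tok,)].append(i)
    current = {g: p for g, p in positions.items() if len(p) >= min_count}

    maximal = {}
    while current:
        # Extend each recurring n-gram by one token to the right.
        extended = defaultdict(list)
        for gram, starts in current.items():
            n = len(gram)
            for s in starts:
                if s + n < len(tokens):
                    extended[gram + (tokens[s + n],)].append(s)
        # Keep only the (n+1)-grams that recur often enough.
        longer = {g: p for g, p in extended.items() if len(p) >= min_count}

        # An n-gram is a fragment if it is the prefix or the suffix of
        # some recurring (n+1)-gram.
        covered = {g[:-1] for g in longer} | {g[1:] for g in longer}
        for gram, starts in current.items():
            if gram not in covered:
                maximal[gram] = len(starts)
        current = longer
    return maximal


if __name__ == "__main__":
    text = "at the end of the day we met and at the end of the day we left"
    for gram, count in longest_recurring_ngrams(text.split()).items():
        print(count, " ".join(gram))
```

Note that this sketch discards a shorter n-gram even when it occurs
more often than the longer one that contains it. If such n-grams
should be kept, one option is to treat an n-gram as a fragment only
when the longer n-gram has the same count.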

Best,
Detmar

--
Prof. Dr. Detmar Meurers, Universität Tübingen       http://purl.org/dm
Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany
