[Corpora-List] N-gram string extraction

Tue Aug 27 14:16:54 UTC 2002

Dear list members,

I am currently working on the extraction of statistically significant n-gram
(1<n<6) strings of alphanumeric characters from a 100-million-character
corpus, and I intend to apply different significance tests (MI, t-score,
log-likelihood, etc.) to these strings. I'm testing Ted Pedersen's N-gram
Statistics Package, which seems able to perform these tasks; however,
it hasn't produced any results after one week of running.
I have a couple of queries regarding n-gram extraction:
1. Are members of the list aware of similar software capable of
accomplishing the above-mentioned tasks reliably and efficiently?
2. A statistical question. As I need to compute association scores for
trigrams, tetragrams, and pentagrams as well, I plan to split each of them
into a bigram consisting of a string of words plus one word, [n-1]+[1], and
compute association scores for these bigrams.
Does anyone know if this is the right thing to do from a statistical point
of view?
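
To make the [n-1]+[1] split in question 2 concrete, here is a minimal Python sketch of what I have in mind. It only illustrates the mechanics and is not a claim that the approach is statistically sound. The function names (ngram_counts, split_pmi) are hypothetical, pointwise mutual information stands in for MI, and the marginal counts are taken over n-gram positions, which is an assumption on my part.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def split_pmi(tokens, n):
    """Score each n-gram as a pseudo-bigram: [first n-1 words] + [last word].

    Returns a dict mapping each n-gram to log2(observed / expected), where
    expected = count(prefix) * count(last word) / total n-grams.
    """
    full = ngram_counts(tokens, n)
    total = sum(full.values())

    # Marginal counts over the same n-gram positions, so that the
    # observed and expected frequencies share one sample size.
    prefix = Counter()
    last = Counter()
    for gram, count in full.items():
        prefix[gram[:-1]] += count
        last[gram[-1]] += count

    scores = {}
    for gram, count in full.items():
        expected = prefix[gram[:-1]] * last[gram[-1]] / total
        scores[gram] = math.log2(count / expected)
    return scores


if __name__ == "__main__":
    toy = "a b c a b c a b d".split()
    for gram, score in sorted(split_pmi(toy, 3).items()):
        print(" ".join(gram), round(score, 3))
```

The same marginal counts could be fed into a t-score or log-likelihood formula instead. The split could equally be made as [1]+[n-1], and the two splits need not give the same scores.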

Thank you,
Andrius Utka

Research Assistant
Centre for Corpus Linguistics
University of Birmingham