[Corpora-List] N-gram string extraction

Klas Prytz klas.prytz at ling.uu.se
Tue Aug 27 14:39:45 UTC 2002


Hi,

Just one question: what is a significant n-gram?
In relation to what?

Regards

Klas Prytz
Institutionen för lingvistik
Uppsala universitet


On Tue, 27 Aug 2002 andrius at ccl.bham.ac.uk wrote:

> Dear list members,
> 
> I am currently working on extracting statistically significant n-gram
> (1<n<6) strings of alphanumeric characters from a 100-million-character
> corpus, and I intend to apply different significance tests (MI, t-score,
> log-likelihood, etc.) to these strings. I'm testing Ted Pedersen's N-gram
> Statistics Package, which seems able to perform these tasks; however,
> it hasn't produced any results after one week of running.
> I have a couple of queries regarding n-gram extraction:
> 1. I'd like to ask whether members of the list are aware of similar software
> capable of accomplishing the above-mentioned tasks reliably and
> efficiently.
> 2. And a statistical question. As I need to compute association scores for
> trigrams, tetragrams, and pentagrams as well, I plan to split them into
> bigrams consisting of a string of words plus one word, [n-1]+[1], and
> compute association scores for them.
> Does anyone know whether this is the right thing to do from a statistical
> point of view?
> 
> Thank you,
> Andrius Utka
> 
> Research Assistant
> Centre for Corpus Linguistics
> University of Birmingham
> 
> 
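For readers following the thread, the [n-1]+[1] split described in the quoted
message can be sketched roughly as below. This is only an illustration under
stated assumptions, not the method of the N-gram Statistics Package or of
either poster: it treats the units as characters inside alphanumeric runs
(word tokens could be substituted), it uses pointwise mutual information as
the MI score, and it takes the marginal counts from the n-gram table itself.
The function names char_ngrams and split_pmi are hypothetical.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n):
    """Yield every n-character window lying within one alphanumeric run."""
    # Assumption: ASCII letters and digits only; n-grams never span a gap.
    for run in re.findall(r"[A-Za-z0-9]+", text):
        for i in range(len(run) - n + 1):
            yield run[i:i + n]


def split_pmi(text, n):
    """Score each n-gram as the pair (first n-1 units, last unit) using PMI."""
    grams = Counter(char_ngrams(text, n))
    total = sum(grams.values())
    heads = Counter()  # marginal counts of the [n-1] part
    tails = Counter()  # marginal counts of the [1] part
    for gram, count in grams.items():
        heads[gram[:-1]] += count
        tails[gram[-1]] += count
    scores = {}
    for gram, count in grams.items():
        # Count expected if the head and the tail were independent.
        expected = heads[gram[:-1]] * tails[gram[-1]] / total
        scores[gram] = math.log2(count / expected)
    return scores


if __name__ == "__main__":
    for gram, score in sorted(split_pmi("abab abab", 2).items()):
        print(gram, round(score, 3))
```

The sketch only shows how such a score would be computed; it does not settle
whether collapsing the first n-1 units into a single event is statistically
sound, which is the question raised in the quoted message. The same observed
and expected counts could be fed into a t-score or log-likelihood calculation
instead of PMI.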


