[Corpora-List] N-gram string extraction

Ted Pedersen ted_pedersen at hotmail.com
Tue Aug 27 18:25:26 UTC 2002


Hi Andrius,

We are always happy to hear from users of BSP/NSP. In fact, we
nearly beg folks to contact us in our READMEs, etc.
Perhaps you could send me some additional details of what you
are trying to do, and how you have done it thus far?
I'm at: tpederse at umn.edu

One newly added feature of NSP is that it allows the user
to define tokens using regular expressions. So you can say that
tokens are two-word sequences that start with the letter 'a'. Or
they can be two-character sequences, or they can be single
characters, etc. They can be whatever you want them to be, really.
However, a poorly crafted or very complex regular expression
can really lead to problems with performance. So the first thing
I would look at is how you are defining your tokens, and I'd
be happy to do this if you contact me.
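
To make the idea of regex-defined tokens concrete, here is a minimal
Python sketch. It is not NSP's own code (NSP is written in Perl); the
function names tokenize and count_ngrams and the sample text are
made up for illustration. It only shows how the choice of token
pattern changes what gets counted as an n-gram.

```python
import re
from collections import Counter

def tokenize(text, token_pattern):
    """Return every non-overlapping match of token_pattern, left to right.

    Characters that match no token pattern are simply skipped.
    """
    return [m.group(0) for m in re.finditer(token_pattern, text)]

def count_ngrams(tokens, n):
    """Count each run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "an apple and a banana"

# Three different token definitions applied to the same text.
word_tokens = tokenize(text, r"\w+")       # whole words
char_tokens = tokenize(text, r"\w")        # single characters
two_char_tokens = tokenize(text, r"\w\w")  # two-character sequences

word_bigrams = count_ngrams(word_tokens, 2)
two_char_unigrams = count_ngrams(two_char_tokens, 1)
```

The simple patterns above are matched in a single left-to-right pass.
Patterns with nested or overlapping quantifiers can force a regex
engine to backtrack heavily, which is one way a token definition can
end up hurting performance on a large corpus.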

For anyone on the list who doesn't know where to find NSP or
what it is, here it is:

http://www.d.umn.edu/~tpederse/nsp.html

Cordially,
Ted Pedersen

>Dear list members,
>
>I am currently working on extraction of statistically significant n-gram
>(1<n<6) strings of alphanumeric characters from a 100-million-character
>corpus, and I intend to apply different significance tests (MI, t-score,
>log-likelihood etc.) to these strings. I'm testing Ted Pedersen's N-gram
>Statistics Package, which seems to be able to perform these tasks; however,
>it hasn't produced any results after one week of running.
>I have a couple of queries regarding n-gram extraction:
>1. I'd like to ask if members of the list are aware of similar software
>capable of accomplishing the above-mentioned tasks reliably and
>efficiently.
>2. And a statistical question. As I need to compute association scores for
>trigrams, tetragrams, and pentagrams as well, I plan to split them into
>bigrams consisting of a string of words plus one word ([n-1]+[1]) and
>compute association scores for them.
>Does anyone know if this is the right thing to do from a statistical point
>of view?
>
>Thank you,
>Andrius Utka
>
>Research Assistant
>Centre for Corpus Linguistics
>University of Birmingham
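
The [n-1]+[1] split described in the quoted question can be sketched
as follows, using pointwise mutual information (MI) as the association
score. This is only an illustration under stated assumptions: the
function names ngram_counts and split_pmi are hypothetical, the
marginal counts are taken over n-gram positions, and the sketch does
not settle whether the split is statistically appropriate.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count each run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def split_pmi(tokens, ngram):
    """PMI of an n-gram treated as a bigram: (first n-1 tokens) + (last token).

    Marginals are counted over n-gram positions: how often the prefix
    fills the first n-1 slots, and how often the last token fills the
    final slot.
    """
    n = len(ngram)
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    joint = counts[ngram]
    if joint == 0:
        raise ValueError("n-gram does not occur in the token sequence")
    prefix, last = ngram[:-1], ngram[-1]
    prefix_count = sum(c for g, c in counts.items() if g[:-1] == prefix)
    last_count = sum(c for g, c in counts.items() if g[-1] == last)
    # log2( P(prefix, last) / (P(prefix) * P(last)) )
    return math.log2(joint * total / (prefix_count * last_count))

tokens = "a b c a b c a b d".split()
score = split_pmi(tokens, ("a", "b", "c"))
```

Here the trigram "a b c" is scored as the pair ("a b", "c"); the same
function handles tetragrams and pentagrams by passing a longer tuple.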




--
Ted Pedersen
http://www.umn.edu/~tpederse




