[Corpora-List] N-gram string extraction
Stefan Evert
evert at IMS.Uni-Stuttgart.DE
Tue Aug 27 15:12:33 UTC 2002
Hi there!
> I am currently working on the extraction of statistically significant
> n-gram (1 < n < 6) strings of alphanumeric characters from a
> 100-million-character corpus, and I intend to apply different
> significance tests (MI, t-score, log-likelihood etc.) to these strings.
> I'm testing Ted Pedersen's N-gram Statistics Package, which seems able
> to handle these tasks; however, it hasn't produced any results after
> one week of running.
That's very probably because it's written in Perl and eating up lots
of memory. I don't think there's a way around C/C++ for problems of
that size (at the moment, at least).
I always thought of NSP as a tool for counting N-grams of _tokens_
rather than characters. Apparently, you can change its definition of a
token, but that means using a trivial regular expression to chop your
100-million-character input corpus into single characters, which is
going to take ages.
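For a job of this size I'd sooner write the counter directly in C++.
Here is a minimal sketch (a hash map of counts for one fixed n,
restricted to alphanumeric strings -- both details are my reading of
your task, so adjust as needed); it should get through 100 million
characters quickly, though memory for n = 5 may still be substantial:

// count_ngrams.cpp -- sketch: count character n-grams (here n = 3) in a
// large file with a hash map.
// Compile: g++ -std=c++17 -O2 count_ngrams.cpp
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: count_ngrams FILE\n"; return 1; }
    const std::string::size_type n = 3;   // n-gram length, 1 < n < 6
    std::ifstream in(argv[1], std::ios::binary);
    std::unordered_map<std::string, long> freq;
    std::string window;                   // sliding window of last n chars
    char c;
    while (in.get(c)) {
        if (!std::isalnum(static_cast<unsigned char>(c))) {
            window.clear();               // non-alphanumeric breaks the string
            continue;
        }
        window.push_back(c);
        if (window.size() > n) window.erase(0, 1);
        if (window.size() == n) ++freq[window];  // count complete n-gram
    }
    for (const auto& [gram, count] : freq)
        std::cout << gram << '\t' << count << '\n';
    return 0;
}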
> I have a couple of queries regarding n-gram extraction:
> 1. I'd like to ask if members of the list are aware of similar
> software capable of accomplishing the above-mentioned tasks reliably
> and efficiently.
I'm afraid I don't know of any such tools. Technically, counting
N-grams produces a very simplistic statistical language model (the
kind used to generate random poetry), so perhaps you can dig up
something in that area.
On the other hand, if you aren't tied to Windows (i.e. you have
access to a Linux or Solaris computer), there's the IMS Corpus
Workbench:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
which isn't quite as outdated as that web page suggests. Although it
isn't obvious from the online materials, the Corpus Workbench could be
abused (with the help of a simple Perl script) to do what you want (at
the price of wasting lots of disk space). Kind of a last resort, I
guess.
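To spell out the "simple Perl script" part: all it has to do is turn
every single character into a token of its own, i.e. write the corpus
in the one-token-per-line vertical format that cwb-encode reads.
Sketched here in C++ for consistency with the counter above (the
restriction to alphanumeric characters is again my assumption):

// verticalise.cpp -- sketch: write each alphanumeric character on a
// line of its own (the one-token-per-line "vertical" format that the
// CWB encoding tools expect).  Compile: g++ -O2 verticalise.cpp
#include <cctype>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: verticalise FILE\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    char c;
    while (in.get(c))
        if (std::isalnum(static_cast<unsigned char>(c)))
            std::cout << c << '\n';       // one pseudo-token per line
    return 0;
}

You can see where the wasted disk space comes from: every character of
the original corpus becomes a full token in the encoded corpus.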
> 2. And a statistical question. As I need to compute association
> scores for trigrams, tetragrams, and pentagrams as well, I plan to
> split them into bigrams consisting of a string of words plus one
> word, i.e. [n-1]+[1], and compute association scores for those.
> Does anyone know if this is the right thing to do from a statistical
> point of view?
Again, I don't know of any well-founded discussion of association
scores for N-grams in the literature. I consider it an intriguing
problem and plan to do some work in this area when I've finished my
thesis on bigram associations.
The most systematic approach to N-grams I've come across is
J.F. da Silva; G.P. Lopes. "A Local Maxima method and Fair Dispersion
Normalization for extracting multi-word units from corpora." MOL 6,
1999.
which can be downloaded from the first author's homepage at
http://terra.di.fct.unl.pt/~jfs/
Their approach is based on breaking up N-grams into pairs of [n-1]+[1]
words, too, but I must say that I'm not really convinced this is the
right way to go.
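Mechanically, at least, the [n-1]+[1] split reduces every N-gram to an
ordinary 2-by-2 contingency table, so the familiar bigram measures
carry over unchanged. A sketch of log-likelihood (Dunning's G^2)
computed that way, with purely hypothetical counts for illustration:

// g2.cpp -- sketch: log-likelihood (G^2) for an [n-1]+[1] split, e.g.
// the trigram (a, b, c) treated as the pair ("a b", "c").
//   o11 = f(a b c)   joint frequency of the full N-gram
//   r1  = f(a b *)   frequency of the [n-1] part
//   c1  = f(* * c)   frequency of the final word
//   n   = number of N-gram positions in the corpus
#include <cmath>
#include <iostream>

// contribution of one cell to G^2; empty cells contribute nothing
static double term(double obs, double expected) {
    return obs > 0.0 ? obs * std::log(obs / expected) : 0.0;
}

// G^2 = 2 * sum_ij O_ij * ln(O_ij / E_ij)  with  E_ij = R_i * C_j / N
double log_likelihood(double o11, double r1, double c1, double n) {
    double o12 = r1 - o11, o21 = c1 - o11, o22 = n - r1 - c1 + o11;
    double e11 = r1 * c1 / n,       e12 = r1 * (n - c1) / n;
    double e21 = (n - r1) * c1 / n, e22 = (n - r1) * (n - c1) / n;
    return 2.0 * (term(o11, e11) + term(o12, e12)
                + term(o21, e21) + term(o22, e22));
}

int main() {
    // hypothetical counts, for illustration only
    std::cout << log_likelihood(30, 50, 60, 1000000) << '\n';
    return 0;
}

Whether it is statistically sound to collapse the first n-1 words into
a single pseudo-word is precisely the open question, of course; the
sketch only shows that the computation itself is straightforward.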
Cheers,
Stefan.
--
Moral: Early to rise and early to bed
makes a male healthy and wealthy and dead.
______________________________________________________________________
C.E.R.T. Marbach (CQP Emergency Response Team)
http://www.ims.uni-stuttgart.de/~evert schtepf at gmx.de