[Corpora-List] N-gram string extraction
Stefan Evert
evert at IMS.Uni-Stuttgart.DE
Tue Aug 27 15:12:33 UTC 2002
Hi there!
> I am currently working on the extraction of statistically significant
> n-gram (1 < n < 6) strings of alphanumeric characters from a
> 100-million-character corpus, and I intend to apply different
> significance tests (MI, t-score, log-likelihood etc.) to these strings.
> I'm testing Ted Pedersen's N-gram Statistics Package, which seems able
> to handle these tasks; however, it hasn't produced any results after
> one week of running.
That's very probably because it's written in Perl and eating up lots
of memory. I don't think there's a way around C/C++ for problems of
that size (at the moment, at least).
I always thought of NSP as a tool for counting N-grams of _tokens_
rather than characters. Apparently, you can change its definition of a
token, but that means using a trivial regular expression to chop your
100-million-character input corpus into single characters, which is
going to take ages.
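For a job of this size I'd sooner write the counter directly in C++.
Here is a minimal sketch (a hash map of counts for one fixed n,
restricted to alphanumeric strings -- both details are my reading of
your task, so adjust as needed); it should get through 100 million
characters quickly, though memory for n = 5 may still be substantial:

// count_ngrams.cpp -- sketch: count character n-grams (here n = 3) in a
// large file with a hash map.
// Compile: g++ -std=c++17 -O2 count_ngrams.cpp
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: count_ngrams FILE\n"; return 1; }
    const std::string::size_type n = 3;   // n-gram length, 1 < n < 6
    std::ifstream in(argv[1], std::ios::binary);
    std::unordered_map<std::string, long> freq;
    std::string window;                   // sliding window of last n chars
    char c;
    while (in.get(c)) {
        if (!std::isalnum(static_cast<unsigned char>(c))) {
            window.clear();               // non-alphanumeric breaks the string
            continue;
        }
        window.push_back(c);
        if (window.size() > n) window.erase(0, 1);
        if (window.size() == n) ++freq[window];  // count complete n-gram
    }
    for (const auto& [gram, count] : freq)
        std::cout << gram << '\t' << count << '\n';
    return 0;
}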
> I have a couple of queries regarding n-gram extraction:
> 1. I'd like to ask if members of the list are aware of similar
> software capable of accomplishing the above-mentioned tasks reliably
> and efficiently.
I'm afraid I don't know of any such tools. Technically, counting
N-grams produces a very simplistic statistical language model (the
kind used to generate random poetry), so perhaps you can dig up
something in that area.
On the other hand, if you aren't tied to Windows (i.e. you have
access to a Linux or Solaris computer), there's the IMS Corpus
Workbench:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
which isn't quite as outdated as that web page suggests. Although it
isn't obvious from the online materials, the Corpus Workbench could be
abused (with the help of a simple Perl script) to do what you want (at
the price of wasting lots of disk space). Kind of a last resort, I
guess.
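To spell out the "simple Perl script" part: all it has to do is turn
every single character into a token of its own, i.e. write the corpus
in the one-token-per-line vertical format that cwb-encode reads.
Sketched here in C++ for consistency with the counter above (the
restriction to alphanumeric characters is again my assumption):

// verticalise.cpp -- sketch: write each alphanumeric character on a
// line of its own (the one-token-per-line "vertical" format that the
// CWB encoding tools expect).  Compile: g++ -O2 verticalise.cpp
#include <cctype>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: verticalise FILE\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    char c;
    while (in.get(c))
        if (std::isalnum(static_cast<unsigned char>(c)))
            std::cout << c << '\n';       // one pseudo-token per line
    return 0;
}

You can see where the wasted disk space comes from: every character of
the original corpus becomes a full token in the encoded corpus.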
> 2. And a statistical question. As I need to compute association
> scores for trigrams, tetragrams, and pentagrams as well, I plan to
> split them into bigrams consisting of a string of words plus one
> word, i.e. [n-1]+[1], and compute association scores for those.
> Does anyone know if this is the right thing to do from a statistical
> point of view?
Again, I don't know of any well-founded discussion of association
scores for N-grams in the literature. I consider it an intriguing
problem and plan to do some work in this area when I've finished my
thesis on bigram associations.
The most systematic approach to N-grams I've come across is
J.F. da Silva; G.P. Lopes. "A Local Maxima method and Fair Dispersion
Normalization for extracting multi-word units from corpora." MOL 6,
1999.
which can be downloaded from the first author's homepage at
http://terra.di.fct.unl.pt/~jfs/
Their approach is based on breaking up N-grams into pairs of [n-1]+[1]
words, too, but I must say that I'm not really convinced this is the
right way to go.
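Mechanically, at least, the [n-1]+[1] split reduces every N-gram to an
ordinary 2-by-2 contingency table, so the familiar bigram measures
carry over unchanged. A sketch of log-likelihood (Dunning's G^2)
computed that way, with purely hypothetical counts for illustration:

// g2.cpp -- sketch: log-likelihood (G^2) for an [n-1]+[1] split, e.g.
// the trigram (a, b, c) treated as the pair ("a b", "c").
//   o11 = f(a b c)   joint frequency of the full N-gram
//   r1  = f(a b *)   frequency of the [n-1] part
//   c1  = f(* * c)   frequency of the final word
//   n   = number of N-gram positions in the corpus
#include <cmath>
#include <iostream>

// contribution of one cell to G^2; empty cells contribute nothing
static double term(double obs, double expected) {
    return obs > 0.0 ? obs * std::log(obs / expected) : 0.0;
}

// G^2 = 2 * sum_ij O_ij * ln(O_ij / E_ij)  with  E_ij = R_i * C_j / N
double log_likelihood(double o11, double r1, double c1, double n) {
    double o12 = r1 - o11, o21 = c1 - o11, o22 = n - r1 - c1 + o11;
    double e11 = r1 * c1 / n,       e12 = r1 * (n - c1) / n;
    double e21 = (n - r1) * c1 / n, e22 = (n - r1) * (n - c1) / n;
    return 2.0 * (term(o11, e11) + term(o12, e12)
                + term(o21, e21) + term(o22, e22));
}

int main() {
    // hypothetical counts, for illustration only
    std::cout << log_likelihood(30, 50, 60, 1000000) << '\n';
    return 0;
}

Whether it is statistically sound to collapse the first n-1 words into
a single pseudo-word is precisely the open question, of course; the
sketch only shows that the computation itself is straightforward.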
Cheers,
Stefan.
--
Moral: Early to rise and early to bed
makes a male healthy and wealthy and dead.
______________________________________________________________________
C.E.R.T. Marbach (CQP Emergency Response Team)
http://www.ims.uni-stuttgart.de/~evert schtepf at gmx.de