[Corpora-List] N-gram string extraction

Chris Brew cbrew at ling.ohio-state.edu
Tue Aug 27 15:27:29 UTC 2002


There's a recent paper by Mikio Yamamoto and Kenneth W. Church,
"Using Suffix Arrays to Compute Term Frequency and Document Frequency
for All Substrings in a Corpus", Computational Linguistics 27(1),
1-30, 2001, which shows efficient ways to compute a number of
interesting quantities over all substrings in a corpus.
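
The heart of it is a suffix array: sort all suffixes of the corpus
once, and the term frequency of any substring falls out of two binary
searches. Here's a toy C++ sketch of that underlying idea (my own
illustration, not their algorithm; the point of the paper is that you
can get statistics for all substrings at once, without enumerating
the O(n^2) of them one by one):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string text = "to be or not to be";

        // Suffix array: positions of all suffixes, sorted
        // lexicographically. (Naive O(n^2 log n) construction;
        // a toy, not something to run on 100M characters.)
        std::vector<std::size_t> sa(text.size());
        for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(),
            [&](std::size_t a, std::size_t b) {
                return text.compare(a, text.npos, text, b, text.npos) < 0;
            });

        // All occurrences of a substring form one contiguous block
        // of the suffix array; two binary searches find the block,
        // and its width is the term frequency.
        std::string q = "to be";
        auto lo = std::lower_bound(sa.begin(), sa.end(), q,
            [&](std::size_t s, const std::string& p) {
                return text.compare(s, p.size(), p) < 0; });
        auto hi = std::upper_bound(sa.begin(), sa.end(), q,
            [&](const std::string& p, std::size_t s) {
                return text.compare(s, p.size(), p) > 0; });
        std::cout << '"' << q << "\" occurs " << (hi - lo) << " times\n";
    }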

Very nice work

C


On Tue, Aug 27, 2002 at 05:12:33PM +0200, Stefan Evert wrote:
>
> Hi there!
>
>    I am currently working on the extraction of statistically
>    significant n-gram (1<n<6) strings of alphanumeric characters from
>    a 100-million-character corpus, and I intend to apply different
>    significance tests (MI, t-score, log-likelihood etc.) to these
>    strings. I'm testing Ted Pedersen's N-gram Statistics Package,
>    which seems to be able to perform these tasks, but it hasn't
>    produced any results after a week of running.
>
> That's very probably because it's written in Perl and eats up lots
> of memory. I don't think there's a way around C/C++ for problems of
> that size (at the moment, at least).
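>
> Just to give an idea of what that looks like, here's a bare-bones
> character N-gram counter in C++ (an untested sketch of my own, not
> part of NSP; it slurps the corpus from stdin and takes the N-gram
> length as its argument):
>
>     #include <iostream>
>     #include <iterator>
>     #include <string>
>     #include <unordered_map>
>
>     int main(int argc, char** argv) {
>         // N-gram length, e.g. "count 5 < corpus.txt"
>         const std::size_t n = (argc > 1) ? std::stoul(argv[1]) : 2;
>
>         // Read the whole corpus (100M characters = 100 MB of RAM).
>         std::string text((std::istreambuf_iterator<char>(std::cin)),
>                          std::istreambuf_iterator<char>());
>
>         // One hash table entry per distinct N-gram,
>         // one lookup per corpus position.
>         std::unordered_map<std::string, long> freq;
>         for (std::size_t i = 0; i + n <= text.size(); ++i)
>             ++freq[text.substr(i, n)];
>
>         for (const auto& e : freq)
>             std::cout << e.second << '\t' << e.first << '\n';
>     }
>
> Memory goes to the distinct N-grams plus the corpus itself, which
> should stay quite manageable for character N-grams up to n=5.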
>
> I always thought of NSP as a tool for counting N-grams of _tokens_
> rather than characters. Apparently you can change its definition of
> a token, but that means using a trivial regular expression to chop
> single characters off your 100-million-character input corpus, which
> is going to take ages.
>
>    I have a couple of queries regarding n-gram extraction:
>    1. I'd like to ask if members of the list are aware of similar
>    software capable of accomplishing the above-mentioned tasks
>    reliably and efficiently.
>
> I'm afraid I don't know of any such tools. Technically, counting
> N-grams produces a very simplistic statistical language model (the
> kind used to generate random poetry), so perhaps you can dig up
> something in that area.
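>
> For what it's worth, the "random poetry" trick is nothing more than
> this: after counting character bigrams, you emit each next character
> with probability proportional to how often it followed the current
> one. A toy sketch of my own, hard-wired to a one-line corpus:
>
>     #include <iostream>
>     #include <map>
>     #include <random>
>     #include <string>
>
>     int main() {
>         std::string corpus = "the cat sat on the mat and the rat sat too";
>
>         // follows[c][d] = how often character d followed c.
>         std::map<char, std::map<char, int>> follows;
>         for (std::size_t i = 0; i + 1 < corpus.size(); ++i)
>             ++follows[corpus[i]][corpus[i + 1]];
>
>         // Sample each next character in proportion to its count.
>         std::mt19937 rng(42);
>         char c = 't';
>         std::cout << c;
>         for (int i = 0; i < 60; ++i) {
>             int total = 0;
>             for (const auto& e : follows[c]) total += e.second;
>             if (total == 0) break;   // dead end
>             int r = std::uniform_int_distribution<int>(1, total)(rng);
>             for (const auto& e : follows[c])
>                 if ((r -= e.second) <= 0) { c = e.first; break; }
>             std::cout << c;
>         }
>         std::cout << '\n';
>     }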
>
> On the other hand, if you aren't tied to Windows (i.e. you have
> access to a Linux or Solaris computer), there's the IMS Corpus
> Workbench:
>
> http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
>
> which isn't quite as outdated as that web page suggests. Although it
> isn't obvious from the online materials, the Corpus Workbench could be
> abused (with the help of a simple Perl script) to do what you want (at
> the price of wasting lots of disk space). Kind of a last resort, I
> guess.
>
>    2. And a statistical question. As I need to compute association
>    scores for trigrams, tetragrams, and pentagrams as well, I plan to
>    split them into bigrams consisting of a string of words plus one
>    word, i.e. [n-1]+[1], and to compute association scores for those.
>    Does anyone know if this is the right thing to do from a
>    statistical point of view?
>
> Again, I don't know of any well-founded discussion of association
> scores for N-grams in the literature. I consider it an intriguing
> problem and plan to do some work in this area when I've finished my
> thesis on bigram associations.
>
> The most systematic approach to N-grams I've come across is
>
> J. F. da Silva and G. P. Lopes. "A Local Maxima method and Fair
> Dispersion Normalization for extracting multi-word units from
> corpora." MOL 6, 1999.
>
> which can be downloaded from the first author's homepage at
>
>   http://terra.di.fct.unl.pt/~jfs/
>
> Their approach is based on breaking up N-grams into pairs of [n-1]+[1]
> words, too, but I must say that I'm not really convinced this is the
> right way to go.
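>
> The mechanics of the split are at least easy to state: treat each
> N-gram as a pair (first n-1 words, last word), fill the usual 2-by-2
> contingency table, and apply log-likelihood exactly as for ordinary
> bigrams. A C++ sketch of Dunning's G^2 along those lines (my own
> illustration; it assumes you have already counted the four input
> frequencies, and the numbers in main() are made up):
>
>     #include <cmath>
>     #include <iostream>
>
>     // One cell's contribution: observed * log(observed / expected).
>     static double cell(double o, double e) {
>         return (o > 0.0) ? o * std::log(o / e) : 0.0;
>     }
>
>     // G^2 for an N-gram split as [n-1]+[1]:
>     //   f  = frequency of the whole N-gram     ("kick the bucket")
>     //   f1 = frequency of the first n-1 words  ("kick the")
>     //   f2 = frequency of the last word        ("bucket")
>     //   N  = number of N-gram positions in the corpus
>     double log_likelihood(double f, double f1, double f2, double N) {
>         double e11 = f1 * f2 / N,       e12 = f1 * (N - f2) / N;
>         double e21 = (N - f1) * f2 / N, e22 = (N - f1) * (N - f2) / N;
>         return 2.0 * (cell(f, e11) + cell(f1 - f, e12) +
>                       cell(f2 - f, e21) + cell(N - f1 - f2 + f, e22));
>     }
>
>     int main() {
>         std::cout << log_likelihood(30, 1000, 2000, 1000000) << '\n';
>     }
>
> MI and t-score fall out of the same four frequencies, so the hard
> part is the counting, not the arithmetic. Whether the scores mean
> the same thing for such pairs as they do for true word bigrams is,
> again, the part I'm not convinced about.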
>
> Cheers,
> Stefan.
>
> --
> Moral: Early to rise and early to bed
>        makes a male healthy and wealthy and dead.
> ______________________________________________________________________
> C.E.R.T. Marbach                         (CQP Emergency Response Team)
> http://www.ims.uni-stuttgart.de/~evert                  schtepf at gmx.de
>

--
=================================================================
Dr. Chris Brew,  Assistant Professor of Computational Linguistics
Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
Tel:  +614 292 5420 Fax: +614 292 8833
Email:cbrew at ling.osu.edu
=================================================================


