[Corpora-List] [osander at gmx.de: How to extract N-grams]

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Mon Sep 2 17:50:42 UTC 2002


Oliver Sander had some further comments on methods for the extraction
of N-grams from corpora, but wasn't sure whether the people on the
list would be interested in reading them. Well, I think we are.

I'd like to use this occasion for a short comment on the Computational
Linguistics article by Yamamoto and Church that was mentioned in this
context. Very interesting work, indeed, but I can't help the feeling
that if only we had a MIPS 10000 with 16 GBytes of RAM, the naive
approach would do quite well, too. At least for N-grams of any
reasonable size.

Cheers,
Stefan

------- Start of forwarded message -------
From: Oliver Sander <osander at gmx.de>
Subject: How to extract N-grams
Date: Sun, 1 Sep 2002 13:42:57 +0200

-snip-------------------------------------------------------------------------
Hello.

[David Graff]
>    On the contrary, using Perl on a large data set can be reasonably
>    economical in terms of memory usage if the Perl code is written
>    reasonably well, which is likely true in the case of Ted Pederson's
>    package.  (Sure, it might take up more active RAM than the equivalent
>    program written in C in most cases, and it certainly is possible to
>    write Perl code badly, such that it would run out of memory on any
>    machine -- the same thing can happen in C, of course...)

I basically share this opinion. Perl does have a memory overhead compared
to C/C++, but that overhead scales quite gracefully with problem size. So if
a computation is completely infeasible in Perl, chances are slim that a C
implementation will fare much better. What is more, the simplicity of Perl
makes it easier to experiment with different approaches. If an approach turns
out to be interesting, it may be worth reimplementing it in C as a kind of
"production version".

[Stefan Evert]
> Memory usage is usually the biggest culprit in performance problems
> with large amounts of data. Once the machine runs out of physical RAM
> and has to use the swap partition, performance degrades dramatically,
> sometimes so badly as to give the impression that the program has got
> stuck entirely (we've had such a case with a hash implementation in C,
> actually). Therefore, reducing memory load even by a constant factor
> of three or so can make an enormous difference.

I agree that saving storage space to fit the problem into main memory is
sometimes a good solution. But if you have to handle problems that do not
fit into RAM anyway (like 100 million tokens), you have to take care of
memory management yourself. And that may be easier in Perl, Python or
another scripting language.
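Just to make that concrete, here is a rough sketch of one way it could look
in Perl (this is only my own illustration, not code from any of the packages
mentioned; $N and $MAX_KEYS are made-up parameters): count n-grams in a hash
only up to a fixed number of keys, write the sorted partial counts to a
chunk file on disk, and merge the sorted chunks at the end.

    #!/usr/bin/perl
    # Sketch only: spill partial n-gram counts to disk whenever the hash
    # grows too large, then merge the sorted chunk files at the end.
    use strict;
    use warnings;

    my $N        = 3;           # n-gram size (example value)
    my $MAX_KEYS = 1_000_000;   # flush threshold; tune to available RAM
    my (@chunks, %freq, @window);

    sub flush_chunk {
        return unless keys %freq;
        my $file = "chunk" . scalar(@chunks) . ".tmp";
        open(my $out, ">", $file) or die "cannot write $file: $!";
        print $out "$_\t$freq{$_}\n" for sort keys %freq;  # sorted by n-gram
        close $out;
        push @chunks, $file;
        %freq = ();
    }

    while (<STDIN>) {
        for my $token (split) {
            push @window, $token;
            next if @window < $N;
            $freq{ join("\t", @window) }++;
            shift @window;
            flush_chunk() if keys(%freq) >= $MAX_KEYS;
        }
    }
    flush_chunk();

    # Merge the sorted chunk files with sort(1) -m, which streams them
    # from disk, and sum the partial counts of identical n-grams.
    $ENV{LC_ALL} = "C";   # byte order, matching Perl's own sort
    open(my $merged, "-|", "sort", "-m", @chunks) or die "sort -m: $!";
    my ($prev, $total) = (undef, 0);
    while (<$merged>) {
        chomp;
        my ($key, $cnt) = /^(.*)\t(\d+)$/;
        if (defined $prev and $key ne $prev) {
            print "$prev\t$total\n";
            $total = 0;
        }
        $prev = $key;
        $total += $cnt;
    }
    print "$prev\t$total\n" if defined $prev;
    unlink @chunks;

The chunk files never have to fit into RAM together; only one hash of
bounded size and one line per chunk file are in memory at any time.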

[Stefan Evert]
> I'm not sure what you understand by "reasonably well", but for the
> case of counting N-grams I would take the best solution to be
>
> $freq{join("\t", @n_gram)}++;

In terms of computation time (assuming no swapping is needed), this is a good
approach. But the memory consumption is quite high: you have to store *all*
n-grams that appear, even those that occur only once. When you encounter n
consecutive words, you don't yet know whether they occur anywhere else, so
you have to store them.
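
For what it's worth, the complete naive counter is only a few lines around
that statement ($N below is of course just an example value):

    #!/usr/bin/perl
    # Naive in-memory n-gram counting built around the line quoted above.
    # Every n-gram in the input becomes a hash key, hapax legomena
    # included, which is exactly where the memory goes.
    use strict;
    use warnings;

    my $N = 3;                # n-gram size (example value)
    my (%freq, @window);

    while (<STDIN>) {
        for my $token (split) {
            push @window, $token;
            next if @window < $N;
            $freq{ join("\t", @window) }++;
            shift @window;
        }
    }

    print "$freq{$_}\t$_\n" for sort { $freq{$b} <=> $freq{$a} } keys %freq;

So the code is trivially simple; the problem is purely the size of %freq.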

[Christer Johansson]
>    Somewhere a sort operation is needed.

I think that is a good direction.
Nagao ("A new method of N-gram statistics for large number of n and
        automatic extraction of words and phrases from large text data of
        Japanese")
suggests a sorted suffix-array to extract n-grams. The sorting takes n*lg n,
but the huge memory overhead of storing all appearing consecutive tokens is
avoided. Moreover, the sorting can be implemented with a merging of sorted
subsets, which makes it possible (rather easily) to put parts of the data on
disc, whereas only some parts are held in main memory.
I think this sorted suffix-array technique is the best scalable (with problem
size) algorithm available.

[An implementation of this algorithm seems to be available from the
author's homepage. --Stefan]
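
To illustrate the principle for a fixed n, here is a toy in-memory sketch
(not the full method from the Nagao paper, which, as far as I understand it,
keeps complete suffixes and uses longest-common-prefix information to obtain
counts for all n at once; $N is again just an example parameter):

    #!/usr/bin/perl
    # Sketch: sort the token positions by the n tokens starting there;
    # equal n-grams then end up adjacent, so a single pass over the
    # sorted positions yields the counts.  Only the token array and
    # arrays of integer positions are kept, not one string per n-gram.
    use strict;
    use warnings;

    my $N     = 3;                             # n-gram size (example)
    my $text  = do { local $/; <STDIN> };      # slurp the whole corpus
    my @token = split ' ', (defined $text ? $text : '');

    # one entry per position where an n-gram of length N starts
    my @pos = (0 .. $#token - $N + 1);

    my @sorted = sort {
        join("\t", @token[$a .. $a + $N - 1])
            cmp
        join("\t", @token[$b .. $b + $N - 1])
    } @pos;

    # count run lengths of identical n-grams
    my ($prev, $count) = (undef, 0);
    for my $p (@sorted) {
        my $gram = join("\t", @token[$p .. $p + $N - 1]);
        if (defined $prev and $gram ne $prev) {
            print "$count\t$prev\n";
            $count = 0;
        }
        $prev = $gram;
        $count++;
    }
    print "$count\t$prev\n" if defined $prev;

Replacing the in-memory sort by a merge of sorted position files gives the
disk-friendly variant described above.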

Hope this helps.
Cheers,
Oliver Sander
------- End of forwarded message -------
