[Corpora-List] [osander at gmx.de: How to extract N-grams]

Rykov V.V. rykov at narod.ru
Tue Sep 3 05:54:35 UTC 2002


  Stefan - I'm game (as my Irish postgrad friend used to tell me).
  I think every word in your letter is true.

  Vladimir Rykov

>
>Oliver Sander had some further comments on methods for extracting
>N-grams from corpora, but wasn't sure whether the people on the
>list would be interested in reading them. Well, I think we are.
>
>I'd like to use this occasion for a short comment on the Computational
>Linguistics article by Yamamoto and Church that was mentioned in this
>context. Very interesting work, indeed, but I can't help feeling
>that if only we had a MIPS R10000 with 16 GBytes of RAM, the naive
>approach would do quite well, too. At least for N-grams of any
>reasonable size.
>
>Cheers,
>Stefan
>
>------- Start of forwarded message -------
>From: Oliver Sander <osander at gmx.de>
>Subject: How to extract N-grams
>Date: Sun, 1 Sep 2002 13:42:57 +0200
>
>-snip-------------------------------------------------------------------------
>Hello.
>
>[David Graff]
>>    On the contrary, using Perl on a large data set can be reasonably
>>    economical in terms of memory usage if the Perl code is written
>>    reasonably well, which is likely true in the case of Ted Pedersen's
>>    package.  (Sure, it might take up more active RAM than the equivalent
>>    program written in C in most cases, and it certainly is possible to
>>    write Perl code badly, such that it would run out of memory on any
>>    machine -- the same thing can happen in C, of course...)
>
>I basically have the same opinion. Perl does have a memory overhead
>compared to C/C++, but this overhead grows quite gracefully with problem
>size. So if a computation is completely infeasible in Perl, chances are
>that a C implementation won't get it right either. Moreover, the
>simplicity of Perl makes it easier to play with different approaches. If
>an approach turns out to be interesting, it might be a good idea to
>implement it in C, as a kind of "production version".
>
>[Stefan Evert]
>> Memory usage is usually the biggest culprit in performance problems
>> with large amounts of data. Once the machine runs out of physical RAM
>> and has to use the swap partition, performance degrades drastically,
>> sometimes so badly that the program appears to be stuck entirely
>> (we've had such a case with a hash implementation in C, actually).
>> Therefore, reducing memory load even by a constant factor of three
>> or so can make an enormous difference.
>
>I agree that saving storage space to fit the problem into main memory is
>sometimes a good solution. But if you have to handle problems that do not
>fit into RAM anyway (like 100 million tokens), you have to take care of
>memory management yourself. And that may be easier in Perl, Python or
>another scripting language.
>
>[Stefan Evert]
>> I'm not sure what you understand by "reasonably well", but for the
>> case of counting N-grams I would take the best solution to be
>>
>> $freq{join("\t", @n_gram)}++;
>
>In terms of computation time (as long as no memory swapping is needed),
>this is a good approach. But the memory consumption is quite high: you
>have to store *all* n-grams that appear, even if they occur only once.
>When you encounter n consecutive words, you don't know whether they occur
>anywhere else, so you have to store them.
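>
>[For illustration, a minimal self-contained Perl sketch of this
>hash-based approach; the whitespace tokenization and the choice of
>$n = 3 are assumptions:
>
>  #!/usr/bin/perl -w
>  # Count n-grams with a sliding window and a hash.
>  # Assumes whitespace-tokenized text on STDIN; $n is a placeholder.
>  use strict;
>  my $n = 3;
>  my (@window, %freq);
>  while (<STDIN>) {
>      foreach my $token (split) {
>          push @window, $token;
>          shift @window if @window > $n;
>          $freq{join("\t", @window)}++ if @window == $n;
>      }
>  }
>  # Print n-grams by descending frequency.
>  foreach my $gram (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
>      print "$freq{$gram}\t$gram\n";
>  }
>
>Note that %freq keeps one hash entry per distinct n-gram, which is
>exactly the memory cost discussed above.]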
>
>[Christer Johansson]
>>    Somewhere a sort operation is needed.
>
>I think that is a good direction.
>Nagao ("A new method of N-gram statistics for large number of n and
>        automatic extraction of words and phrases from large text data of
>        Japanese")
>suggests a sorted suffix-array to extract n-grams. The sorting takes n*lg n,
>but the huge memory overhead of storing all appearing consecutive tokens is
>avoided. Moreover, the sorting can be implemented with a merging of sorted
>subsets, which makes it possible (rather easily) to put parts of the data on
>disc, whereas only some parts are held in main memory.
>I think this sorted suffix-array technique is the best scalable (with problem
>size) algorithm available.
>
>[An implementation of this algorithm seems to be available from the
>author's homepage. --Stefan]
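>
>[Again for illustration, a minimal in-memory Perl sketch of the
>suffix-array idea: sort the token start positions by the n words that
>follow them, then count runs of equal n-grams in one linear scan. The
>tokenization and $n = 3 are assumptions, and the on-disk merging
>mentioned above is omitted:
>
>  #!/usr/bin/perl -w
>  # Suffix-array style n-gram counting: only token positions are stored;
>  # the n-gram strings are built transiently during comparison.
>  use strict;
>  my $n = 3;
>  my @tokens = split ' ', join(' ', <STDIN>);
>  my $last = @tokens - $n;        # last valid n-gram start position
>  my @starts = sort {
>      join("\t", @tokens[$a .. $a+$n-1])
>          cmp join("\t", @tokens[$b .. $b+$n-1])
>  } (0 .. $last);
>  # Equal n-grams are now adjacent; count each run.
>  my ($prev, $count) = ("", 0);
>  foreach my $i (@starts, -1) {   # -1 is a sentinel to flush the last run
>      my $gram = $i < 0 ? "" : join("\t", @tokens[$i .. $i+$n-1]);
>      if ($gram eq $prev) { $count++; next }
>      print "$count\t$prev\n" if $count;
>      ($prev, $count) = ($gram, 1);
>  }
>
>Unlike the hash approach, peak memory here is dominated by the token
>list and the position array, not by the number of distinct n-grams.]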
>
>Hope this helped.
>Cheers,
>Oliver Sander
>------- End of forwarded message -------
>


--

    RYKOV V.V.   MOSCOW

Vladimir Rykov, PhD in Computational Linguistics,
 MOSCOW
http://rykov.narod.ru/
Engl. http://www.blkbox.com/~gigawatt/rykov.html
Tel +7-903-749-19-99


