[Corpora-List] Perl efficiency (Re: N-gram string extraction)

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Thu Aug 29 09:18:57 UTC 2002


At first I thought this would be off-topic and too technical for the
list, especially now that Andrius has solved his problem. However,
since a lot of the respondents seem to be using Perl for large-scale
data manipulation, I'd like to share my experiences.


David Graff wrote:

   On the contrary, using Perl on a large data set can be reasonably
   economical in terms of memory usage if the Perl code is written
   reasonably well, which is likely true in the case of Ted Pederson's
   package.  (Sure, it might take up more active RAM than the equivalent
   program written in C in most cases, and it certainly is possible to
   write Perl code badly, such that it would run out of memory on any
   machine -- the same thing can happen in C, of course...)

I'm not sure what you understand by "reasonably well", but for the
case of counting N-grams I would take the best solution to be

$freq{join("\t", @n_gram)}++;

where the array @n_gram holds the components of an N-gram token. (Less
experienced Perl programmers might be tempted to use nested hashes, as
in $freq{$c1}{$c2}{$c3}{$c4}++; but that isn't easily generalised to
arbitrary N anyway.) Although Perl's hashes are known to be fast, their
memory footprint is much larger than that of a comparable C data
structure (in my experience at least the 3:1 ratio one can observe for
Java against C/C++, e.g. with different implementations of Xalan).
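
To make this a bit more concrete, here is a minimal sketch of the
flat-hash approach for word N-grams; the variable names and the simple
whitespace tokenization are mine, purely for illustration, and are not
taken from Ted's package:

use strict;

my $N = 3;       # N-gram size
my %freq;        # flat hash: TAB-joined N-gram -> frequency
my @window;      # sliding window over the last N tokens

while (my $line = <STDIN>) {
  foreach my $token (split ' ', $line) {
    push @window, $token;
    shift @window if @window > $N;
    $freq{join("\t", @window)}++ if @window == $N;
  }
}

while (my ($ngram, $f) = each %freq) {
  print "$ngram\t$f\n";
}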

Memory usage is usually the biggest culprit in performance problems
with large amounts of data. Once the machine runs out of physical RAM
and has to use the swap partition, performance degrades dramatically.
The slowdown can be so severe as to give the impression that the
program has got stuck entirely (we've actually had such a case with a
hash implementation in C). Therefore, reducing memory load even by a
constant factor of three or so can make an enormous difference.


Gregor Erbach suggested:

   off the top of my head, I would think that the memory problems arise
   from the large number of n-grams for which counts are held in memory.
   so i don't quite see how c/c++ could improve the situation.

That's quite right. The number of different N-grams grows more or less
exponentially with N. For instance, I've found almost 250,000
different character 6-grams among 1 million characters of German text.

The main difference between C/C++ and Perl (or Java, for that matter)
is the amount of overhead. For every N-gram type (i.e. every different
sequence of N characters) in the corpus, an optimised C implementation
would typically require N+1 bytes for the N-gram string, 4 bytes for
the N-gram frequency (as a 32-bit integer) and another 8-20 bytes for
pointers and indices. Storing the same information in a Perl hash
requires, for each N-gram, a Perl string (which can grow dynamically
and probably carries some other management information as well) as the
key, and a full-fledged Perl scalar (which is quite a big data
structure) for the frequency value.

Even worse, Perl will allocate the necessary memory in small chunks as
it is needed, whereas the C implementation can store its fixed-size
data in large tables allocated during startup.
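
For anyone who wants to check the overhead on their own system, the
following sketch fills a hash with dummy 6-grams and reports its total
size. It relies on the Devel::Size module from CPAN (not part of the
standard distribution), and the zero-padded dummy keys obviously just
stand in for real corpus data:

use strict;
use Devel::Size qw(total_size);

my $N = 6;
my %freq;

# fill the hash with 100,000 distinct dummy character 6-grams
foreach my $i (0 .. 99_999) {
  $freq{sprintf("%0${N}d", $i)}++;
}

my $bytes = total_size(\%freq);
printf "%d keys, %d bytes, %.1f bytes per hash entry\n",
  scalar(keys %freq), $bytes, $bytes / scalar(keys %freq);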


Apart from the memory problems, there is no denying that Perl is
inherently a slow language, especially when you write "C in Perl" (as
far as I know, simple computations can be as much as 300 times faster
in C).

Although Perl's regular expression engine has a very good reputation,
a fairly general tokenizer written in Perl -- which has to try various
regular expressions for every token identified -- will invariably be
slow, even if it uses regexes as simple as /\w/ or /\w+/.
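
To give an idea of the kind of code I mean (the token classes and
regexes below are made up purely for illustration and not taken from
any particular tool): even a loop as simple as this one tries up to
four regexes at every position in the input, for every single token.

use strict;

while (my $line = <STDIN>) {
  pos($line) = 0;
  while (pos($line) < length($line)) {
    if    ($line =~ /\G\s+/gc)             { next }                 # skip whitespace
    elsif ($line =~ /\G(\d+(?:\.\d+)?)/gc) { print "NUM\t$1\n"   }  # numbers
    elsif ($line =~ /\G(\w+)/gc)           { print "WORD\t$1\n"  }  # word tokens
    elsif ($line =~ /\G(\S)/gc)            { print "PUNCT\t$1\n" }  # anything else
  }
}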



Christer Johansson's guess was:

   Somewhere a sort operation is needed.

I don't think a sort is really necessary in this case (after all, it's
just about counting N-gram frequencies), but there is a fair chance
that the output is sorted, either alphabetically or by frequency.

   I guess that sort operation is
   implemented in a "simple for the programmer" way. Which means that it is
   likely somewhere between n*n and n*n*n in time. Unix sort uses more efficient
   algorithms that are more likely n*log n.   One million keys would take
   between 10^12 and 10^18 operations in the slow versions, in the fast sort
   version it is 10^6*log(2?) of 10^6; is it somewhere near 20*10^6? This
   is most likely where your problem is.

Perl uses the quicksort algorithm (if I'm not mistaken, it does
actually call qsort() from the standard library), so on average it
should have a complexity of O(n*log n). However, at least when a
custom sort order is used (and this includes sorting in descending
order of frequency), a callback to some Perl code has to be made for
every comparison. In my experience, this can cause a substantial
slowdown for large sorts, although it's never been a critical problem.
It is much more efficient to save the frequency data in whatever order
they come out of the Perl hash, and then apply an external sort
program.
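
For illustration, here are the two alternatives side by side; the hash
contents are dummy values, and the external sort command in the final
comment is just one possible way (in bash) of ordering the
TAB-separated output by its frequency column:

use strict;

my %freq = ("ab" => 10, "cd" => 3, "ef" => 7);   # dummy frequency data

# (1) sorting inside Perl: the comparison block is a Perl callback
#     that is invoked for every single comparison the sort makes
foreach my $ngram (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
  print "$ngram\t$freq{$ngram}\n";
}

# (2) usually cheaper for very large hashes: write the pairs out in
#     whatever order each() returns them and sort externally, e.g.
#
#       ngrams.perl 6 corpus.txt | sort -t $'\t' -k2,2nr > ngrams.sorted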



Below, you'll find the best Perl solution that I can think of at the
moment. Andrius, if you can spare the time, it would be interesting to
see how it compares to Ted's general tool. Usage is

ngrams.perl 5 file1.txt file2.txt ... > ngrams.list

etc. I tested it with 1 million characters of German text. On my
machine, it took about 3 minutes to extract 6-grams from the data, but
it did eat up quite a lot of memory. I don't know how it performs on
files with missing newlines, though ...

Cheers,
Stefan.

==================== ngrams.perl ====================

#!/usr/bin/perl -w
# $| = 1;
use strict;

my $N = shift @ARGV;

my %joint_f;     # N-gram string -> joint frequency
my @marginal_f;  # $marginal_f[$j]{$char} -> frequency of $char at position $j

my ($j, $f, $idx, $max_idx, $ngram);

my $tokens = 0;
my $data = "";
while (<>) {
  print STDERR "processing line #$.\r" if ($. & 0x1fff) == 0;
  chomp;
  $data .= " ".$_; # assumes line breaks correspond to blanks
  # slide a window of width N over the buffered text
  $max_idx = length($data) - $N;
  for ($idx = 0; $idx <= $max_idx; $idx++) {
    $ngram = substr($data, $idx, $N);
    $joint_f{$ngram}++;                         # joint N-gram frequency
    for $j (0 .. $N-1) {
      $marginal_f[$j]{substr($ngram,$j,1)}++;   # character frequency at position $j
    }
    $tokens++;
  }
  # keep only the last N-1 characters, so that an N-gram spanning the
  # line boundary is counted exactly once on the next pass
  $data = substr($data, -($N-1)) if $N > 1 and length($data) > $N-1;
  $data = "" if $N <= 1;
}

print STDERR "\nWriting $N-grams to stdout ... ";

my $types = 0;
# header line: N-gram, joint frequency f, marginal frequencies f1 .. fN
print "\tf\t", join("\t", map {"f$_"} (1..$N)), "\n";
while (($ngram, $f) = each %joint_f) {
  $types++;
  print "$ngram\t$f";
  # append the marginal frequency of the character at each position
  for $j (0 .. $N-1) {
    print "\t", $marginal_f[$j]{substr($ngram,$j,1)};
  }
  print "\n";
}

print STDERR "Done.\n [$tokens tokens, $types types]\n";


