[Corpora-List] Perl efficiency (Re: N-gram string extraction)

Thu Aug 29 09:50:53 UTC 2002

A few observations to add to Stefan's excellent email.

First up, I think that writing in Perl is the correct solution for
something you need to do now. For text processing tasks, nothing that
I've seen (Java, C, C++) even comes close in terms of development
time. If you're willing to spend an order of magnitude more time
writing code, then you will of course get a solution that will run
faster and smaller, but you're trading off your time (valuable) for
CPU cycles (cheap). The point at which that tradeoff becomes
worthwhile will vary a lot from person to person.

Stefan's point about Perl's memory usage is well taken. In this case
though, when you know that the data structure is a simple
string->number hash, and you know you'll be dealing with a great many
keys, you can simply tie the hash to a disk file, which moves the
memory usage issue from RAM to disk. The code should not end up
swapping, and you could potentially deal with many more ngrams than an
in-core C/C++/Java implementation could handle. There may be nice
libraries for doing the equivalent of a tie in those languages, but
are they as easy to use?
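
To make the tie suggestion concrete, here is a minimal sketch, not
anyone's production code. It assumes SDBM_File (which ships with core
Perl; DB_File is a common alternative if Berkeley DB is installed) and
whitespace-separated tokens. The sub name count_ngrams and the $dbfile
path prefix are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl;        # supplies O_RDWR and O_CREAT
use SDBM_File;    # core module; assumed here, DB_File would also do

# Count word n-grams in a hash tied to an on-disk SDBM database.
# $dbfile is a hypothetical path prefix; SDBM creates "$dbfile.pag"
# and "$dbfile.dir" next to it.
sub count_ngrams {
    my ($dbfile, $n, @lines) = @_;

    tie(my %count, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $dbfile: $!";

    for my $line (@lines) {
        my @w = split ' ', $line;
        # Slide a window of $n tokens; the range is empty for short lines.
        for my $i (0 .. @w - $n) {
            $count{ join ' ', @w[ $i .. $i + $n - 1 ] }++;
        }
    }

    # Copy into an ordinary hash only so this demo can return the counts;
    # with a really large key set you would keep working on the tied hash.
    my %copy = %count;
    untie %count;
    return \%copy;
}
```

Calling count_ngrams("/tmp/ngrams", 2, "a b a b") should leave the
counts in the SDBM files rather than in RAM while the loop runs. Note
that SDBM limits the combined size of a key and its value to roughly
1K, which is plenty for short n-grams but worth knowing about.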

Anyhow - "Programming Perl" has lots more to say about various forms
of efficiency in Perl. If you're loath to spend time crafting code in
languages less suited to processing text, then have a browse through
the Efficiency section in Chapter 25 - you may not need to give up as
much speed as you think you do.

S.