[Corpora-List] N-gram string extraction

David Graff graff at unagi.cis.upenn.edu
Tue Aug 27 15:51:40 UTC 2002


evert at IMS.Uni-Stuttgart.DE said:
>    I am currently working on extraction of statistically significant
>    n-gram (1<n<6) strings of alpha-numerical characters from a 100 mln
>    character corpus, and I intend to apply different significance tests
>    (MI, t-score, log-likelihood etc.) on these strings. I'm testing Ted
>    Pedersen's N-gram Statistics Package, which seems being able to
>    produce the tasks, however it hasn't produced any results after one
>    week of running.
>
> That's very probably because it's written in Perl and eating up lots
> of memory. I don't think there's a way around C/C++ for problems of
> that size (at the moment, at least).

On the contrary, Perl can be quite economical in its memory usage on
a large data set if the code is written reasonably well, which is
likely true of Ted Pedersen's package.  (Sure, a Perl program will
usually take up more active RAM than the equivalent program written
in C, and it certainly is possible to write Perl code so badly that
it would run out of memory on any machine -- but the same thing can
happen in C, of course...)
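
To make the memory point concrete, here is a minimal sketch of what
"written reasonably well" could look like.  It is not taken from Ted
Pedersen's package; the sub names count_ngrams and count_stream are
made up for illustration, and it assumes that a token is a run of
ASCII letters and digits and that n-grams do not cross line
boundaries.  The input is read one line at a time, so memory grows
with the number of distinct n-grams rather than with the size of the
corpus:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any package): add the token n-grams
# of one line to the running counts in the hash that $counts refers to.
sub count_ngrams {
    my ($line, $n, $counts) = @_;

    # Assumption: a token is a run of ASCII letters and digits.
    my @tokens = $line =~ /[A-Za-z0-9]+/g;

    # Index of the last position where a full n-gram can start;
    # negative (so the loop is skipped) when the line is too short.
    my $last_start = @tokens - $n;
    for my $i (0 .. $last_start) {
        my $gram = join ' ', @tokens[$i .. $i + $n - 1];
        $counts->{$gram}++;
    }
    return;
}

# Hypothetical driver: read a filehandle line by line, never holding
# more than one line of input in memory at a time.
sub count_stream {
    my ($fh, $n) = @_;
    my %counts;
    while (my $line = <$fh>) {
        count_ngrams($line, $n, \%counts);
    }
    return \%counts;
}
```

A call such as count_stream(\*STDIN, 2) would count bigrams from
standard input.  On a very large corpus the hash of distinct n-grams
can still get big, so this only illustrates the line-at-a-time
pattern, not a tuned solution.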

In this case, it's more likely that the user is missing something
simple about the basic usage of the package's utility programs -- e.g.
if a Perl program (let's call it "util.perl") is written in this manner:

  #!/usr/bin/perl

  # <> reads the files named on the command line, or STDIN if none
  while (<>) {
     # do stuff...
  }

and the user simply runs the program at the command line like this:

  util.perl

that is, with no file name, and no pipeline or redirection to provide
data on STDIN for the program, it will "run" indefinitely, until the
user kills it somehow -- it's just waiting for input data to work on.
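
A script can protect its users from this.  The following is only a
sketch, and missing_input_message is a made-up name, not something
from the package in question.  It relies on Perl's -t file test,
which is true when a filehandle is attached to a terminal: with no
file arguments and STDIN on a terminal, nothing is going to arrive
unless the user types it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a usage message when there is nothing
# to read (no file arguments and STDIN is a terminal); otherwise
# returns undef.
sub missing_input_message {
    my ($args, $stdin_is_tty) = @_;
    return if @$args || !$stdin_is_tty;
    return "usage: $0 data.file   (or: $0 < data.file)\n";
}
```

Near the top of the script, before the while (<>) loop, it might be
used like this:

  my $msg = missing_input_message(\@ARGV, -t STDIN);
  die $msg if defined $msg;

Passing the argument list and the terminal check in as parameters
just makes the helper easy to try out on its own.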

Check the documentation for the utility program(s) in question; it may
just be a matter of making sure that you are using one of the following
kinds of command line:

   cat data.file | util.perl
or
   util.perl < data.file
or
   util.perl data.file

Or it may be something more subtle in the usage of the package
programs -- but it's bound to be just a matter of getting the usage
right.

-----------
David Graff			Linguistic Data Consortium
graff at ldc.upenn.edu		3615 Market St., Suite 200
voice: (215) 898-0887		University of Pennsylvania
fax:   (215) 573-2175		Philadelphia, PA 19104
		http://www.ldc.upenn.edu


