[Corpora-List] producing n-gram lists in java

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Tue Oct 11 10:34:48 UTC 2005


> Is Java a requirement? There are some good utilities for this in Perl 
> such as:
> http://search.cpan.org/~vlado/Text-Ngrams-1.7/Ngrams.pm
> (shameless plug for one of my profs :P)
> Seriously though, it is a good utility and if you are just doing text 
> processing it shouldn't really matter what language you are doing it in.
> 

To put in another plug, if you aren't tied to Java and Windows, and if
you're looking for a quick solution, you might try the IMS Corpus
Workbench (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/).
The cwb-scan-corpus program included in the Workbench can handle
shorter n-grams from corpora of BNC dimensions, especially when the
parts of speech of the component words are restricted (the Corpus
Encoding Tutorial on the "Users' Corner" page gives some examples of
how the program is used). 

If you need to handle very long n-grams (or very large corpora), you
should go for suffix trees. You should be aware, though, that the
sorting step in Yamamoto & Church's implementation is a very expensive
operation and will take its time. There are other implementations of
suffix trees that build frequency lists in memory (you have to limit
the maximal size of the n-grams, though), but I don't know how well
they handle very large data sets.
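
To illustrate the in-memory approach with a capped n-gram size, here is
a minimal Java sketch. Note that it uses a plain hash map, not a suffix
tree, and that the class and method names (NgramCounter, count) are
made up for this example. It assumes the text is already tokenized and
that all distinct n-grams up to maxN fit in memory, which may well not
hold for very large corpora.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramCounter {

    // Counts every n-gram of length 1..maxN in the token sequence.
    // Keys are the tokens of the n-gram joined by single spaces.
    public static Map<String, Integer> count(List<String> tokens, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Extend the n-gram one token at a time, up to maxN tokens
            // or the end of the input, whichever comes first.
            for (int n = 1; n <= maxN && start + n <= tokens.size(); n++) {
                if (n > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + n - 1));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> counts = count(tokens, 3);
        // Print each n-gram with its frequency (iteration order is unspecified).
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

The memory footprint grows with the number of distinct n-grams, which
is why the maximal n has to be limited; for long n-grams or very large
corpora the suffix-based methods mentioned above are the better fit.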

Hope this hilft,
Stefan.


