Corpora: PC-based programs to create lists of n-grams

Mark Davies mdavies at ilstu.edu
Mon Oct 15 13:57:22 UTC 2001


As I mentioned in a related message last week, I'm in the process of
creating a list of 1-, 2-, and 3-grams (maybe 4- and 5-grams too) in a
100-million-word corpus of Spanish.

What I'm looking for is a program that will allow me to create these lists
of n-grams more efficiently than what I have presently.  I need a solution
that has the following features:

** PC-based (DOS or Windows)
** Output in non-proprietary ASCII format
** Can easily handle input files as large as 1,000,000 words (hopefully,
much larger)
** Can be run in "batch file" mode, i.e. without human intervention,
process a list of 40 different 1,000,000 word input files, and return 40
output files with the lists of n-grams.
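
To make the batch requirement concrete, here is a rough Python sketch of the
kind of processing described above. It is an illustration only, not an
existing tool: the names count_ngrams, write_counts and process_files are
hypothetical, as are the tab-separated "frequency, n-gram" output layout, the
simple \w+ tokenizer, and the Latin-1 default encoding (assumed here because
the corpus is Spanish). N-grams are allowed to span line breaks, and all
counts for one file are held in memory, which should be workable at about one
million words per file but is untested on larger inputs.

```python
import re
import sys
from collections import Counter, deque
from pathlib import Path

# Assumption: a token is any run of letters/digits; real corpora may need more care.
TOKEN = re.compile(r"\w+")


def count_ngrams(lines, n):
    """Count n-grams over an iterable of text lines, reading one line at a time."""
    counts = Counter()
    window = deque(maxlen=n)  # sliding window holding the last n tokens
    for line in lines:
        for token in TOKEN.findall(line.lower()):
            window.append(token)
            if len(window) == n:
                counts[" ".join(window)] += 1
    return counts


def write_counts(counts, out_path, encoding="latin-1"):
    """Write 'frequency<TAB>n-gram' lines, most frequent first."""
    with open(out_path, "w", encoding=encoding) as out:
        for ngram, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            out.write(f"{freq}\t{ngram}\n")


def process_files(in_paths, out_dir, sizes=(1, 2, 3), encoding="latin-1"):
    """Unattended batch run: one plain-text output file per input file and n."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for in_path in map(Path, in_paths):
        for n in sizes:
            # Re-read the file for each n so only one table is in memory at a time.
            with open(in_path, encoding=encoding) as f:
                counts = count_ngrams(f, n)
            out_path = out_dir / f"{in_path.stem}.{n}grams.txt"
            write_counts(counts, out_path, encoding=encoding)
            written.append(out_path)
    return written


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python ngrams.py OUT_DIR file1.txt file2.txt ...
    process_files(sys.argv[2:], sys.argv[1])
```

Run from a DOS or Windows batch file, a script along these lines would take
the 40 input files as arguments and need no further intervention; the output
files are ordinary text that any editor or database loader can read.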

I've been using WordSmith, which can be run in "batch file" mode, and which
has been quite useful.  The problem with WordSmith, however, is that it
exports the lists of n-grams in a proprietary format, and these then have to
be converted manually -- one by one -- to standard ASCII files.  In
addition, it doesn't cope well with input files much larger than about one
million words.

I already know that there are some very nice Unix/Linux-based solutions,
but I'm really looking for something that is PC-based, since my students
will also be using something like this in the near future, and all we have
here are PCs :-(.

In addition, I've seen reference to Perl scripts that can be run on a PC,
such as the <bigram-generate.prl> script that comes with the Brill tagger,
and which can be run with Windows ActivePerl.  While I may very well end up
using this or a similar Perl script, I'm also very interested in
"stand-alone" solutions.

Thanks in advance for your help.  I'll post a summary if there is interest.

Mark Davies


====================================================
Mark Davies, Associate Professor, Spanish Linguistics
4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
309-438-7975 (voice) / 309-438-8083 (fax)
http://mdavies.for.ilstu.edu/

** Corpus design and use / Web-database programming and optimization **
** Historical and dialectal Spanish and Portuguese syntax / Distance
education **
=====================================================


