Corpora: PC-based programs to create lists of n-grams

Dragomir Radev radev at si.umich.edu
Mon Oct 15 20:08:06 UTC 2001


Check out the CMU-Cambridge Statistical Language Modeling Toolkit:

http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
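
If a small script turns out to be enough for the batch requirement in the quoted message (several input files in, one plain-text n-gram list per file out), something along the following lines might do. This is only a rough sketch in Python and has nothing to do with the toolkit above. The tokenizer, the Latin-1 encoding, the output file naming, and the function names are all assumptions made for illustration. Note also that Spanish accented characters are not strictly ASCII, so the output here is plain Latin-1 text rather than 7-bit ASCII.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumed tokenizer: lowercase runs of letters (accented letters included).
# Real corpus work would likely need more careful tokenization.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens; keys are tuples of words."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def write_ngram_list(in_path, out_path, n, encoding="latin-1"):
    """Read one input file and write 'n-gram<TAB>frequency' lines,
    most frequent first. The encoding is an assumption; adjust as needed."""
    text = Path(in_path).read_text(encoding=encoding)
    counts = count_ngrams(tokenize(text), n)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(out_path, "w", encoding=encoding) as out:
        for gram, freq in ordered:
            out.write(" ".join(gram) + "\t" + str(freq) + "\n")


# Batch mode, no human intervention:
#   python ngrams.py 2 file01.txt file02.txt ...
# writes file01.txt.2grams.txt, file02.txt.2grams.txt, ...
if __name__ == "__main__" and len(sys.argv) > 2:
    n = int(sys.argv[1])
    for name in sys.argv[2:]:
        write_ngram_list(name, f"{name}.{n}grams.txt", n)
```

This holds each file's tokens and counts in memory, which should be fine for files of around a million words but would need rethinking (for example, sorting on disk) for a single pass over a much larger corpus.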

Drago

Mark Davies wrote:
>
> As I mentioned in a related message last week, I'm in the process of
> creating a list of 1, 2, and 3-grams (maybe 4 and 5-grams too) in a 100
> million word corpus of Spanish.
>
> What I'm looking for is a program that will allow me to create these lists
> of n-grams more efficiently than what I have presently.  I need a solution
> that has the following features:
>
> ** PC-based (DOS or Windows)
> ** Output in non-proprietary ASCII format
> ** Can easily handle input files as large as 1,000,000 words (hopefully,
> much larger)
> ** Can be run in "batch file" mode, i.e. without human intervention,
> process a list of 40 different 1,000,000 word input files, and return 40
> output files with the lists of n-grams.
>
> I've been using WordSmith, which can be run in "batch file" mode, and which
> has been quite useful.  The problem with WordSmith, however, is that it
> exports the lists of n-grams in a proprietary format, which then have to
> be converted manually -- one by one -- to standard ASCII files.  In
> addition, it doesn't cope well with input files much larger than about one
> million words.
>
> I already know that there are some very nice Unix/Linux-based solutions,
> but I'm really looking for something that is PC-based, since my students
> will also be using something like this in the near future, and all we have
> here are PCs :-(.
>
> In addition, I've seen reference to Perl scripts that can be run on a PC,
> such as the <bigram-generate.prl> script that comes with the Brill tagger,
> and which can be run with Windows ActivePerl.  While I may very well end up
> using this or a similar Perl script, I'm also very interested in
> "stand-alone" solutions.
>
> Thanks in advance for your help.  I'll post a summary if there is interest.
>
> Mark Davies
>
>
> ====================================================
> Mark Davies, Associate Professor, Spanish Linguistics
> 4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
> 309-438-7975 (voice) / 309-438-8083 (fax)
> http://mdavies.for.ilstu.edu/
>
> ** Corpus design and use / Web-database programming and optimization **
> ** Historical and dialectal Spanish and Portuguese syntax / Distance
> education **
> =====================================================
>
>
>


--
Dragomir R. Radev                                         radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev


