[Corpora-List] producing n-gram lists in java

Chris Jordan cjordan at cs.dal.ca
Mon Oct 10 17:42:23 UTC 2005


Is Java a requirement? There are some good utilities for this in Perl 
such as:
http://search.cpan.org/~vlado/Text-Ngrams-1.7/Ngrams.pm
(shameless plug for one of my profs :P)
Seriously though, it is a good utility and if you are just doing text 
processing it shouldn't really matter what language you are doing it in.

Alternatively, have you tried increasing the amount of memory the JVM is 
allowed to use? Try rerunning your program with the flag:
-Xmx768m
This flag raises the JVM's maximum heap size to 768 MB, as opposed to the 
default of 64 MB (or 32 MB... can't remember).
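
If it helps, here is a rough sketch for checking that the flag actually took 
effect and for watching how much heap a HashMap of n-gram counts eats up. 
The class name HeapCheck and its methods are made up for illustration (they 
are not from your program), and it assumes Java 8 or later; the only JVM 
calls it relies on are the standard Runtime.maxMemory(), totalMemory() and 
freeMemory().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative helper (hypothetical name): reports heap limits and counts n-grams.
class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in MB (this is what -Xmx controls). */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Approximate heap currently in use, in MB. */
    static long usedHeapMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    /** Counts word n-grams over a token array with a plain in-memory HashMap. */
    static Map<String, Integer> countNgrams(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            // Join tokens i .. i+n-1 into a single space-separated key.
            String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(ngram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
        String[] tokens = "the cat sat on the mat and the cat slept".split(" ");
        Map<String, Integer> bigrams = countNgrams(tokens, 2);
        System.out.println("Distinct bigrams: " + bigrams.size());
        System.out.println("Used heap: " + usedHeapMegabytes() + " MB");
    }
}
```

Running it as "java -Xmx768m HeapCheck" and again without the flag should 
show a different maximum heap figure, which confirms the setting is being 
picked up.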


martincd at aston.ac.uk wrote:

>Dear Corpora List,
>
>I am currently trying to develop a Java programme to produce a list of the
>most frequently occurring ngrams.
>The problem I have is that the amount of data that needs to be stored in
>memory (currently stored in a HashMap) becomes unmanageably large for any
>corpus greater than about 5 million words.
>I have attempted to overcome this problem by splitting the corpus into
>batches of 1 million tokens and then collecting all of the smaller ngram
>list files into the final list, but this process was far too slow and
>would have taken many many hours (if not days) to complete.
>I have also created an index of the corpus in the form of a MySQL
>database that stores token positions, but I'm unsure of how I could query
>it to produce n-grams (since querying to list for each individual n-gram
>will only lead to the same problems).
>Does anyone know how I might go about creating the ngram-list java programme?
>Thank you for your help,
>Chris
>
>--------------------------------------------------------------------------
>Christopher Martin
>Computer Science student
>Aston University, Birmingham, UK
>
>
>  
>
