[Corpora-List] producing n-gram lists in java

Constantin Orasan C.Orasan at wlv.ac.uk
Mon Oct 10 18:18:30 UTC 2005


Hi,

Do you have any particular reason why you want to implement this in
Java? If you work in Unix or you have Cygwin installed, you can produce
lists of ngrams sorted by frequency very efficiently by piping the
output of a program which prints n words on each line to:
 | sort | uniq -c | sort -nr
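
To make the counting step concrete, here is a small sketch on made-up
data. The first sort brings identical lines together, uniq -c collapses
each run into one line prefixed with its count, and sort -nr puts the
highest counts first. The function names and the sample bigrams are
only for illustration.

```shell
# count_ngrams: read one ngram per line on stdin and print "count ngram"
# lines, most frequent first. This is just the pipeline from the message
# wrapped in a function; the name is an assumption made for this sketch.
count_ngrams() {
    sort | uniq -c | sort -nr
}

# Made-up sample: five bigram lines, as an ngram-printing program would
# emit them (one ngram per line).
sample_bigrams() {
    printf '%s\n' 'the cat' 'cat sat' 'the cat' 'sat on' 'the cat'
}

sample_bigrams | count_ngrams
```

On this sample the first output line is the count 3 followed by
"the cat"; the two bigrams that occur once come after it.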

All you need to do is to write a program which prints groups of n words
on every line. This can be easily achieved by moving a window of n words
across the corpus.

A Perl program which produces these lines is the following (it takes n
as its first argument and assumes that there is one word on each line):

#!/usr/bin/perl
use strict;
use warnings;

my $n = $ARGV[0];   # the length of ngrams
die "usage: $0 n < one-word-per-line-file\n"
    unless defined $n && $n =~ /^[1-9]\d*$/;

my @list = ();      # sliding window holding the last (up to) n words

# tokens treated as punctuation marks; an ngram never spans one of them
my %punct = map { $_ => 1 }
    (".", ",", ";", "_", "/", "gt", "!", "?", "//", "=", "-", "*",
     "\$", "#", ":", "\"", "'");

while (my $line = <STDIN>) {
    # chomp removes only the line ending; chop would eat the last letter
    # of a final line that has no newline
    chomp($line);

    # do not include punctuation in ngrams: restart the window
    if ($punct{$line}) {
        @list = ();
        next;
    }

    # skip empty or whitespace-only lines
    next if $line =~ /^\s*$/;

    # slide the window: add the new word, drop the oldest if needed
    push @list, $line;
    shift @list if @list > $n;

    # print as soon as the window holds n words, so that the ngram which
    # ends just before a punctuation mark or the end of input is not lost
    print "@list\n" if @list == $n;
}
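
Since the program assumes one word on each line, running text has to be
split first. A minimal sketch of that step follows, assuming a very
crude notion of a token: every punctuation character becomes a token of
its own, so "don't" is split into three pieces. The function name and
the file names corpus.txt and ngrams.pl are placeholders, not anything
standard.

```shell
# one_word_per_line: crude tokenizer for running text on stdin.
# Assumption: each punctuation character is its own token; a real
# tokenizer would treat apostrophes, hyphens, numbers etc. with more care.
one_word_per_line() {
    sed 's/[[:punct:]]/ & /g' |    # put spaces around each punctuation mark
        tr -s '[:space:]' '\n' |   # one token per line, blank runs squeezed
        sed '/^$/d'                # drop an empty line left at the start
}

# Hypothetical end-to-end run for bigrams (corpus.txt and ngrams.pl are
# placeholder names; ngrams.pl is the Perl program saved to a file):
#   one_word_per_line < corpus.txt | perl ngrams.pl 2 | sort | uniq -c | sort -nr
```

Because the punctuation marks end up on lines of their own, the Perl
program can use them to stop ngrams from running across them.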

Regards,

Constantin

> Dear Corpora List,
> 
> I am currently trying to develop a Java programme to produce a list of the
> most frequently occurring ngrams.
> The problem I have is that the amount of data that needs to be stored in
> memory (currently stored in a HashMap) becomes unmanageably large for any
> corpus greater than about 5 million words.
> I have attempted to overcome this problem by splitting the corpus into
> batches of 1 million tokens and then collecting all of the smaller ngram
> list files into the final list, but this process was far too slow and
> would have taken many many hours (if not days) to complete.
> I have also created an index of the corpus in the form of a MySQL
> database that stores token positions, but I'm unsure of how I could query
> it to produce n-grams (since querying it for each individual n-gram
> would only lead to the same problems).
> Does anyone know how I might go about creating the ngram-list java programme?
> Thank you for your help,
> Chris
> 
> --------------------------------------------------------------------------
> Christopher Martin
> Computer Science student
> Aston University, Birmingham, UK
> 
> 
-- 
Constantin Orasan
Lecturer in Computational Linguistics
University of Wolverhampton
http://www.wlv.ac.uk/~in6093/