[Corpora-List] producing n-gram lists in java
Chris Callison-Burch
callison-burch at ed.ac.uk
Mon Oct 10 17:04:07 UTC 2005
Dear Christopher,
The suffix array data structure is particularly useful for this type of
problem. You should check out "Using Suffix Arrays to compute Term
Frequency and Document Frequency for All Substrings in a Corpus" by
Yamamoto and Church, which has code examples written in C that are
pretty easy to replicate in Java.
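
To give a rough idea of the approach, here is a minimal sketch in Java. It is
not the code from the paper: the class name SuffixArrayNGrams and its methods
are made up for illustration, it works on whitespace tokens rather than
characters, and it uses a plain comparison sort limited to n tokens instead of
the more efficient construction and LCP-based counting that Yamamoto and
Church describe. The idea is to sort the token positions by the n tokens that
follow them, so that identical n-grams end up next to each other and can be
counted in one pass.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not from the paper.
final class SuffixArrayNGrams {

    // Compare the suffixes starting at token positions a and b,
    // looking at most n tokens deep. A suffix that runs out of tokens
    // first sorts before a longer one with the same prefix.
    static int compare(String[] tokens, int a, int b, int n) {
        for (int k = 0; k < n; k++) {
            boolean aEnd = a + k >= tokens.length;
            boolean bEnd = b + k >= tokens.length;
            if (aEnd || bEnd) {
                return aEnd == bEnd ? 0 : (aEnd ? -1 : 1);
            }
            int c = tokens[a + k].compareTo(tokens[b + k]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    // Count every n-gram in the token sequence.
    static Map<String, Integer> count(String[] tokens, int n) {
        // The "suffix array": one entry per token position.
        Integer[] sa = new Integer[tokens.length];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        Arrays.sort(sa, (a, b) -> compare(tokens, a, b, n));

        // Identical n-grams are now adjacent, so each run is one n-gram type.
        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < sa.length) {
            int j = i + 1;
            while (j < sa.length && compare(tokens, sa[i], sa[j], n) == 0) {
                j++;
            }
            // Skip positions too close to the end to start a full n-gram.
            if (sa[i] + n <= tokens.length) {
                String gram = String.join(" ",
                        Arrays.copyOfRange(tokens, sa[i], sa[i] + n));
                counts.put(gram, j - i);
            }
            i = j;
        }
        return counts;
    }
}
```

Calling SuffixArrayNGrams.count("to be or not to be".split(" "), 2) gives a
count of 2 for "to be" and 1 for each of the other three bigrams. For a real
corpus you would probably map tokens to integer IDs, keep the positions in an
int[] rather than an Integer[], and write the counts out during the scan
instead of collecting them in a map, since the point is that the index needs
one entry per token rather than one hash entry per distinct n-gram.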
Yours,
Chris Callison-Burch
@article{ Yamamoto2001,
author = {Mikio Yamamoto and Kenneth Church},
title = {Using Suffix Arrays to compute Term Frequency and Document
Frequency for All Substrings in a Corpus},
journal = {Computational Linguistics},
year = {2001},
volume = {27},
number = {1},
pages = {1--30},
url = {http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf}
}
On Oct 10, 2005, at 5:15 PM, martincd at aston.ac.uk wrote:
> Dear Corpora List,
>
> I am currently trying to develop a Java programme to produce a list of
> the most frequently occurring n-grams.
> The problem I have is that the amount of data that needs to be stored
> in memory (currently stored in a HashMap) becomes unmanageably large
> for any corpus greater than about 5 million words.
> I have attempted to overcome this problem by splitting the corpus into
> batches of 1 million tokens and then merging all of the smaller n-gram
> list files into the final list, but this process was far too slow and
> would have taken many hours (if not days) to complete.
> I have also created an index of the corpus in the form of a MySQL
> database that stores token positions, but I'm unsure of how I could
> query it to produce n-grams (since querying it for each individual
> n-gram would only lead to the same problems).
> Does anyone know how I might go about creating the n-gram list Java
> programme?
> Thank you for your help,
> Chris
>
> --------------------------------------------------------------------------
> Christopher Martin
> Computer Science student
> Aston University, Birmingham, UK
>
>