[Corpora-List] Corpus-building for minority languages

Fri Mar 19 19:20:20 UTC 2004

> Can you give a rough comparison (on the mailing list) of how this compares
> with CorpusBuilder, from Carnegie Mellon University?

It is very similar. I didn't discover CorpusBuilder until my
crawler was almost complete; otherwise I probably would have
just used theirs! (NLP is not my main research area.)

Most of the differences are ones an end user wouldn't see or care
about. For instance, my method of "query generation" is quite
different: my goal is the broadest possible coverage, at the risk
of having to throw away a lot of documents not in the desired languages.

I also do some real "crawling": that is, following internal links in the
documents recursively (and giving up on branches when they stop
yielding new documents in the target language).
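
As an illustration only, the pruned crawl described above could be
sketched in Python as follows. This is not the actual crawler's code:
the names crawl, fetch, in_target_language and patience are all
hypothetical, and the rule "give up after `patience` consecutive
misses" is an assumed reading of "giving up on branches".

```python
from collections import deque

def crawl(seeds, fetch, in_target_language, patience=2):
    """Follow links breadth-first, abandoning a branch after `patience`
    consecutive pages that are not in the target language.

    fetch(url) must return (text, links); in_target_language(text) must
    return a bool. Both are supplied by the caller (hypothetical hooks).
    """
    seen = set(seeds)
    kept = []
    # Each queue entry carries the number of consecutive misses on its branch.
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, misses = queue.popleft()
        text, links = fetch(url)
        if in_target_language(text):
            kept.append(url)
            misses = 0          # the branch is productive again
        else:
            misses += 1
        if misses >= patience:
            continue            # give up on this branch
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, misses))
    return kept
```

Passing fetch and the language test in as functions keeps the sketch
independent of any particular HTTP library or language filter.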

Another difference seems to be that my software tries to
build the language filter on the fly. It seems, though,
that they've tried this too (here's the relevant chunk from their paper):

"Our approach performs well at collecting documents in a minority
language starting from a few words or documents but it does require
a language filter for that minority language.  There are filters
available for quite a few language[s] but this is potentially
a limitation of our approach.  In earlier work, we experimented
with constructing a filter on-the-fly... with... encouraging results."
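
To make "building the filter on the fly" concrete, here is a minimal
Python sketch. It assumes a character-trigram profile compared by
cosine similarity, which is a common stand-in and not necessarily what
either this crawler or CorpusBuilder does; the class name
OnTheFlyFilter and the threshold value are hypothetical.

```python
import math
from collections import Counter

def trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class OnTheFlyFilter:
    """Language filter that starts from a little seed text and grows."""

    def __init__(self, seed_text, threshold=0.3):
        self.profile = trigrams(seed_text)
        self.threshold = threshold  # assumed value; needs tuning per language

    def accept(self, text):
        grams = trigrams(text)
        ok = cosine(self.profile, grams) >= self.threshold
        if ok:
            # Fold each accepted document back into the profile.
            self.profile.update(grams)
        return ok
```

One design caveat: because accepted documents feed the profile, a few
false positives early on can pull the filter toward the wrong language,
so a stricter threshold at the start may be safer.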

-Kevin

>
>
> http://www-2.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/corpusbuilder/
>
>     Mike Maxwell
>     Linguistic Data Consortium
>     maxwell at ldc.upenn.edu