[Corpora-List] language-specific harvesting of texts from the Web

Kevin Patrick Scannell scannell at slu.edu
Tue Aug 31 17:16:47 UTC 2004


On Tuesday 31 August 2004 11:11 am, Mike Maxwell wrote:
> Mark P. Line wrote:
>  > I've been playing with Google searches for extracting texts in a
>  > particular language from the Web without a lot of noise (i.e. few
>  > texts that aren't in the desired language). Any comments on the
>  > utility of this approach for more serious corpus research?
>
> I've been using basically this approach to find websites for a number of
> languages (Bengali, Tamil, Panjabi, Tagalog, Tigrinya and Uzbek).

I have (yet another) tool taking essentially the same approach:

http://borel.slu.edu/crubadan/

It is based on the Google API, wget, etc. I mentioned it
on this list sometime in the spring.

I am planning on releasing the source code as soon as I get a chance
to tidy things up a bit. The real feature of the program is that
it can bootstrap the language model from a pretty minimal amount
of seed text. The queries are generated automatically by finding
candidate stopwords from the top of the frequency list (and filtering
out words near the top of other languages' frequency lists)
and then randomly adding in words from the rest of the corpus
"OR"'d together. The crawler is running and collecting text for
more than 150 languages at the moment:

http://borel.slu.edu/crubadan/stadas.html
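
For anyone curious, the query generation described above can be sketched
roughly as follows. This is only an illustration, not the crawler's actual
code: the function names, the whitespace tokenizer, the cutoff top_n, and
the choice to require the stopwords while OR-ing the random words are all
assumptions made for the example.

```python
import random
from collections import Counter

def candidate_stopwords(seed_text, other_language_tops, top_n=20):
    """Split the seed vocabulary into candidate stopwords and the rest.

    other_language_tops is a set of words that sit near the top of other
    languages' frequency lists (hypothetical input, supplied by the caller).
    """
    counts = Counter(seed_text.lower().split())   # naive whitespace tokenizer
    ranked = [word for word, _ in counts.most_common()]
    top, rest = ranked[:top_n], ranked[top_n:]
    # Drop frequent words that are also frequent in other languages.
    stopwords = [w for w in top if w not in other_language_tops]
    return stopwords, rest

def build_query(stopwords, rest, n_stop=2, n_rest=3, rng=random):
    """Combine a few stopwords with randomly chosen words OR'd together."""
    chosen_stop = rng.sample(stopwords, min(n_stop, len(stopwords)))
    chosen_rest = rng.sample(rest, min(n_rest, len(rest)))
    parts = chosen_stop + [" OR ".join(chosen_rest)]
    return " ".join(p for p in parts if p)

if __name__ == "__main__":
    seed = "agus an agus an the agus an the bhí sé teach mór cat madra"
    stop, rest = candidate_stopwords(seed, {"the"}, top_n=3)
    print(build_query(stop, rest))
```

The resulting string would then be handed to the search API, and the pages
that come back are added to the corpus so the frequency lists improve on
the next round.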

I have a small army of open source volunteers who are native speakers
of one or more of the languages helping create spell checking word
lists and helping to deal with some of the issues that Mike
Maxwell raised in his message (odd character encodings,
separating dialects/orthographies, etc.). Mike covered most
of the important difficulties that arise, so I don't have much
to add other than the offer to answer any questions about the implementation,
or in fact to run the crawler on behalf of anyone willing
to send me some seed text in their target language.

Kevin


