[Corpora-List] language-specific harvesting of texts from the Web

Mike Maxwell maxwell at ldc.upenn.edu
Tue Aug 31 16:11:28 UTC 2004


Mark P. Line wrote:

 > I've been playing with Google searches for extracting texts in a
 > particular language from the Web without a lot of noise (i.e. few
 > texts that aren't in the desired language). Any comments on the
 > utility of this approach for more serious corpus research?


I've been using basically this approach to find websites for a number of
languages (Bengali, Tamil, Panjabi, Tagalog, Tigrinya and Uzbek).
Earlier we used this, or something quite similar, for Hindi and Cebuano,
and I've experimented with it for Tzeltal and Shuar.  It is easy to
extend to other languages; basically, you just look in a dictionary or
grammar for a few function words.  Once you find a website, tools like
wget will let you download it to build a corpus; you can then test
whether a given file from that site is in the language by various other
means.  (If the language has a specific Unicode range, testing is trivial.)
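
For what it's worth, that Unicode-range test really is only a few lines
of code.  Here is a rough Python sketch for Bengali; the block
boundaries come from the Unicode charts, but the 50% threshold is just
an arbitrary illustration:

    # Decide whether a decoded text is (mostly) in the Bengali block,
    # U+0980..U+09FF.  Assumes the file has already been converted to
    # Unicode text; the threshold is arbitrary.
    BENGALI = (0x0980, 0x09FF)

    def looks_like_bengali(text, threshold=0.5):
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return False
        in_block = sum(1 for c in letters
                       if BENGALI[0] <= ord(c) <= BENGALI[1])
        return in_block / len(letters) >= threshold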

You may get some interference with closely related languages.  Your
Tagalog search, for example, might be bringing up pages in other
Philippine languages.  (I don't know that it is, since I don't know
Tagalog--requiring that 'ang' and 'may' be adjacent probably prevents
this.  If you had left off the quotes around the phrase, I suspect your
precision would have been somewhat lower.)

You can of course do these sorts of searches with the Google API, which
allows you to semi-automate the downloads.  I've done that to find all
the pages at a given site that are in a particular language, in cases
where tools like wget didn't work.
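
Most of the "semi-automation" is just fetching and saving whatever URLs
the search returns.  Something along the lines of the sketch below would
do it, with a hypothetical google_search() standing in for whatever
search API you have access to:

    import os
    import urllib.request

    def fetch_results(urls, out_dir="corpus"):
        # Save each result page to disk; dead links and timeouts are
        # simply skipped.
        os.makedirs(out_dir, exist_ok=True)
        for i, url in enumerate(urls):
            try:
                data = urllib.request.urlopen(url, timeout=30).read()
            except OSError:
                continue
            name = os.path.join(out_dir, "page%05d.html" % i)
            with open(name, "wb") as f:
                f.write(data)

    # urls = google_search('site:example.com "ang may"')   # hypothetical
    # fetch_results(urls)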

More sophisticated methods, e.g. tools like CorpusBuilder, are needed
when you want to build an exhaustive corpus of some language, and you
have the time to build a language filter.
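
A language filter doesn't have to be anything fancy to be useful,
though.  Even a crude function-word ratio, along the lines of the sketch
below, will weed out a lot of junk; the word list and threshold here are
purely illustrative (vaguely Tagalog-flavored), not a tested model:

    # Crude language filter: what fraction of the tokens are known
    # high-frequency function words of the target language?
    FUNCTION_WORDS = {"ang", "ng", "sa", "na", "mga", "ay", "at", "may"}

    def probably_in_language(text, threshold=0.05):
        tokens = text.lower().split()
        if not tokens:
            return False
        hits = sum(1 for t in tokens if t in FUNCTION_WORDS)
        return hits / len(tokens) >= threshold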

One situation where your approach may not work so well is when a
language's websites use multiple character encodings.  Unfortunately,
this is quite common in languages that have non-Roman writing systems,
such as the Indic languages, or Tigrinya (and I imagine Amharic,
although I haven't tried it there).  For Hindi, which is the worst case
we've seen yet, virtually every newspaper site had its own proprietary
(=undocumented) encoding, and one site (the Indian parliament) claimed
to use five different proprietary encodings.  (I'm not sure they really
did, but they did suggest downloading five different fonts.)  The
multiple character encoding problem doesn't reduce your precision, which
is what you say you're really interested in, but it will definitely
reduce your recall.  When last I looked, the only Hindi news sites using
Unicode were the Voice of America and the BBC.  There were a number of
other Hindi websites using Unicode, but they tended to be in countries
other than India; two that come to mind were a museum in Australia, and
Colgate.  I think there's next to nothing in Tigrinya in Unicode,
whereas there is a fair amount (I won't say a lot) in other encodings.
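
One cheap diagnostic is to look at what charset, if any, the downloaded
pages declare; sites using proprietary font encodings generally declare
a Latin charset (or nothing at all) and rely on a downloadable font to
render the text.  A rough sketch:

    import re

    # Pull the declared charset out of a page's <meta> tag, if any.
    CHARSET_RE = re.compile(rb"charset\s*=\s*['\"]?([A-Za-z0-9_-]+)", re.I)

    def declared_charset(raw_html):
        m = CHARSET_RE.search(raw_html)
        return m.group(1).decode("ascii", "replace").lower() if m else None

    # declared_charset(open("page.html", "rb").read())  ->  e.g. "utf-8"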

Variant spelling systems can also cause problems.  You won't run into
this for major languages, but you may for recently written languages
(Mayan and Quechuan languages) or languages of the former Soviet Union
(Chechen is a case in point).  I thought this might be the case with
Nahuatl, but apparently the c/qu vs. k issue isn't as "hot" for Nahuatl
languages as it is for some other languages of Latin America.
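
Where variant spellings are a live issue, the workaround is just to
search for every variant of your function words.  For the c/qu vs. k
case the correspondence is regular enough to generate mechanically; the
sketch below shows the usual mapping, though the exact rules would need
checking for any particular language:

    import re

    def k_variant(word):
        # 'que'/'qui' -> 'ke'/'ki'; 'c' before a/o/u -> 'k'
        w = word.replace("qu", "k")
        return re.sub(r"c(?=[aou])", "k", w)

    # search for both the original spelling and k_variant(word)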

The same method can of course be used for non-Unicode non-Roman
websites; you just have to find some such websites to start with, so you
know how to spell the words in whatever encoding they're using.
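
Getting those seed spellings is mostly a matter of counting.  Given one
page you already know is in the language, something like the following
will give you candidate search terms in whatever byte encoding the site
uses; you then eyeball the list against a rendered copy of the page to
pick out the function words:

    import re
    from collections import Counter

    def frequent_terms(raw_bytes, n=20):
        # Work on raw bytes so the (possibly undocumented) encoding
        # doesn't matter; crude tag stripping, then count tokens.
        text = re.sub(rb"<[^>]*>", b" ", raw_bytes)
        tokens = re.split(rb"[\s&;,.()\"']+", text)
        counts = Counter(t for t in tokens if len(t) > 1)
        return [t for t, _ in counts.most_common(n)]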

I recently ran into some bizarre pseudo-Unicode websites in Bengali.
They use HTML character entities for Unicode codepoints, but not all the
codepoints are actually in the Bengali section of Unicode--they appear
to be using other "Unicode" (scare quotes intentional) codepoints for
contextual variants of characters.  BTW, Google treats HTML character
entities as if they were ordinary Unicode codepoints, which simplifies
search.
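
Checking how "real" the Unicode on such a page is takes only a couple of
lines: decode the character references and see how many of the resulting
codepoints actually fall in the Bengali block.  A sketch:

    import html

    def bengali_entity_ratio(page_text):
        # Fraction of non-ASCII codepoints (after decoding &#...;
        # references) that fall in the Bengali block, U+0980..U+09FF.
        decoded = html.unescape(page_text)
        non_ascii = [c for c in decoded if ord(c) > 127]
        if not non_ascii:
            return 0.0
        bengali = sum(1 for c in non_ascii if 0x0980 <= ord(c) <= 0x09FF)
        return bengali / len(non_ascii)

A genuinely Bengali-in-entities page should score close to 1; the
pseudo-Unicode pages described above, which use codepoints outside the
block for contextual variants, will score noticeably lower.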

I gave a talk at the ALLC/ACH meeting in June on our search technique,
including its pros and cons.  The abstract was published, but not the
full paper.  I suppose I should post it somewhere...

--
     Mike Maxwell
     Linguistic Data Consortium
     maxwell at ldc.upenn.edu


