[Corpora-List] language-specific harvesting of texts from the Web

Marco Baroni baroni at einstein.sslmit.unibo.it
Tue Aug 31 16:51:46 UTC 2004


> One situation where your approach may not work so well, is when a
> language's websites use multiple character encodings.  Unfortunately,
> this is quite common in languages that have non-Roman writing systems,

At least for Japanese, our way to get around this problem in our
web-mining scripts was to look for the charset declaration in the html
code of each page, and then to convert (inside the script) the page from
that charset to utf8.
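
As a rough illustration of the steps just described (this is not the
actual script, only a minimal sketch in Python), one could look for the
charset label in the page's meta tag and re-encode accordingly. The
function name page_to_utf8, the regular expression, the 4096-byte search
window and the utf-8 fallback are all assumptions made for the example.

```python
import re

# Matches the charset label in either form of declaration:
#   <meta charset="euc-jp">
#   <meta http-equiv="Content-Type" content="text/html; charset=euc-jp">
CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def page_to_utf8(raw: bytes, fallback: str = "utf-8") -> bytes:
    """Re-encode an HTML page as UTF-8, trusting its declared charset.

    `fallback` (an assumption of this sketch) is used when the page has
    no declaration or names a charset Python does not know.
    """
    # The declaration normally sits in <head>, so search only the start.
    match = CHARSET_RE.search(raw[:4096])
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        text = raw.decode(declared, errors="replace")
    except LookupError:
        # Unknown charset label: fall back instead of failing.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

This trusts the declaration in the page itself; it does not look at HTTP
headers, and it cannot help when a page declares the wrong charset.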

I would be interested in hearing about other ways to deal with multiple
encodings.

Btw: I thought Japanese was tough (as you can find euc-jp, shift_jis, utf8
and iso-2022-jp), but the situation you describe for Hindi sounds like a
true encoding nightmare!

> I gave a talk at the ALLC/ACH meeting in June on our search technique,
> including its pros and cons.  The abstract was published, but not the
> full paper.  I suppose I should post it somewhere...

Please do!

Regards,

Marco

--
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni


