[Corpora-List] language-specific harvesting of texts from the Web
Marco Baroni
baroni at einstein.sslmit.unibo.it
Tue Aug 31 16:51:46 UTC 2004
> One situation where your approach may not work so well, is when a
> language's websites use multiple character encodings. Unfortunately,
> this is quite common in languages that have non-Roman writing systems,
At least for Japanese, the way we got around this problem in our
web-mining scripts was to look for the charset declaration in the HTML
code of each page, and then to convert (inside the script) the page from
that charset to UTF-8.
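To make the idea concrete, here is a minimal sketch in Python. It is not
our actual script: the names CHARSET_RE, declared_charset and to_utf8 are
made up for illustration, and it assumes the page is available as raw
bytes and that the declaration sits in a <meta> tag near the top of the
page. Pages with a missing or wrong declaration are not handled beyond a
fallback to UTF-8.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
# The pattern works on bytes because the encoding is not known yet.
CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the page's <meta> tag, lowercased.

    Only the first 4096 bytes are searched (an arbitrary cutoff chosen
    for this sketch); `default` is returned if nothing is found.
    """
    match = CHARSET_RE.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def to_utf8(raw: bytes) -> bytes:
    """Decode the page from its declared charset and re-encode as UTF-8."""
    charset = declared_charset(raw)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows: fall back to UTF-8.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

Undecodable bytes are replaced rather than raising an error, on the
assumption that for corpus building a slightly damaged page is more useful
than a crashed crawl.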
I would be interested in hearing about other ways to deal with multiple
encodings.
Btw: I thought Japanese was tough (as you can find euc-jp, shift_jis, utf-8
and iso-2022-jp), but the situation you describe for Hindi sounds like a
true encoding nightmare!
> I gave a talk at the ALLC/ACH meeting in June on our search technique,
> including its pros and cons. The abstract was published, but not the
> full paper. I suppose I should post it somewhere...
Please do!
Regards,
Marco
--
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni