[Corpora-List] language-specific harvesting of texts from the Web

Stuart A Yeates stuart.yeates at computing-services.oxford.ac.uk
Wed Sep 1 07:55:11 UTC 2004


Marco Baroni wrote:
>>One situation where your approach may not work so well, is when a
>>language's websites use multiple character encodings.  Unfortunately,
>>this is quite common in languages that have non-Roman writing systems,
>
>
> At least for Japanese, our way to get around this problem in our
> web-mining scripts was to look for the charset declaration in the html
> code of each page, and then to convert (inside the script) the page from
> that charset to utf8.
>
> I would be interested in hearing about other ways to deal with multiple
> encodings.
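The approach Marco describes can be sketched in a few lines of Java. This is only an illustration of the idea, not the actual web-mining scripts: the class name, the regex, the 1 KB sniff window, and the Latin-1 fallback are all my own choices.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches e.g. <meta http-equiv="Content-Type"
    //                    content="text/html; charset=EUC-JP">
    private static final Pattern CHARSET_RE =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
                        Pattern.CASE_INSENSITIVE);

    /** Decode raw page bytes using the charset declared in the HTML
        (falling back to Latin-1) and return the text as UTF-8 bytes. */
    public static byte[] toUtf8(byte[] rawPage) {
        // Peek at the start of the page as ASCII-compatible text
        // to find the charset declaration.
        String head = new String(rawPage, 0, Math.min(rawPage.length, 1024),
                                 StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_RE.matcher(head);
        Charset cs = StandardCharsets.ISO_8859_1;  // fallback if undeclared
        if (m.find() && Charset.isSupported(m.group(1))) {
            cs = Charset.forName(m.group(1));
        }
        return new String(rawPage, cs).getBytes(StandardCharsets.UTF_8);
    }
}
```

In practice you would also want to handle pages whose declaration is missing or wrong, which is where statistical guessing (below) comes in.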

textcat (http://odur.let.rug.nl/~vannoord/TextCat/) is a language and
encoding guesser which reliably identifies the language and encoding of a
text based solely on example documents and character statistics. It knows
69 natural languages. Open source.
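textcat is an implementation of Cavnar & Trenkle's character n-gram profile method: rank the most frequent n-grams of a text, then compare that ranking against per-language profiles with an "out-of-place" distance. A condensed sketch of the core idea (class and method names and the profile size are illustrative, not textcat's actual code):

```java
import java.util.*;
import java.util.stream.*;

public class NgramProfile {
    /** Build a rank profile: the most frequent character n-grams
        (n = 1..5) of the text, ranked by frequency. */
    static Map<String, Integer> profile(String text, int top) {
        Map<String, Long> counts = new HashMap<>();
        String padded = "_" + text.replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= 5; n++)
            for (int i = 0; i + n <= padded.length(); i++)
                counts.merge(padded.substring(i, i + n), 1L, Long::sum);
        Map<String, Integer> ranks = new HashMap<>();
        int rank = 0;
        for (Map.Entry<String, Long> e : counts.entrySet().stream()
                 .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                 .limit(top).collect(Collectors.toList()))
            ranks.put(e.getKey(), rank++);
        return ranks;
    }

    /** Out-of-place distance: sum of rank differences, with a maximum
        penalty for n-grams absent from the category profile. The
        category with the smallest distance wins. */
    static int distance(Map<String, Integer> doc, Map<String, Integer> cat) {
        int max = cat.size();
        int d = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer r = cat.get(e.getKey());
            d += (r == null) ? max : Math.abs(r - e.getValue());
        }
        return d;
    }
}
```

Because the profiles are built from raw bytes or characters, the same machinery distinguishes encodings as well as languages: EUC-JP and Shift_JIS text of the same language produce different n-gram profiles.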

I've had good experience using the built-in Java encoding converters
(readers and writers shipped for ~100 encodings as standard) to convert
between encodings. Freely available.
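Concretely, the conversion boils down to wrapping the input and output streams in a reader and writer constructed with the right charsets; the JDK does the rest. A sketch (the helper name and buffer size are my own):

```java
import java.io.*;
import java.nio.charset.Charset;

public class Recode {
    /** Copy a byte stream, converting from one character encoding to
        another, using the readers/writers shipped with the JDK. */
    public static void recode(InputStream in, String fromCharset,
                              OutputStream out, String toCharset)
            throws IOException {
        try (Reader r = new BufferedReader(
                 new InputStreamReader(in, Charset.forName(fromCharset)));
             Writer w = new OutputStreamWriter(out, Charset.forName(toCharset))) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1)
                w.write(buf, 0, n);
        }
    }
}
```

Charset.forName() throws if the named encoding isn't supported, so an unknown charset declaration fails loudly rather than producing mojibake.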

cheers
stuart
--
Stuart Yeates            stuart.yeates at computing-services.oxford.ac.uk
OSS Watch                                  http://www.oss-watch.ac.uk/
Oxford Text Archive                             http://ota.ahds.ac.uk/
Humbul Humanities Hub                         http://www.humbul.ac.uk/
