[Corpora-List] language-specific harvesting of texts from the Web

Mike Maxwell maxwell at ldc.upenn.edu
Tue Aug 31 17:45:10 UTC 2004


Marco Baroni (Hi, Marco!) wrote:
>>One situation where your approach may not work so well, is when a
>>language's websites use multiple character encodings...
>
> At least for Japanese, our way to get around this problem in our
> web-mining scripts was to look for the charset declaration in the html
> code of each page...
> ...
> Btw: I thought Japanese was tough (as you can find euc-jp, shiftjis, utf8
> and iso-2022-jp), but the situation you describe for Hindi sounds like a
> true encoding nightmare!
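
For what it's worth, the charset-sniffing approach Marco describes
boils down to something like the following (a rough Python sketch, not
his actual script; the regular expression and the fallback encoding
are my own illustration):

    import re

    def decode_page(raw_bytes, fallback="euc-jp"):
        # Look for a charset declaration in the raw HTML, e.g.
        # <meta ... content="text/html; charset=shift_jis">
        m = re.search(rb'charset\s*=\s*"?([-\w]+)', raw_bytes, re.I)
        charset = m.group(1).decode("ascii") if m else fallback
        return raw_bytes.decode(charset, errors="replace")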

For the websites we've looked at (mostly from South Asia, particularly
India, plus Eritrea for Tigrinya), there are no charset declarations at
all.  Rather, it's all font-based: there are HTML tags (I forget the
exact syntax right now, but it's something like <font face="foobar.ttf">)
to indicate the font to use.  At some sites, text in one font is nested
inside text in another; at others you get a series of <font...> ...
</font> spans, one after another.  (Again, I can't remember the exact
HTML tag.)
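
To even know what you're dealing with, the first step is to scrape the
declared font names out of each page.  Something like this (just a
sketch; the tag and attribute names are from memory, as above, and the
regular expression is my own):

    import re

    FONT_FACE = re.compile(r'<font[^>]+face\s*=\s*"([^"]+)"', re.IGNORECASE)

    def fonts_used(html_text):
        # Returns the set of declared font names, e.g. {"foobar.ttf"},
        # which then tells you which converter (if any) you need.
        return set(FONT_FACE.findall(html_text))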

So once you know what font a particular site is using, you have to find
an encoding converter that someone else has written (if you're lucky),
or write one yourself.  Since the fonts are often undocumented, this is
non-trivial.

It would be bad enough if there were a 1-to-1 mapping between
proprietary code points and Unicode code points.  There usually isn't.

For Indic languages, these font-based encodings usually do not follow
the same conventions as Unicode.  For example, the short 'i' in many
Indic scripts appears in writing to the left of the consonant after
which it is pronounced.  In Unicode, the short 'i' character comes
after the consonant in the text stream (phonological order), and making
it appear to the left of the consonant is delegated to the rendering
system; in most of the 8-bit encodings, by contrast, it comes before
the consonant in the text stream (visual order).
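
So a converter has to reorder characters, not just substitute them one
for one.  Schematically (the byte values and the two-entry table here
are made up; the reordering of the pre-base vowel sign is the point):

    SHORT_I = "\u093F"                      # DEVANAGARI VOWEL SIGN I
    LEGACY_TO_UNI = {0xA5: SHORT_I,         # hypothetical legacy code points
                     0xB1: "\u0915"}        # 0xB1 -> KA, say

    def to_unicode(legacy_bytes):
        out, pending_i = [], False
        for b in legacy_bytes:
            ch = LEGACY_TO_UNI.get(b, "\ufffd")
            if ch == SHORT_I:
                pending_i = True            # the sign precedes its consonant
            else:
                out.append(ch)              # emit the consonant first...
                if pending_i:
                    out.append(SHORT_I)     # ...then the vowel sign after it
                    pending_i = False
        return "".join(out)

So the legacy byte sequence 0xA5 0xB1 comes out as KA followed by the
vowel sign, i.e. in Unicode's phonological order.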

Also, many of these Indic scripts, as well as the Ethiopic script used
for Amharic and related languages (such as Tigrinya), have more than
256 characters, or at least glyphs.  Some of the extra glyphs are for
conjunct consonants (where two adjacent consonants are written
together, usually in reduced form) or other combined symbols.  Unicode
generally leaves these context-sensitive glyphs to the rendering
system; the 8-bit encodings have what can best be described as
imaginative solutions to the problem.
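
For the Indic case, that is where the one-to-many mappings come in: a
single 8-bit glyph code may have to expand into a whole Unicode
sequence.  The legacy code point below is invented, but the shape of
the problem is real:

    VIRAMA = "\u094D"                         # DEVANAGARI SIGN VIRAMA
    CONJUNCTS = {
        # one glyph code for the "ksha" conjunct -> KA + virama + SSA
        0xC7: "\u0915" + VIRAMA + "\u0937",
    }

    def expand(legacy_bytes):
        return "".join(CONJUNCTS.get(b, "\ufffd") for b in legacy_bytes)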

I might also add that there are some very nice-looking websites for
certain Southeast Asian languages (I forget which ones).  Unfortunately,
looking is all you can do: the web pages are giant GIFs, so there are
no text characters to extract.

>>I gave a talk at the ALLC/ACH meeting in June on our search technique,
>>including its pros and cons.  The abstract was published, but not the
>>full paper.  I suppose I should post it somewhere...
>
> Please do!

I'll see what I can do...

--
	Mike Maxwell
	Linguistic Data Consortium
	maxwell at ldc.upenn.edu


