[Corpora-List] European Constitution in parallel
Joerg Tiedemann
tiedeman at let.rug.nl
Mon Apr 25 10:25:55 UTC 2005
follow-up ....
I just realized that there are some additional problems with character
encodings. According to the information I found, Latvian and Lithuanian
should be supported by ISO-8859-4. However, I ran into serious trouble
when converting from UTF-8 to ISO-8859-4 for these languages. Did the
alphabets change recently, or is the ISO standard just useless?
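A quick way to see which characters break such a conversion is to try
encoding each character separately and count the failures. The sketch
below is only an illustration: the function name and the sample string
are made up here and are not taken from the EUconst data. It relies on
Python's standard "iso8859_4" codec.

```python
from collections import Counter


def unencodable_chars(text, encoding="iso8859_4"):
    """Count the characters in text that the target codec cannot represent."""
    bad = Counter()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad[ch] += 1
    return bad


# Hypothetical sample: Lithuanian letters plus typographic quotes and a dash.
sample = (
    "Lietuvos Respublika \u201e\u0161alis\u201c \u2013 "
    "\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e"
)

for ch, count in unencodable_chars(sample).most_common():
    print(f"U+{ord(ch):04X} {ch!r} x{count}")
```

Running the same function over the real corpus files would show whether
the failures come from letters or from punctuation and other symbols.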
I have now switched the Latvian and Lithuanian texts of the EUconst corpus
to UTF-8 in the CWB index. The output looks good, but querying for
characters with diacritics is difficult. Check:
http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv
Let me know if there is an 8-bit encoding that can be (or is) used for
these two languages.
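To compare candidate 8-bit encodings, one can check which letters of
each alphabet a codec fails to encode. This is a rough sketch under
assumptions: the candidate list (ISO-8859-4, ISO-8859-13, Windows-1257)
and the letter inventories are my own choice, and covering the alphabet
says nothing about quotation marks or other symbols in the corpus.

```python
# Assumed letter inventories (modern alphabets, diacritic letters only).
LATVIAN = "āčēģīķļņšūžĀČĒĢĪĶĻŅŠŪŽ"
LITHUANIAN = "ąčęėįšųūžĄČĘĖĮŠŲŪŽ"

# Candidate codecs, named as in Python's standard codec registry.
CANDIDATES = ["iso8859_4", "iso8859_13", "cp1257"]


def missing(letters, encoding):
    """Return the letters that cannot be encoded in the given codec."""
    out = []
    for ch in letters:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            out.append(ch)
    return "".join(out)


for enc in CANDIDATES:
    for name, letters in (("lv", LATVIAN), ("lt", LITHUANIAN)):
        gaps = missing(letters, enc)
        print(f"{enc:12s} {name}: {gaps or 'all letters covered'}")
```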
Jörg
***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann tiedeman at let.rug.nl **
** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
** 9712 EK Groningen fax: +31 (0)50-363 6855 **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********