[Corpora-List] European Constitution in parallel

Joerg Tiedemann tiedeman at let.rug.nl
Mon Apr 25 10:25:55 UTC 2005


follow-up ....

I just realized that there are some additional problems with character
encodings. According to the information I found, Latvian and Lithuanian
should be supported by ISO-8859-4. However, I ran into serious trouble
when converting these languages from UTF-8 to ISO-8859-4. Did the
alphabet change recently or is the ISO standard just useless?

I have now switched the Latvian and Lithuanian texts from the EUconst
corpus to UTF-8 in the CWB index. They look good, but it is difficult to
query for characters with diacritics. Check:
http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv

Let me know if there is an 8-bit encoding that can be (or is) used for
these two languages.
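
A minimal sketch for narrowing down the conversion trouble, assuming
Python 3 is at hand: it lists the characters of a text that a given
8-bit codec cannot represent. The function name `unencodable`, the
sample line, and the codecs compared (Python's `iso8859_4`,
`iso8859_13` and `cp1257`) are illustrative assumptions, not something
tested on the EUconst files.

```python
def unencodable(text, encoding):
    """Return the sorted distinct characters of `text` that `encoding` cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


# Invented sample line (not taken from the EUconst corpus): Lithuanian
# letters plus low/high quotation marks and an en dash.
sample = "„Ąžuolas“ – ąčęėįšųūž"

# Compare a few candidate 8-bit codecs on the same text.
for codec in ("iso8859_4", "iso8859_13", "cp1257"):
    print(codec, unencodable(sample, codec))
```

If the characters reported for ISO-8859-4 turn out to be punctuation
rather than letters, the alphabet itself may not be the culprit.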


Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **  
**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********


