[Corpora-List] European Constitution in parallel

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Mon Apr 25 11:06:41 UTC 2005

Would it not be possible to make the corpus available in Unicode?

Surely that would be the best solution, especially since you're saving
it in an XML format.

But many thanks for this effort -- what a great resource!

Joerg Tiedemann wrote:

>follow-up ....
>I just realized that there are some additional problems with character
>encodings. Latvian and Lithuanian should be supported by
>ISO-8859-4 according to information I found. However, I got serious
>trouble when converting from UTF-8 to ISO for these languages. Did the
>alphabet change recently or is the ISO standard just useless?
>Now, I changed the Latvian and Lithuanian texts from the EUconst corpus to
>UTF-8 in the CWB index. Looks good but is difficult to query for
>diacritics. Check:
>Let me know if there is an 8-bit code that can be (is) used for these
>2 languages.
>**  Jörg Tiedemann                 tiedeman at let.rug.nl             **
>**  Alfa-Informatica               http://www.let.rug.nl/~tiedeman **
>**  Rijksuniversiteit Groningen     Harmoniegebouw, room 1311-429  **
>**  Oude Kijk in 't Jatstraat 26    phone: +31 (0)50-363 5935      **
>**  9712 EK Groningen               fax:   +31 (0)50-363 6855      **
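
One way to narrow down the conversion trouble described in the quoted
message is to list exactly which characters a given 8-bit encoding cannot
represent. The following is only a minimal sketch in Python, not anything
taken from the EUconst tooling: the function names and the sample string
are hypothetical, and the candidate list simply adds ISO-8859-13 and
Windows-1257 (two other 8-bit encodings used for the Baltic languages)
next to ISO-8859-4. It assumes the text has already been decoded from
UTF-8 into a Python string.

```python
# Sketch: report which characters of a text a given 8-bit codec cannot encode.
# The codec names are Python's standard aliases; the candidate list is an
# assumption for illustration, not something stated in the thread.
CANDIDATE_CODECS = ["iso8859_4", "iso8859_13", "cp1257"]


def unencodable_chars(text, codec):
    """Return the sorted list of distinct characters in `text` that `codec` cannot encode."""
    missing = set()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


def coverage_report(text, codecs=CANDIDATE_CODECS):
    """Map each candidate codec to the characters of `text` it cannot represent."""
    return {codec: unencodable_chars(text, codec) for codec in codecs}


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the UTF-8 corpus file instead.
    sample = "Līgums par Konstitūciju Eiropai \u201eparaugs\u201c"
    for codec, missing in coverage_report(sample).items():
        # Print code points so that invisible or unusual characters are identifiable.
        print(codec, [f"U+{ord(ch):04X}" for ch in missing])
```

An empty list for a codec means every character in the sample can be
stored in that encoding. A non-empty list shows the characters that would
break the conversion, which may turn out to be punctuation rather than
letters.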
