[Corpora-List] European Constitution in parallel
Joerg Tiedemann
tiedeman at let.rug.nl
Mon Apr 25 11:12:01 UTC 2005
sorry for the confusion. the corpus is available in unicode. all xml-files
are in utf8. it's only in the corpus workbench (CWB) index that I tried to use other
encoding standards (CWB doesn't know unicode and utf8 would make it
difficult to type a query with some diacritics).
but everything that you download from
http://logos.uio.no/opus/EUconst.html
is in utf-8
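
for anyone who runs into the same conversion trouble: a minimal sketch of how one could check which characters in a utf-8 text cannot be represented in a given 8-bit encoding. this is only an illustration, not what was done for the CWB index. the codec names are Python's names for ISO-8859-4, ISO-8859-13 and Windows-1257 (the last two are assumptions on my part as candidates, they are not mentioned in this thread), and the sample string is made up.

```python
import unicodedata

# Candidate 8-bit codecs to test. Only ISO-8859-4 is named in this thread;
# iso8859_13 and cp1257 are assumed alternatives for Baltic languages.
CANDIDATES = ["iso8859_4", "iso8859_13", "cp1257"]


def unencodable(text, codec):
    """Return the sorted list of distinct characters in `text`
    that `codec` cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


def report(text):
    """Print, for each candidate codec, the Unicode names of the
    characters that would be lost in a conversion."""
    for codec in CANDIDATES:
        bad = unencodable(text, codec)
        names = [unicodedata.name(ch, "UNKNOWN") for ch in bad]
        print(codec, "ok" if not bad else names)


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the text of one of the
    # utf-8 xml files instead.
    report("Eiropas Savienība \u20ac")
```

if the characters reported are punctuation or symbols rather than letters, the problem is probably not the alphabet itself, and a different 8-bit code may still work.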
joerg
On Mon, 25 Apr 2005, Lou Burnard wrote:
> Would it not be possible to make the corpus available in Unicode?
>
> Surely that would be the best solution, especially since you're saving
> it in an XML format.
>
> But many thanks for this effort -- what a great resource!
>
>
>
> Joerg Tiedemann wrote:
>
> >follow-up ....
> >
> >I just realized that there are some additional problems with character
> >encodings. Latvian and Lithuanian should be supported by
> >ISO-8859-4 according to information I found. However, I got serious
> >trouble when converting from UTF-8 to ISO for these languages. Did the
> >alphabet change recently or is the ISO standard just useless?
> >
> >Now, I changed the Latvian and Lithuanian texts from the EUconst corpus to
> >UTF-8 in the CWB index. Looks good but is difficult to query for
> >diacritics. Check:
> >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
> >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv
> >
> >Let me know if there is an 8-bit code that can be (is) used for these
> >2 languages.
> >
> >
> >Jörg
> >
> >***********/\/\/\/\/\/\/\/\/\/\/\************************************
> >** Jörg Tiedemann tiedeman at let.rug.nl **
> >** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
> >** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
> >** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
> >** 9712 EK Groningen fax: +31 (0)50-363 6855 **
> >*************************************/\/\/\/\/\/\/\/\/\/\/\**********
> >