[Corpora-List] corpora of Arab document (coding UTF-8)

Eric Atwell eric at comp.leeds.ac.uk
Sun Jul 6 22:26:43 UTC 2008


Has anyone downloaded a recent set of United Nations English+Arabic documents
and converted it to a corpus format, e.g. XML or even plain text? I am
specifically interested in parallel English+Arabic versions of legal
documents, e.g. human rights ... I know there has been previous mention on
CORPORA, but I can't find a link to a corpus file I can download in one go.

Thanks for any pointers.

Eric Atwell, Leeds University

cf:

2008/2/27 Alexandre Rafalovitch wrote:

> Official documents of the United Nations are translated (by human
> translators) into 6 languages (English, French, Spanish, Russian,
> Chinese, Arabic). Unfortunately, they are not available as research-ready
> bitexts, but the documents themselves are available from
> http://documents.un.org . There is quite a lot of text there, if one
> is ready to do some non-traditional parsing.
>
> For the last 8-10 years, most of those documents have been available
> in MSWord format. The rest are in PDFs (some with text and some with
> scanned images).
>
> LDC had a very old sample of UN documents; I think that was before
> MSWord versions started to be published, so they had to scan and clean
> their data.
>
> I have more information available, if somebody takes an interest. I am
> doing research in Named Entity Recognition in that domain, but there
> are enough challenges in the corpora to go around.
>
> Regards,
>    Alex.




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


