[Corpora-List] New free multiparallel corpus: United Nations GA resolutions (Arabic, Chinese, English, French, Russian, Spanish)
Alexandre Rafalovitch
arafalov at gmail.com
Sat Aug 29 04:56:47 UTC 2009
Hello,
A new corpus has just been made available at the Machine Translation
Summit XII conference. Some of you might be interested in it as well.
The corpus and related paper are now available from: http://www.uncorpora.org .
Some basic stats:
*) 6 languages, perfectly aligned at the paragraph level: Arabic, Chinese,
English, French, Russian, Spanish
*) ~74000 paragraphs (* 6 languages)
*) ~3M tokens per language
*) Derived from the resolutions of the General Assembly of the United Nations.
*) The corpus is released in TMX (Translation Memory eXchange) form,
ready for processing with Open Source tools like Olifant or by
commercial tools like Trados.
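For those who would rather script against the TMX file directly, here is a minimal Python sketch of reading aligned paragraphs with the standard library. It assumes the usual TMX layout (one <tu> per aligned unit, one <tuv> per language, text inside <seg>). The function name, the language codes and the sample text are invented for illustration and are not taken from the corpus itself, so check them against the actual files.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one dict per <tu>, mapping language code to segment text.

    `source` is a filename or a binary file object. The file is streamed
    with iterparse so a large TMX file does not have to fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # release the unit that has just been processed


# Invented two-language sample, NOT actual corpus content.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="EN" segtype="paragraph"/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>The General Assembly,</seg></tuv>
      <tuv xml:lang="FR"><seg>L'Assemblée générale,</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for unit in iter_translation_units(io.BytesIO(SAMPLE.encode("utf-8"))):
        for lang, text in sorted(unit.items()):
            print(lang, text)
```

To run it on the real corpus, pass the path of the downloaded TMX file to iter_translation_units instead of the in-memory sample.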
With 3 million tokens per language, the corpus is somewhat small to
serve as a primary corpus for Machine Translation research, but it could
be useful as a supplementary one, especially for less-resourced languages
such as Arabic, Chinese, and Russian.
It is also suitable for terminology extraction, named entity
recognition, graph-based analysis techniques, and other approaches
of interest within a restricted-domain corpus.
It is open for any use (with citation). If you do use it and would
like more corpora like this, a letter of appreciation and a description
of your usage scenario could help.
I am happy to field any questions about the corpus in private or public emails.
Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)