Arabic-L:LING:Free Multiparallel corpus UN GA resolutions (including Arabic)

Dilworth Parkinson dil at BYU.EDU
Wed Sep 9 22:43:13 UTC 2009


------------------------------------------------------------------------
Arabic-L: Wed 09 Sep 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
             unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Free Multiparallel corpus UN GA resolutions (including Arabic)

-------------------------Messages-----------------------------------
1)
Date: 09 Sep 2009
From:reposted from CORPORA (arafalov at gmail.com)
Subject:Free Multiparallel corpus UN GA resolutions (including Arabic)

A new corpus has just been made available at the Machine Translation
Summit XII conference. Some of you might be interested in it as well.

The corpus and the related paper are now available from:
http://www.uncorpora.org

Some basic stats:

*) 6 languages, perfectly aligned at the paragraph level: Arabic,
Chinese, English, French, Russian, Spanish
*) ~74,000 paragraphs (in each of the 6 languages)
*) ~3M tokens per language
*) Derived from the resolutions of the General Assembly of the United
Nations.
*) The corpus is released in TMX (Translation Memory eXchange) format,
ready for processing with open-source tools like Olifant or with
commercial tools like Trados.
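
For anyone who would rather read the TMX file from a script than from a
translation-memory tool, here is a minimal Python sketch using only the
standard library. It assumes the generic TMX layout (a <tu> element per
aligned unit, one <tuv> per language, the text inside <seg>); the
function name read_tmx and the inline sample are illustrative and are
not taken from the corpus itself, so check the actual file before
relying on it.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one dict per translation unit, mapping language code -> segment text.

    `source` is a file name or a binary file object (hypothetical helper,
    written for generic TMX rather than for this corpus specifically).
    """
    # iterparse streams the document instead of building the whole tree first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older TMX versions use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also picks up text nested inside inline markup.
                unit[lang] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # drop the children of the unit we have finished with


# Tiny made-up sample in generic TMX form, for demonstration only.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="EN" datatype="plaintext" segtype="paragraph"
          adminlang="EN" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>The General Assembly</seg></tuv>
      <tuv xml:lang="FR"><seg>L'Assemblée générale</seg></tuv>
    </tu>
  </body>
</tmx>
"""

if __name__ == "__main__":
    for unit in read_tmx(io.BytesIO(SAMPLE.encode("utf-8"))):
        for lang, text in sorted(unit.items()):
            print(lang, text)
```

Passing a file name instead of the BytesIO object should work the same
way; the language codes come back exactly as they are written in the
file.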

With roughly 3 million tokens per language, the corpus is somewhat
small to serve as a primary corpus for Machine Translation research,
but it could be useful as a supplementary one, especially for
less-resourced languages such as Arabic, Chinese, and Russian.

It is also suitable for terminology extraction, named entity
recognition, graph-based analysis techniques, and other approaches
that are of interest for a restricted-domain corpus.

It is open for any use (with citation). If you do use it and would
like to see more corpora like this, a letter of appreciation and a
description of your usage scenario could help.

I am happy to field any questions about the corpus in private or  
public emails.

Regards,
    Alex.

--------------------------------------------------------------------------
End of Arabic-L:  09 Sep 2009

