27.2370, FYI: Official Release of the UN Parallel Corpus v1.0

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Fri May 27 16:30:27 UTC 2016


LINGUIST List: Vol-27-2370. Fri May 27 2016. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================


Date: Fri, 27 May 2016 12:30:18
From: Marcin Junczys-Dowmunt [junczys at amu.edu.pl]
Subject: Official Release of the UN Parallel Corpus v1.0

 
Dear all,

I would like to announce the official release of the United Nations Parallel
Corpus v1.0. The corpus was created as part of the United Nations commitment
to multilingualism and as a reaction to the growing importance of statistical
machine translation (SMT) within the Department for General Assembly and
Conference Management (DGACM) translation services and the United Nations. It
covers 25 years, from 1990 to 2014, and contains documents in the six official
languages of the United Nations: Arabic, Chinese, English, French, Russian,
and Spanish.

The purpose of the corpus is to allow access to multilingual language
resources and facilitate research and progress in various natural language
processing tasks, including machine translation. For convenience, the corpus
is also available pre-packaged as bi-texts for each language pair.

A subset of the corpus is available as a six-language fully parallel corpus,
i.e. all sentences have equivalents in all six languages. Data from 2015 have
been used to create official development sets and test sets, also fully
aligned across the six official UN languages. The paper reports SMT baselines
for all language pairs for this corpus.

The corpus is available at:

http://conferences.unite.un.org/UNCorpus

The corresponding publication is available at:

http://www.lrec-conf.org/proceedings/lrec2016/pdf/1195_Paper.pdf

While registering, please leave a short description of the work for which you
plan to use the corpus. In the near future we plan to set up a section with
references to papers that describe research done with the UN corpus. Feel free
to share links and bibliography items with us (either with me or any of the
authors of the above paper).

Marcin Junczys-Dowmunt
 



Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics
                     Translation

Subject Language(s): Arabic, Standard (arb)
                     Chinese, Mandarin (cmn)
                     English (eng)
                     French (fra)
                     Russian (rus)
                     Spanish (spa)





 



------------------------------------------------------------------------------

 

