29.1217, FYI: March 2018 Newsletter - LDC

Fri Mar 16 18:51:36 UTC 2018

LINGUIST List: Vol-29-1217. Fri Mar 16 2018. ISSN: 1069 - 4875.

Subject: 29.1217, FYI: March 2018 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Fri, 16 Mar 2018 14:51:27
From: Membership Office [ldc at ldc.upenn.edu]
Subject: March 2018 Newsletter - LDC

In this newsletter: 

New Publications:
BOLT Arabic Discussion Forums
LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
SPADE (Syntactic Phrase Alignment Dataset for Evaluation)

New publications:

(1) BOLT Arabic Discussion Forums was developed by LDC and consists of 813,080
discussion forum threads in Egyptian Arabic harvested from the Internet using
a combination of manual and automatic processes. The DARPA BOLT (Broad
Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content. The material in this release represents the
unannotated Arabic source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native
speaker annotators. Scouts were instructed to seek content in Egyptian Arabic
that was original, interactive and informal. Upon locating an appropriate
thread, scouts submitted the URL and some simple judgments about it to a
database, via a web browser plug-in. The scale of the collection precluded
manual review of all data. Only a small portion of the threads included in
this release was manually reviewed, so the release may contain some
offensive or otherwise undesired content, as well as some threads with
a large amount of non-Arabic content. Note also that many threads, even
among those that are primarily Arabic, may mix Egyptian Arabic with other
varieties of Arabic.

BOLT Arabic Discussion Forums is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel
Text was developed by LDC and comprises approximately 13 million words
of monolingual Somali text, approximately 800,000 of which are translated into
English. An additional 100,000 words are translated from English into Somali.
The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource languages
in the context of emergent situations like natural disasters or disease
outbreaks. 

Data was collected in the following genres: discussion forums, news,
reference, social network and weblog. Both monolingual text collection and
parallel text creation involved a combination of manual and automatic methods,
which are detailed in the included documentation. All harvested content was
initially converted from its original HTML form into a relatively uniform XML
format. Also included in this release are two tools: one to recreate original
source data from the processed XML material and the other to condition text
data that users download from Twitter.

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text is
distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of
annotated parse trees and phrase alignments for English sentential paraphrases
extracted from machine translation evaluation corpora and separated into
development and test sets.

Reference translations from machine translation evaluation corpora were used
as sentential paraphrases. They were sourced from the following data sets
released by LDC from the NIST (National Institute of Standards and Technology)
open machine translation evaluation series (OpenMT): LDC2010T14, LDC2010T17,
LDC2010T21, and LDC2013T03.

Reference translations of 10 to 30 words were randomly extracted for
annotation from NIST OpenMT corpora. Gold-standard HPSG (head-driven
phrase structure grammar) trees and phrase alignments were annotated,
yielding 20,276 phrases extracted from 201 sentential paraphrases and
15,721 paraphrase alignments.
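
For readers who want to draw a comparable sample from their own reference
translations, the length-filtered random extraction described above might
look roughly like the sketch below. This is not LDC's or the SPADE authors'
procedure: the function name sample_by_length, the whitespace-based word
count, and the fixed seed are all assumptions made for illustration, since
the newsletter does not say how words were counted or how the sampling was
performed.

```python
import random


def sample_by_length(sentences, min_words=10, max_words=30, k=5, seed=0):
    """Randomly pick up to k sentences whose word count is within bounds.

    Words are counted by splitting on whitespace (an assumption; the
    original tokenization is not specified in the newsletter).
    """
    eligible = [
        s for s in sentences
        if min_words <= len(s.split()) <= max_words
    ]
    # A seeded generator makes the draw repeatable across runs.
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))
```

Calling sample_by_length(reference_translations, k=200) would return at most
200 sentences of 10 to 30 whitespace-separated words each, where
reference_translations is a hypothetical list of strings supplied by the
caller.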

SPADE is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
