[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 25 19:45:56 UTC 2014


*Spring 2014 LDC Data Scholarship Recipients*

*2014 Publications Pipeline*

/New publications:/

*GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2*

*King Saud University Arabic Speech Database*

*NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language 
Source*

------------------------------------------------------------------------

*Spring 2014 LDC Data Scholarship Recipients*

LDC is pleased to announce the student recipients of the Spring 2014 LDC 
Data Scholarship program 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>!  
This program provides university students with access to LDC data at 
no cost. Students were asked to complete an application consisting 
of a proposal describing their intended use of the data and a 
letter of support from their thesis adviser. We received many solid 
applications and have chosen two proposals to support. The following 
students will receive no-cost copies of LDC data:

  * Skye Anderson ~ Tulane University (USA), BA candidate, Linguistics. 
    Skye has been awarded a copy of LDC Standard Arabic Morphological
    Analyzer (SAMA) Version 3.1 for her work in author profiling.

  * Hao Liu ~ University College London (UK), PhD candidate, Speech,
    Hearing and Phonetic Sciences. Hao has been awarded copies of
    Switchboard-1 Release 2 and NXT Switchboard Annotations for his
    work in prosody modeling.


*2014 Publications Pipeline*

LDC's planned publications for this year will include:

  * 2009 NIST Language Recognition Evaluation ~  development data from
    VOA broadcast and CTS telephone speech in target and non-target
    languages.

  * ETS Corpus of Non-Native Written English ~ contains 1100 essays
    written for a college-entrance test sampled from eight prompts
    (i.e., topics) with score levels (low/medium/high) for each essay.

  * GALE data ~ including Word Alignment, Broadcast Speech &
    Transcripts, Parallel Text, Parallel Aligned Treebanks in Arabic,
    Chinese, and English.

  * Hispanic Accented English ~ contains approximately 30 hours of
    spontaneous speech and read utterances from non-native speakers of
    English with corresponding transcripts.

  * Multi-Channel Wall Street Journal Audio-Visual Corpus (MC-WSJ-AV) ~ 
    re-recording of parts of the WSJCAM0 using a number of microphones
    as well as three recording conditions resulting in 18-20 channels of
    audio per recording.

  * TAC KBP Reference Knowledge Base ~ TAC KBP aims to develop and
    evaluate technologies for building and populating knowledge bases
    (KBs) about named entities from unstructured text. KBP systems must
    either populate an existing reference KB or build a KB from
    scratch. The reference KB is based on a snapshot of English
    Wikipedia from October 2008 and contains a set of entities,
    each with a canonical name and title for the Wikipedia page, an
    entity type, an automatically parsed version of the data from the
    infobox in the entity's Wikipedia article, and a stripped version of
    the text of the Wikipedia article.

  * USC-SFI MALACH Interviews and Transcripts Czech ~ developed by The
    University of Southern California's Shoah Foundation Institute
    (USC-SFI) and the University of West Bohemia as part of the MALACH
    (Multilingual Access to Large Spoken ArCHives) Project. It contains
    approximately 143 hours of interviews from 420 interviewees along
    with transcripts and other documentation.

Visit LDC's Obtaining Data 
<https://www.ldc.upenn.edu/language-resources/data/obtaining> page for 
information on membership and data licensing.


*New publications*

(1) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 
2 <http://catalog.ldc.upenn.edu/LDC2014T03> was developed by LDC and 
contains 141,058 tokens of word aligned Arabic and English parallel text 
with treebank annotations. This material was used as training data in 
the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological 
and syntactic structures aligned at the sentence level and the 
sub-sentence level. Such data sets are useful for natural language 
processing and related fields, including automatic word alignment system 
training and evaluation, transfer-rule extraction, word sense 
disambiguation, translation lexicon extraction, and cultural heritage and 
cross-linguistic studies. For machine translation system development, 
parallel aligned treebanks may improve system performance through 
enhanced syntactic parsers, better rules and knowledge about language 
pairs, and reduced word error rate.

In this release, the source Arabic data was translated into English. 
Arabic and English treebank annotations were performed independently. 
The parallel texts were then word aligned. The material in this corpus 
corresponds to a portion of the Arabic treebanked data in Arabic 
Treebank - Broadcast News v1.0 (LDC2012T07 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07>).

The source data consists of Arabic broadcast news programming collected 
by LDC in 2007 and 2008. All data is encoded as UTF-8. A count of files, 
words, tokens and segments is below.

Language    Files    Words      Tokens     Segments
Arabic      31       110,690    141,058    7,102

The purpose of the GALE word alignment task was to find correspondences 
between words, phrases or groups of words in a set of parallel texts. 
Arabic-English word alignment annotation consisted of the following tasks:

  * Identifying different types of links: translated (correct or
    incorrect) and not translated (correct or incorrect)
  * Identifying sentence segments not suitable for annotation, e.g.,
    blank segments, incorrectly-segmented segments, segments with
    foreign languages
  * Tagging unmatched words attached to other words or phrases


(2) King Saud University Arabic Speech Database 
<http://catalog.ldc.upenn.edu/LDC2014S02> was developed by King Saud 
University <http://ksu.edu.sa/en/> and contains 590 hours of recorded 
Arabic speech from male and female speakers. The utterances include read 
and spontaneous speech. The recordings were conducted in varied 
environments representing quiet and noisy settings.

The corpus was designed principally for speaker recognition research. 
The speech sources are sentences, word lists, prose and question and 
answer sessions. Read speech text includes the following:

  * Sets of sentences devised to cover allophones of each phoneme,
    phonetic balance, and differentiation of accents.
  * Word lists developed to minimize missing phonemes and to represent
    nasals, fricatives, commonly used words, and numbers.
  * Two paragraphs, one from the Quran and another from a book, selected
    because they included all letters of the alphabet and were easy to read.

Spontaneous speech was captured through question and answer sessions 
between participants and project team members. Speakers responded to 
questions on general topics such as the weather and food.

Each speaker was recorded in three different environments: a soundproof 
room, an office, and a cafeteria. The recordings were collected via 
microphone and mobile phone and averaged between 16 and 19 minutes. The 
data was verified for missing recordings, problems with the recording 
system or errors in the recording process.


(3) NIST 2012 Open Machine Translation (OpenMT) Progress Test Five 
Language Source <http://catalog.ldc.upenn.edu/LDC2014T02> was developed 
by NIST Multimodal Information Group <http://nist.gov/itl/iad/mig/>. 
This release contains the evaluation sets (source data and human 
reference translations), DTD, scoring software, and evaluation plan for 
the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to 
English on a parallel data set. The set is based on a subset of the 
Arabic-to-English and Chinese-to-English progress tests from the OpenMT 
2008, 2009 and 2012 evaluations with new source data created by humans 
based on the English reference translation. The package was compiled, 
and scoring software was developed, at NIST, making use of newswire and 
web data and reference translations developed by the Linguistic Data 
Consortium and the Defense Language Institute Foreign Language Center 
<http://www.dliflc.edu/>.

The objective of the OpenMT evaluation series is to support research in, 
and help advance the state of the art of, machine translation (MT) 
technologies -- technologies that translate text between human 
languages. Input may include all forms of text. The goal is for the 
output to be an adequate and fluent translation of the original. The 
2012 task included the evaluation of five language pairs: 
Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English 
and Korean-to-English in two source data styles. For general information 
about the NIST OpenMT evaluations, refer to the NIST OpenMT website 
<http://www.nist.gov/itl/iad/mig/openmt.cfm>.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that 
may be used to produce a translation quality score for one (or more) MT 
systems. The script works by comparing the system output translation 
with a set of (expert) reference translations of the same source text. 
Comparison is based on finding sequences of words in the reference 
translations that match word sequences in the system output translation.
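
The word-sequence matching described above can be illustrated with a 
minimal Python sketch. This is not the code in mteval-v13a.pl and does 
not reproduce its scores; it only shows the general idea of counting 
word n-grams in a system output that also appear in the reference 
translations. The function names, the whitespace tokenization and the 
example sentences are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the reference where it is most frequent. Tokenization
    is plain whitespace splitting (an assumption, not mteval's rule).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0

    # For each n-gram, keep its highest count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical system output and reference translations.
system_output = "the cat sat on the mat"
reference_set = ["the cat is on the mat", "there is a cat on the mat"]

unigram_precision = clipped_ngram_precision(system_output, reference_set, 1)
bigram_precision = clipped_ngram_precision(system_output, reference_set, 2)
print(unigram_precision, bigram_precision)
```

A full metric would typically combine several n-gram orders and account 
for output length; the sketch stops at a single order to keep the 
matching step visible.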

This release consists of 20 files, four for each of the five languages, 
presented in XML with an included DTD. The four files are source and 
reference data in the following two styles:

  * English-true: an English-oriented translation. This requires that the
    text read well and not use any idiomatic expressions in the foreign
    language to convey meaning, unless absolutely necessary.
  * Foreign-true: a translation as close as possible to the foreign
    language, as if the text had originated in that language.


------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                  ldc at ldc.upenn.edu
