[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Feb 26 21:55:28 UTC 2013
*Spring 2013 LDC Data Scholarship Recipients*

/New publications:/

*GALE Phase 2 Arabic Broadcast Conversation Speech Part 1*
*GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1*
*NIST 2012 Open Machine Translation (OpenMT) Evaluation*
------------------------------------------------------------------------
*Spring 2013 LDC Data Scholarship Recipients*
LDC is pleased to announce the student recipients of the Spring 2013 LDC
Data Scholarship program! This program provides university students
with access to LDC data at no cost. Students were asked to complete an
application consisting of a proposal describing their intended use of
the data and a letter of support from their thesis adviser.
We received many solid applications and have chosen three proposals to
support. The following students will receive no-cost copies of LDC data:
Salima Harrat - Ecole Supérieure d'informatique (ESI) (Algeria).
Salima has been awarded a copy of /Arabic Treebank: Part 3/ for her
work in diacritic restoration.
Maulik C. Madhavi - Dhirubhai Ambani Institute of Information and
Communication Technology (DA-IICT), Gandhinagar (India). Maulik has
been awarded a copy of /Switchboard Cellular Part 1 Transcribed
Audio and Transcripts/ and /1997 HUB4 English Evaluation Speech and
Transcripts/ for his work in spoken term detection.
Shereen M. Oraby - Arab Academy for Science, Technology, and
Maritime Transport (Egypt). Shereen has been awarded a copy of
/Arabic Treebank: Part 1/ for her work in subjectivity and sentiment
analysis.
Please join us in congratulating our student recipients! The next LDC
Data Scholarship program is scheduled for the Fall 2013 semester.
*New publications*
(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S02>
was developed by LDC and comprises approximately 123 hours of
Arabic broadcast conversation speech collected in 2006 and 2007 by LDC
as part of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Broadcast audio for the DARPA GALE program was collected at
LDC's Philadelphia, PA USA facilities and at three remote collection
sites. The combined local and outsourced broadcast collection supported
GALE at a rate of approximately 300 hours per week of programming from
more than 50 broadcast sources for a total of over 30,000 hours of
collected broadcast audio over the life of the program.
LDC's local broadcast collection system is highly automated, easily
extensible and robust, and is capable of collecting, processing and
evaluating hundreds of hours of content from several dozen sources per
day. The broadcast material is served to the system by a set of
free-to-air (FTA) satellite receivers, commercial direct satellite
systems (DSS) such as DirecTV, direct broadcast satellite (DBS)
receivers, and cable television (CATV) feeds. The mapping between
receivers and recorders is dynamic and modular; all signal routing is
performed under computer control, using a 256x64 A/V matrix switch.
Programs are recorded in a high bandwidth A/V format and are then
processed to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the case
of North American English) and to generate automatic speech recognition
(ASR) output.
The broadcast conversation recordings in this release feature
interviews, call-in programs and round table discussions focusing
principally on current events from several sources. This release
contains 143 audio files presented in .wav, 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Arabic speaker following
Audit Procedure Specification Version 2.0 which is included in this
release. The broadcast auditing process served three principal goals: as
a check on the operation of LDC's broadcast collection system equipment
a check on the operation of LDCs broadcast collection system equipment
by identifying failed, incomplete or faulty recordings; as an indicator
of broadcast schedule changes by identifying instances when the
incorrect program was recorded; and as a guide for data selection by
retaining information about a program's genre, data type and topic.
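For readers who want to sanity-check the audio format described above, here is a minimal sketch using Python's standard wave module; the filename is hypothetical and not taken from the release.

    import wave

    # Hypothetical filename for illustration; actual file names in the release differ.
    with wave.open("example_broadcast.wav", "rb") as w:
        assert w.getnchannels() == 1       # single channel
        assert w.getsampwidth() == 2       # 16-bit samples (2 bytes)
        assert w.getframerate() == 16000   # 16000 Hz sampling rate
        duration = w.getnframes() / w.getframerate()
        print(f"OK: {duration:.1f} seconds of 16 kHz mono 16-bit PCM")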
(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T04>
was developed by LDC and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2006 and 2007
by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2
of the DARPA GALE (Global Autonomous Language Exploitation) program. The
source broadcast conversation recordings feature interviews, call-in
programs and round table discussions focusing principally on current
events from several sources.
The transcript files are in plain-text, tab-delimited format (TDF) with
UTF-8 encoding, and the transcribed data totals 752,747 tokens. The
transcripts were created with the LDC-developed transcription tool,
XTrans <http://www.ldc.upenn.edu/tools/XTrans/downloads/>, a
multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings.
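As a rough sketch of how such a tab-delimited transcript file might be loaded (the filename and any header/metadata handling below are assumptions for illustration; the authoritative field definitions are in the release documentation):

    import csv

    def read_tdf(path):
        """Yield rows from a tab-delimited transcript file (TDF),
        assuming UTF-8 encoding as described above. Skipping of empty
        lines is a guess; check the release documentation for the
        actual column layout and any header conventions."""
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                yield row

    # Hypothetical filename for illustration only.
    for row in read_tdf("example_QRTR.tdf"):
        print(row)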
The files in this corpus were transcribed by LDC staff and/or by
transcription vendors under contract to LDC. Transcribers followed LDC's
quick transcription guidelines (QTR) and quick rich transcription
specification (QRTR) both of which are included in the documentation
with this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal
additional mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries and
manual sentence unit annotation to the core components of a quick
transcript. Files with QTR as part of the filename were developed using
QTR transcription. Files with QRTR in the filename indicate QRTR
transcription.
(3) NIST 2012 Open Machine Translation (OpenMT) Evaluation
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T03>
was developed by the NIST Multimodal Information Group
<http://nist.gov/itl/iad/mig/>. This release contains source data,
reference translations and scoring software used in the NIST 2012 OpenMT
evaluation, specifically, for the Chinese-to-English language pair
track. The package was compiled and scoring software was developed at
NIST, making use of Chinese newswire and web data and reference
translations collected and developed by LDC. The objective of the OpenMT
evaluation series is to support research in, and help advance the state
of the art of, machine translation (MT) technologies -- technologies
that translate text between human languages. Input may include all forms
of text. The goal is for the output to be an adequate and fluent
translation of the original.
The 2012 task was to evaluate five language pairs: Arabic-to-English,
Chinese-to-English, Dari-to-English, Farsi-to-English and
Korean-to-English. This release consists of the material used in the
Chinese-to-English language pair track. For more general information
about the NIST OpenMT evaluations, please refer to the NIST OpenMT
website <http://www.nist.gov/itl/iad/mig/openmt.cfm>.
This evaluation kit includes a single Perl script (mteval-v13a.pl) that
may be used to produce a translation quality score for one (or more) MT
systems. The script works by comparing the system output translation
with a set of (expert) reference translations of the same source text.
Comparison is based on finding sequences of words in the reference
translations that match word sequences in the system output translation.
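The scoring script itself ships with the release; purely as an illustrative sketch of the word-sequence matching idea described above (not a reimplementation of mteval-v13a.pl), clipped n-gram matching can be expressed as follows:

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all n-grams of length n in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def clipped_ngram_matches(system_tokens, reference_tokens_list, n):
        """Count system n-grams that also occur in any reference,
        clipping each n-gram's count at its maximum reference count."""
        sys_counts = ngram_counts(system_tokens, n)
        max_ref = Counter()
        for ref in reference_tokens_list:
            for gram, c in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        return sum(min(c, max_ref[gram]) for gram, c in sys_counts.items())

    # Toy example with one system output and two references.
    sys_out = "the cat sat on the mat".split()
    refs = ["the cat is on the mat".split(), "a cat sat on the mat".split()]
    print(clipped_ngram_matches(sys_out, refs, 2))   # number of matching bigrams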
This release contains 222 documents with corresponding source and
reference files, the latter of which contain four independent human
reference translations of the source data. The source data is comprised
of Chinese newswire and web data collected by LDC in 2011. A portion of
the web data concerned the topic of food and was treated as a restricted
domain. The table below displays statistics by source, genre, documents,
segments and source tokens.
Source                       Genre       Documents   Segments   Source Tokens
Chinese General              Newswire           45        400           18184
Chinese General              Web Data           28        420           15181
Chinese Restricted Domain    Web Data          149       2184           48422
The token counts for the Chinese data are character counts, obtained by
counting tokens matching the Unicode-aware regular expression \w using
the Python "re" module.
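A hedged sketch of that counting step (the exact command NIST used is not included in this announcement) might look like this:

    import re

    def count_word_tokens(text):
        """Count tokens matching the \\w character class; with Python 3
        str patterns this is Unicode-aware, so each Chinese character
        counts as one token."""
        return len(re.findall(r"\w", text))

    print(count_word_tokens("这是一个例子。"))  # prints 6; the full stop is not a \w match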
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: 1 (215) 573-1275
University of Pennsylvania          Fax: 1 (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu