<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal" align="center"><b><a href="#scholar">Spring
2013 LDC Data Scholarship Recipients</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"><i>New publications:</i><o:p></o:p></p>
<p class="MsoNormal" align="center"> <b><a href="#gale1">GALE Phase
2 Arabic Broadcast Conversation Speech Part 1</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"> <b><a href="#gale2">GALE Phase
2 Arabic Broadcast Conversation Transcripts - Part 1</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"><b> </b><b><a href="#mt">NIST
2012 Open Machine Translation (OpenMT) Evaluation</a></b></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal" align="center"><a name="scholar"></a><b>Spring
2013 LDC Data Scholarship Recipients</b><o:p></o:p></p>
<p class="MsoNormal">LDC is pleased to announce the student
recipients of the Spring 2013 LDC Data Scholarship program! This
program provides university students with access to LDC data at
no cost. Students were asked to complete an application which
consisted of a proposal describing their intended use of the data,
as well as a letter of support from their thesis adviser. We
received many solid applications and have chosen three proposals
to support. The following students will receive no-cost copies
of LDC data: <o:p></o:p></p>
<blockquote>
<p class="MsoNormal">Salima Harrat - Ecole Supérieure
d’informatique (ESI) (Algeria). Salima has been awarded a copy
of <i>Arabic Treebank: Part 3</i> for her work in
diacritization restoration.<br>
<br>
Maulik C. Madhavi - Dhirubhai Ambani Institute of Information
and Communication Technology (DA-IICT), Gandhinagar (India).
Maulik has been awarded a copy of <i>Switchboard Cellular Part
1 Transcribed Audio and Transcripts</i> and <i>1997 HUB4
English Evaluation Speech and Transcripts</i> for his work in
spoken term detection.<br>
<br>
Shereen M. Oraby - Arab Academy for Science, Technology, and
Maritime Transport (Egypt). Shereen has been awarded a copy of
<i>Arabic Treebank: Part 1</i> for her work in subjectivity and
sentiment analysis. <o:p></o:p></p>
</blockquote>
<p class="MsoNormal">Please join us in congratulating our student
recipients! The next LDC Data Scholarship program is scheduled
for the Fall 2013 semester. <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div align="center"><b>New publications</b><o:p></o:p></div>
<p class="MsoNormal"><a name="gale1"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S02">GALE
Phase
2 Arabic Broadcast Conversation Speech Part 1</a> was developed
by LDC and comprises approximately 123 hours of Arabic
broadcast conversation speech collected in 2006 and 2007 by LDC as
part of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Broadcast audio for the DARPA GALE program was collected
at LDC’s Philadelphia, PA USA facilities and at three remote
collection sites. The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per
week of programming from more than 50 broadcast sources for a
total of over 30,000 hours of collected broadcast audio over the
life of the program.<o:p></o:p></p>
<p class="MsoNormal">LDC's local broadcast collection system is
highly automated, easily extensible and robust, and is capable of
collecting, processing and evaluating hundreds of hours of content
from several dozen sources per day. The broadcast material is
served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as
DirecTV, direct broadcast satellite (DBS) receivers, and cable
television (CATV) feeds. The mapping between receivers and
recorders is dynamic and modular; all signal routing is performed
under computer control, using a 256x64 A/V matrix switch. Programs
are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the
case of North American English) and to generate automatic speech
recognition (ASR) output. <o:p></o:p></p>
<p class="MsoNormal">The broadcast conversation recordings in this
release feature interviews, call-in programs and round table
discussions focusing principally on current events from several
sources. This release contains 143 audio files presented as
16000 Hz, single-channel, 16-bit PCM .wav files. Each file was audited by a
native Arabic speaker following Audit Procedure Specification
Version 2.0, which is included in this release. The broadcast
auditing process served three principal goals: as a check on the
operation of LDC's broadcast collection system equipment by
identifying failed, incomplete or faulty recordings; as an
indicator of broadcast schedule changes by identifying instances
when the incorrect program was recorded; and as a guide for data
selection by retaining information about a program's genre, data
type and topic.<br>
<br>
<o:p></o:p></p>
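<p class="MsoNormal">As a brief illustration of the audio format described
above, the following Python sketch checks that a file matches the stated
16000 Hz, single-channel, 16-bit PCM .wav layout; the file name is
hypothetical and not part of the release.</p>
<pre>
import wave

def check_gale_wav(path):
    """Verify the stated audio format and return the duration in seconds."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000   # 16 kHz sample rate
        assert w.getnchannels() == 1       # single channel
        assert w.getsampwidth() == 2       # 16-bit PCM = 2 bytes per sample
        return w.getnframes() / float(w.getframerate())

print(check_gale_wav("ARABIC_BC_EXAMPLE.wav"))  # hypothetical file name
</pre>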
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="gale2"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T04">GALE
Phase
2 Arabic Broadcast Conversation Transcripts - Part 1</a> was
developed by LDC and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2006
and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco)
during Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) program. The source broadcast conversation
recordings feature interviews, call-in programs and round table
discussions focusing principally on current events from several
sources.<o:p></o:p></p>
<p class="MsoNormal">The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 752,747 tokens. The transcripts were
created with the LDC-developed transcription tool, <a
href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">XTrans</a>,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings. <o:p></o:p></p>
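<p class="MsoNormal">For illustration only, here is a minimal Python sketch
of reading a UTF-8 tab-delimited (TDF) transcript and counting
whitespace-separated tokens. The file name and the index of the transcript
column are assumptions; consult the documentation in the release for the
actual column layout.</p>
<pre>
import csv

def count_tokens(tdf_path, text_column=7):
    """Sum whitespace-separated tokens in one tab-delimited column."""
    total = 0
    with open(tdf_path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip comment/header lines (if present) and short rows.
            if row and not row[0].startswith(";;") and len(row) > text_column:
                total += len(row[text_column].split())
    return total

print(count_tokens("example_transcript.tdf"))  # hypothetical file name
</pre>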
<p class="MsoNormal">The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR as part of the filename were
developed using QTR transcription. Files with QRTR in the filename
indicate QRTR transcription.<o:p></o:p></p>
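<p class="MsoNormal">A small sketch of that naming convention, using a
hypothetical directory and file extension, grouping transcript files by
whether QTR or QRTR appears in the file name.</p>
<pre>
import glob

qtr_files, qrtr_files = [], []
for path in glob.glob("transcripts/*.tdf"):   # hypothetical location
    name = path.upper()
    if "QRTR" in name:
        qrtr_files.append(path)   # quick rich transcription (QRTR)
    elif "QTR" in name:
        qtr_files.append(path)    # quick transcription (QTR)
</pre>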
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="mt"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T03">NIST
2012 Open Machine Translation (OpenMT) Evaluation</a> was
developed by the <a href="http://nist.gov/itl/iad/mig/">NIST
Multimodal Information Group</a>. This release contains source
data, reference translations and scoring software used in the NIST
2012 OpenMT evaluation, specifically, for the Chinese-to-English
language pair track. The package was compiled and scoring software
was developed at NIST, making use of Chinese newswire and web data
and reference translations collected and developed by LDC. The
objective of the OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation
(MT) technologies -- technologies that translate text between
human languages. Input may include all forms of text. The goal is
for the output to be an adequate and fluent translation of the
original. <o:p></o:p></p>
<p class="MsoNormal">The 2012 task was to evaluate five language
pairs: Arabic-to-English, Chinese-to-English, Dari-to-English,
Farsi-to-English and Korean-to-English. This release consists of
the material used in the Chinese-to-English language pair track.
For more general information about the NIST OpenMT evaluations,
please refer to the <a
href="http://www.nist.gov/itl/iad/mig/openmt.cfm">NIST OpenMT
website</a>.<o:p></o:p></p>
<p class="MsoNormal">This evaluation kit includes a single Perl
script (mteval-v13a.pl) that may be used to produce a translation
quality score for one (or more) MT systems. The script works by
comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations
that match word sequences in the system output translation.<o:p></o:p></p>
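<p class="MsoNormal">The sketch below is only a toy illustration of that
matching idea (counting system-output word sequences that also occur in a
reference translation); it is not a substitute for the included
mteval-v13a.pl script, which performs the actual scoring.</p>
<pre>
from collections import Counter

def ngram_matches(system, reference, n=2):
    """Count system n-grams that also appear in the reference (clipped)."""
    sys_tokens, ref_tokens = system.split(), reference.split()
    sys_ngrams = Counter(zip(*[sys_tokens[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    matched = sum(min(c, ref_ngrams[g]) for g, c in sys_ngrams.items())
    return matched, max(sum(sys_ngrams.values()), 1)

m, total = ngram_matches("the cat sat on the mat", "the cat is on the mat")
print(m / float(total))   # fraction of system bigrams found in the reference
</pre>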
<p class="MsoNormal">This release contains 222 documents with
corresponding source and reference files, the latter of which
contain four independent human reference translations of the
source data. The source data consists of Chinese newswire and
web data collected by LDC in 2011. A portion of the web data
concerned the topic of food and was treated as a restricted
domain. The table below displays statistics by source, genre,
documents, segments and source tokens.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="0" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Source</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Genre</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Documents</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Segments</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Source Tokens</b><b><o:p></o:p></b></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese General<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Newswire<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">45<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">400<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">18184<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese General<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">28<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">420<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">15181<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese Restricted Domain<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">149<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2184<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">48422<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">The token counts for the Chinese data are
"character" counts, obtained by counting tokens that match the
Unicode-based regular expression "\w" using the Python "re"
module.<o:p></o:p></p>
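<p class="MsoNormal">For example, the count described above can be
reproduced in outline as follows; the sample string is illustrative only
(the escapes spell a four-character Chinese word), and each CJK character,
Latin letter and digit matched by \w counts as one token.</p>
<pre>
import re

def count_source_tokens(text):
    """Count Unicode word characters, one token per match of \\w."""
    return len(re.findall(r"\w", text, flags=re.UNICODE))

# "\u673a\u5668\u7ffb\u8bd1" is a four-character Chinese word, so the call
# below returns 4 + 10 + 4 = 18 tokens for the sample string.
print(count_source_tokens(u"\u673a\u5668\u7ffb\u8bd1 evaluation 2012"))
</pre>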
<p class="MsoNormal"><br>
</p>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
<br>
</body>
</html>