<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal" align="center"><b><br>
</b><b> </b><i>New publications:</i><br>
<br>
<b>- </b><b> <a href="#speech">GALE Phase 2 Chinese Broadcast
Conversation Speech</a></b><b> -<br>
</b><b> </b><b><br>
</b><b> - </b> <b><a href="#transcripts">GALE Phase 2 Chinese
Broadcast Conversation Transcripts</a></b><b> -<br>
</b><b> </b><b><br>
</b><b> - </b> <b><a href="#openmt">NIST 2008-2012 Open Machine
Translation (OpenMT) Progress Test Sets</a></b>
-<br>
<br>
<b></b><o:p></o:p></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<br>
<p class="MsoNormal" align="center"><b>New publications</b><br>
</p>
<p class="MsoNormal"><br>
<o:p></o:p></p>
<p class="MsoNormal"><a name="speech"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013S04">GALE
Phase
2 Chinese Broadcast Conversation Speech</a> (LDC2013S04) was
developed by LDC and comprises approximately 120 hours of
Chinese broadcast conversation speech collected in 2006 and 2007
by LDC and the Hong Kong University of Science and Technology
(HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. <o:p></o:p></p>
<p class="MsoNormal">Corresponding transcripts are released as GALE
Phase 2 Chinese Broadcast Conversation Transcripts (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T08">LDC2013T08</a>).<o:p></o:p></p>
<p class="MsoNormal">Broadcast audio for the GALE program was
collected at the LDC facilities in Philadelphia, PA, USA, and at
three remote collection sites: HKUST (Chinese); Medianet, Tunis,
Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined
local and outsourced broadcast collection supported GALE at a rate
of approximately 300 hours per week of programming from more than
50 broadcast sources, for a total of more than 30,000 hours of
collected broadcast audio over the life of the program.<o:p></o:p></p>
<p class="MsoNormal">The broadcast conversation recordings in this
release feature interviews, call-in programs and roundtable
discussions focusing principally on current events from the
following sources: Anhui TV, a regional television station in
Mainland China, Anhui Province; China Central TV (CCTV), a
national and international broadcaster in Mainland China; Hubei
TV, a regional broadcaster in Mainland China, Hubei Province; and
Phoenix TV, a Hong Kong-based satellite television station. A
table showing the number of programs and hours recorded from each
source is contained in the readme file. <o:p></o:p></p>
<p class="MsoNormal">This release contains 202 audio files presented
in Waveform Audio File format (.wav), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Chinese speaker
following Audit Procedure Specification Version 2.0, which is
included in this release. The broadcast auditing process served
three principal goals: as a check on the operation of the
broadcast collection system equipment by identifying failed,
incomplete or faulty recordings; as an indicator of broadcast
schedule changes by identifying instances when the incorrect
program was recorded; and as a guide for data selection by
retaining information about the genre, data type and topic of a
program. <o:p></o:p></p>
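<p class="MsoNormal">As a minimal sketch, the stated audio format
(16000 Hz, single-channel, 16-bit PCM) can be checked against a file's
WAV header with Python's standard <code>wave</code> module. The
function name <code>check_wav</code> and the constant names are
hypothetical and are not part of the release; the check reads the
header only and does not validate the audio content.</p>
<pre>
```python
import wave

# Expected parameters, taken from the release description above.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def check_wav(path):
    """Return a list of mismatches between a WAV header and the expected format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
    return problems
```
</pre>
<p class="MsoNormal">An empty list means the header matches the
description; any other result names the fields that differ.</p>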
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><br>
<a name="transcripts"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T08">GALE
Phase
2 Chinese Broadcast Conversation Transcripts</a> (LDC2013T08)
was developed by LDC and contains transcriptions of approximately
120 hours of Chinese broadcast conversation speech collected in
2006 and 2007 by LDC and the Hong Kong University of Science and
Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. <o:p></o:p></p>
<p class="MsoNormal">Corresponding audio data is released as GALE
Phase 2 Chinese Broadcast Conversation Speech (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S04">LDC2013S04</a>).<o:p></o:p></p>
<p class="MsoNormal">The source broadcast conversation recordings
feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources:
Anhui TV, a regional television station in Mainland China, Anhui
Province; China Central TV (CCTV), a national and international
broadcaster in Mainland China; Hubei TV, a regional broadcaster in
Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based
satellite television station.<o:p></o:p></p>
<p class="MsoNormal">The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 1,523,373 tokens. The transcripts were
created with the LDC-developed transcription tool, <a
href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">XTrans</a>,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings. <o:p></o:p></p>
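<p class="MsoNormal">Because the transcripts are tab-delimited UTF-8
text, they can be read with Python's standard <code>csv</code> module.
The sketch below makes two assumptions that should be checked against
the documentation in the release: that lines beginning with
<code>;;</code> are comments, and that fields are not quoted. It does
not assume any column names, so a header line, if present, is returned
as an ordinary row. The function name <code>read_tdf</code> is
hypothetical.</p>
<pre>
```python
import csv


def read_tdf(path):
    """Yield each data row of a tab-delimited transcript file as a list of fields."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript text as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumption: blank lines carry no data and ";;" starts a comment line.
            if not row or row[0].startswith(";;"):
                continue
            yield row
```
</pre>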
<p class="MsoNormal">The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC’s quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR in the filename were developed
using QTR transcription; files with QRTR in the filename were
developed using QRTR transcription.<o:p></o:p></p>
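<p class="MsoNormal">The filename convention can be applied
programmatically. The following is a small sketch, assuming the marker
may appear in either upper or lower case; the function name
<code>transcription_style</code> and the example filenames in the note
below are hypothetical.</p>
<pre>
```python
import re

# "QTR" is not a substring of "QRTR", so the two markers cannot be confused.
_MARKER = re.compile(r"QRTR|QTR", re.IGNORECASE)


def transcription_style(filename):
    """Return "QRTR", "QTR", or None according to the marker in a filename."""
    match = _MARKER.search(filename)
    return match.group(0).upper() if match else None
```
</pre>
<p class="MsoNormal">For example,
<code>transcription_style("show.qrtr.tdf")</code> returns
<code>"QRTR"</code>, and a name with neither marker returns
<code>None</code>. A filename that contains these letters for an
unrelated reason would also match, so results are worth
spot-checking.</p>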
<p class="MsoNormal"><br>
<o:p></o:p></p>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="openmt"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T07">NIST
2008-2012
Open Machine Translation (OpenMT) Progress Test Sets</a>
(LDC2013T07) was developed by the <a
href="http://nist.gov/itl/iad/mig/">NIST Multimodal Information
Group</a>. This release contains the evaluation sets (source
data and human reference translations), DTD, scoring software, and
evaluation plans for the Arabic-to-English and Chinese-to-English
progress test sets for the NIST OpenMT 2008, 2009, and 2012
evaluations. The test data remained unseen between evaluations and
was reused unchanged each time. NIST compiled the package and
developed the scoring software, using Chinese and Arabic newswire
and web data and reference translations collected and developed
by LDC. <o:p></o:p></p>
<p class="MsoNormal">The objective of the OpenMT evaluation series
is to support research in, and help advance the state of the art
of, machine translation (MT) technologies, that is, technologies that
translate text between human languages. Input may include all
forms of text. The goal is for the output to be an adequate and
fluent translation of the original. <o:p></o:p></p>
<p class="MsoNormal">The MT evaluation series started in 2001 as
part of the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) program. Beginning with the 2006 evaluation, the
evaluations have been driven and coordinated by NIST as NIST
OpenMT. These evaluations provide an important contribution to the
direction of research efforts and the calibration of technical
capabilities in MT. The OpenMT evaluations are intended to be of
interest to all researchers working on the general problem of
automatic translation between human languages. To this end, they
are designed to be simple, to focus on core technology issues and
to be fully supported. For more general information about the NIST
OpenMT evaluations, please refer to the <a
href="http://www.nist.gov/itl/iad/mig/openmt.cfm">NIST OpenMT
website</a>.<o:p></o:p></p>
<p class="MsoNormal">This evaluation kit includes a single Perl
script (mteval-v13a.pl) that may be used to produce a translation
quality score for one or more MT systems. The script works by
comparing the system output translation with a set of expert
reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations
that match word sequences in the system output translation.<o:p></o:p></p>
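<p class="MsoNormal">The idea of matching word sequences against
several references can be illustrated with clipped n-gram counting, as
used in BLEU-style metrics. This is an illustrative sketch only and is
not the implementation in mteval-v13a.pl; tokenization, score
combination and any length penalty are out of scope, and the function
names are hypothetical.</p>
<pre>
```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis, n)
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())
```
</pre>
<p class="MsoNormal">Dividing the matched count by the total gives an
n-gram precision for that value of n; with four references per source
document, a system output is credited for agreeing with any one of
them.</p>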
<p class="MsoNormal">This release contains 257 documents (2,748
segments) with corresponding source and reference files; the
reference files contain four independent human reference
translations of the source data. The source data consists of Arabic
and Chinese newswire and web data collected by LDC in 2007. The
table below shows the number of documents, segments and source
tokens for each source language and genre.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="0" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Source<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Genre<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Documents<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Source Tokens<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Newswire<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">84<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">784<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">20039<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">51<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">594<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">14793<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Newswire<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">82<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">688<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">26923<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:4;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">40<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">682<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">19112<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
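<p class="MsoNormal">The column totals are not listed in the table;
the short sketch below sums them from the four rows as printed above.
The names <code>ROWS</code> and <code>totals</code> are
hypothetical.</p>
<pre>
```python
# Rows copied from the table above: (source, genre, documents, segments, source tokens).
ROWS = [
    ("Arabic", "Newswire", 84, 784, 20039),
    ("Arabic", "Web Data", 51, 594, 14793),
    ("Chinese", "Newswire", 82, 688, 26923),
    ("Chinese", "Web Data", 40, 682, 19112),
]


def totals(rows):
    """Sum the documents, segments, and source-token columns."""
    documents = sum(row[2] for row in rows)
    segments = sum(row[3] for row in rows)
    tokens = sum(row[4] for row in rows)
    return documents, segments, tokens


print(totals(ROWS))  # prints (257, 2748, 80867)
```
</pre>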
<p class="MsoNormal"><o:p> </o:p><br>
</p>
<br>
<div style="mso-element:comment-list"><br>
<hr size="2" width="100%"></div>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>