<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div class="moz-text-html" lang="x-western">
<div align="center">LDC2009T05<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05">2008
NIST Metrics for Machine Translation (MetricsMATR08)
Development Data</a> -</b><br>
<br>
LDC2009T06<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06">GALE
Phase 1 Chinese Broadcast Conversation Parallel Text - Part
2</a> -</b><br>
<br>
LDC2009T07<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07">Unified
Linguistic Annotation Text Collection</a> -</b><br>
<br>
<b class="moz-txt-star">- Additional Free LDC Resources -</b><br>
</div>
<b class="moz-txt-star"><br>
</b>
<div align="center">The Linguistic Data
Consortium (LDC) would
like to announce the availability of
three new publications and highlight free LDC resources.</div>
<br>
<hr size="2" width="100%">
<div align="center"><b>New Publications</b><br>
</div>
<p align="left">(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05">2008
NIST Metrics for Machine Translation (MetricsMATR08)
Development Data</a> contains data, reference translations, and
software
used for <a href="http://www.nist.gov/speech/tests/metricsmatr/">NIST
MetricsMATR</a>. NIST MetricsMATR is a series of research challenge
events for machine
translation (MT) metrology, promoting the development of innovative,
even revolutionary, MT metrics that correlate highly with human
assessments of MT quality. In this program, participants submit their
metrics to the <a href="http://www.nist.gov">National Institute of
Standards and Technology (NIST)</a>. NIST runs those metrics on certain
held-back test data for which it has human assessments measuring
quality and then calculates correlations between the automatic metric
scores and the human assessments. </p>
<p>In the <a href="http://www.nist.gov/speech/tests/metricsmatr/2008/">NIST
Metrics for Machine Translation 2008 Evaluation (MetricsMATR08)</a>,
participants received as development data a subset of the materials
used in the <a href="http://nist.gov/speech/tests/mt/2006/">NIST Open
MT06 evaluation</a>, specifically, human reference translations, system
translations, and human assessments of adequacy and preference. The
source data comprised twenty-five Arabic-language newswire
documents with a total of 249 segments. The data in each segment
consisted of four human reference translations in English and system
translations from eight different MT06 machine translation systems. In
addition to the data and reference translations, this release includes
software tools for evaluation and reporting, as well as documentation
describing how the human assessments were obtained and how they are
represented in the data. The <a
href="http://www.nist.gov/speech/tests/metricsmatr/2008/doc/mm08_evalplan_v1.1.pdf">evaluation
plan</a> contains further information and rules on the use of this
data. </p>
<p>The MetricsMATR program seeks to overcome several drawbacks of the
methods currently employed for the evaluation of MT technology. Automatic
metrics have not yet proved able to predict the usefulness
and reliability of MT technologies with confidence, nor have they
been shown to be meaningful for target languages other
than English. Human assessments, however, are expensive, slow,
subjective and difficult to standardize. These problems, and the need
to overcome them through the development of improved automatic (or even
semi-automatic) metrics, have been a constant point of discussion at
past NIST MT evaluation events. MetricsMATR aims to provide a platform
to address these shortcomings.</p>
<br>
<br>
<div align="center"><b>*</b><br>
</div>
<br>
<p>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06">GALE
Phase 1 Chinese Broadcast Conversation Parallel Text - Part
2</a> contains transcripts and English translations of 24 hours of
Chinese
broadcast conversation programming from China Central TV (CCTV),
Phoenix TV and Voice of America (VOA). It does not contain the audio
files from which the transcripts and translations were generated. This
release, along with other corpora, was used as training data in Phase 1
(year 1) of the DARPA-funded GALE program. <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE
Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a> was
released in January 2009.</p>
<p>A manual selection procedure was used to choose data appropriate for
the GALE program, namely, conversation (talk) programs focusing on
current events. Stories on topics such as sports, entertainment and
business were excluded from the data set. <br>
</p>
<p>The selected audio snippets were carefully transcribed by LDC
annotators and
professional transcription agencies following LDC's Quick Rich
Transcription
specification. Manual sentence unit/segment (SU) annotation was also
performed as part of the transcription task. Three types of
end-of-sentence SU
were identified: statement SU, question SU, and incomplete SU.</p>
<p>After transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for
careful translation. Translators followed LDC's GALE Translation
guidelines,
which describe the makeup of the translation team, the source data
format, the
translation data format, best practices for translating certain
linguistic
features (such as names and speech disfluencies) and quality control
procedures
applied to completed translations.</p>
<br>
<br>
<div align="center"><b>*</b><br>
</div>
<b><br>
</b>
(3) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07">Unified
Linguistic Annotation Text Collection</a> consists of two datasets:
the Language Understanding
Annotation Corpus (LDC2009T10) and REFLEX Entity Translation
Training/DevTest (LDC2009T11). Most recent annotation efforts for language
have focused on
small pieces of the larger problem of semantic annotation rather than
producing a single unified representation. The Unified Linguistic
Annotation (ULA) project, sponsored by the <a
href="http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0551615">National
Science Foundation</a>, seeks to integrate into one framework different
layers of annotation (e.g., semantics, discourse, temporal, opinions)
using various existing resources, including <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">PropBank</a>,
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23">NomBank</a>,
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank</a>,
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05">Penn
Discourse Treebank</a>, and coreference and opinion annotations. The
project represents a concerted effort by researchers from several
institutions to develop a large word corpus with balanced and annotated
data. The Unified
Linguistic Annotation Text Collection is provided as a resource for
the ULA
effort. It consists of two datasets:<br>
<br>
<ul>
<li><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10">The
Language Understanding Annotation Corpus</a> (LDC2009T10). The Language
Understanding Annotation Corpus was developed at the <a
href="http://web.jhu.edu/hltcoe">Johns Hopkins Center of Excellence in
Human Language Technology</a>. It consists of over 9000 words of
English
text (6949 words) and Arabic text (2183 words) annotated for committed
belief, event and entity, coreference, dialog acts and temporal
relations. The materials were chosen from various sources to represent
"informal input," that is, text that contains colloquial forms. The
documents in the corpus include excerpts from newswire stories,
telephone conversation transcripts, emails, contracts and written
instructions.</li>
</ul>
<ul>
<li><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T11">REFLEX
Entity Translation Training/DevTest</a> (LDC2009T11). REFLEX Entity
Translation Training/DevTest is the complete set of training data and
development test data for the <a
href="http://www.nist.gov/speech/tests/ace/2007/">2007 REFLEX Entity
Translation evaluation</a> sponsored by the National Institute of
Standards and Technology (NIST). It contains approximately 67.5K words
of newswire and weblog text for each of English, Chinese and Arabic (or
approximately 22.5K words in each language) translated into each of the
other two languages. The data is annotated for entities and TIMEX2
extents and normalization. </li>
</ul>
<br>
Researchers may
license this data by completing the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User Agreement for Non-members</a>. The agreement can be faxed to +1
215 573 2175 or scanned and emailed to this address. Please indicate
on the license whether you are requesting the entire collection
(LDC2009T07) or just one dataset (LDC2009T10 or LDC2009T11). The
collection is
being made available at no charge.<br>
<br>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>Additional
Free LDC Resources<br>
</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt;">LDC is pleased to
distribute the Unified Linguistic Annotation Text Collection
(LDC2009T07) corpora at no cost to
support the work of the ULA project. As mentioned above, to
license a copy of this data, non-members should
complete the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User Agreement for Non-members</a> and fax to +1
215 573 2175 or scan and email to this address. <span
class="moz-txt-star">On the
heels of the release of </span>the ULA corpora, LDC would like to
highlight other resources which are
available at no cost. Free grant-covered copies of the following <a
href="http://www.talkbank.org/">Talkbank</a> databases can be licensed
from
LDC:</p>
<ul type="disc">
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01">LDC2003V01
FORM2 Kinematic Gesture</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01">LDC2003L01
Grassfields Bantu Fieldwork: Dschang Lexicon</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02">LDC2003S02
Grassfields Bantu Fieldwork: Dschang Tone Paradigms</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16">LDC2001S16
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01">LDC2004L01
Klex: Finite-State Lexical Transducer for Korean</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03">LDC2004T03
Morphologically Annotated Korean Text</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06">LDC2003S06
Santa Barbara Corpus of Spoken American English Part II</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10">LDC2004S10
Santa Barbara Corpus of Spoken American English Part III</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25">LDC2005S25
Santa Barbara Corpus of Spoken American English Part IV</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15">LDC2003T15
SLX Corpus of Classic Sociolinguistic Interviews</a></li>
<li class="MsoNormal" style=""><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12">LDC2004S12
TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls</a></li>
</ul>
<p class="MsoNormal" style="margin-bottom: 12pt;">A US$30 shipping and
handling fee applies for data on disc. Further information,
including
additional free datasets such as <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank
1.2</a>, and useful tools such as LDC's parallel text sentence aligner,
<a href="http://sourceforge.net/projects/champollion/">Champollion</a>,
can be
found in our <a href="http://www.ldc.upenn.edu/About/whatsnew.shtml">What's
New! What's Free! Archive</a>.</p>
<span class="moz-txt-star"><br>
</span>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</div>
</body>
</html>