<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div class="moz-text-html" lang="x-western">

<div align="center">LDC2009T05<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05">2008

NIST Metrics for Machine Translation (MetricsMATR08)

Development Data</a>  -</b><br>

<br>

LDC2009T06<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06">GALE

Phase 1 Chinese Broadcast Conversation Parallel Text - Part

2</a>  -</b><br>

<br>

LDC2009T07<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07">Unified

Linguistic Annotation Text Collection</a>  -</b><br>

<br>

<b class="moz-txt-star">-  Additional Free LDC Resources  -</b><br>

</div>

<b class="moz-txt-star"><br>

</b>

<div align="center">The Linguistic Data

Consortium (LDC) would

like to announce the availability of

three new publications and highlight free LDC resources.</div>

<br>

<hr size="2" width="100%">

<div align="center"><b>New Publications</b><br>

</div>


<p align="left">(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05">2008

NIST Metrics for Machine Translation (MetricsMATR08)

Development Data</a> contains data, reference translations, and

software

used for <a href="http://www.nist.gov/speech/tests/metricsmatr/">NIST

MetricsMATR</a>.  NIST MetricsMATR is a series of research challenge

events for machine

translation (MT) metrology, promoting the development of innovative,

even revolutionary, MT metrics that correlate highly with human

assessments of MT quality. In this program, participants submit their

metrics to the <a href="http://www.nist.gov">National Institute of

Standards and Technology (NIST)</a>. NIST runs those metrics on certain

held-back test data for which it has human assessments measuring

quality and then calculates correlations between the automatic metric

scores and the human assessments. </p>
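<p>The correlation step described above can be sketched in a few lines of
Python. This is only an illustration: the score lists are invented, the
function name <code>pearson</code> is our own, and Pearson's coefficient is
assumed here as one common choice of statistic, since the description above
does not say which correlation measures NIST applies.</p>

<pre>
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if not xs or len(xs) != len(ys):
        raise ValueError("need two non-empty lists of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-segment scores, invented purely for illustration.
metric_scores = [0.31, 0.45, 0.52, 0.38, 0.60]   # automatic metric output
human_adequacy = [3, 4, 5, 3, 6]                 # human adequacy judgments

print(round(pearson(metric_scores, human_adequacy), 3))
```
</pre>

<p>A value near 1 would indicate that the hypothetical metric ranks segments
much as the human assessors do; a value near 0 would indicate little
agreement.</p>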

<p>In the <a href="http://www.nist.gov/speech/tests/metricsmatr/2008/">NIST

Metrics for Machine Translation 2008 Evaluation (MetricsMATR08)</a>,

participants received as development data a subset of the materials

used in the <a href="http://nist.gov/speech/tests/mt/2006/">NIST Open

MT06 evaluation</a>, specifically, human reference translations, system

translations, and human assessments of adequacy and preference. The

source data comprised twenty-five Arabic-language newswire

documents with a total of 249 segments. The data in each segment

consisted of four human reference translations in English and system

translations from eight different MT06 machine translation systems. In

addition to the data and reference translations, this release includes

software tools for evaluation and reporting and documentation

describing how the human assessments were obtained and how they are

represented in the data. The <a

 href="http://www.nist.gov/speech/tests/metricsmatr/2008/doc/mm08_evalplan_v1.1.pdf">evaluation

plan</a> contains further information and rules on the use of this

data. </p>

<p>The MetricsMATR program seeks to overcome several drawbacks of the

methods currently employed for the evaluation of MT technology.

Automatic metrics have not yet proved able to predict the usefulness

and reliability of MT technologies with confidence. Nor have automatic

metrics demonstrated that they are meaningful in target languages other

than English. Human assessments, however, are expensive, slow,

subjective and difficult to standardize. These problems, and the need

to overcome them through the development of improved automatic (or even

semi-automatic) metrics, have been a constant point of discussion at

past NIST MT evaluation events. MetricsMATR aims to provide a platform

to address these shortcomings. <br>

</p>

<br>

<br>

<div align="center"><b>*</b><br>

</div>

<br>

<p>(2) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06">GALE

Phase 1 Chinese Broadcast Conversation Parallel Text - Part

2</a> contains transcripts and English translations of 24 hours of

Chinese

broadcast conversation programming from China Central TV (CCTV),

Phoenix TV and Voice of America (VOA). It does not contain the audio

files from which the transcripts and translations were generated. This

release, along with other corpora, was used as training data in Phase 1

(year 1) of the DARPA-funded GALE program. <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02">GALE

Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1</a> was

released in January 2009.</p>

<p>A manual selection procedure was used to choose data appropriate for

the GALE program, namely, conversation (talk) programs focusing on

current events. Stories on topics such as sports, entertainment and

business were excluded from the data set. <br>

</p>

<p>The selected audio snippets were carefully transcribed by LDC

annotators and

professional transcription agencies following LDC's Quick Rich

Transcription

specification. Manual sentence unit/segment (SU) annotation was also

performed as part of the transcription task. Three types of end-of-sentence

SU were identified: statement SU, question SU, and incomplete SU.</p>

After transcription and SU annotation, files were reformatted into a

human-readable translation format and assigned to professional

translators for

careful translation. Translators followed LDC's GALE Translation

guidelines,

which describe the makeup of the translation team, the source data

format, the

translation data format, best practices for translating certain

linguistic

features (such as names and speech disfluencies) and quality control

procedures

applied to completed translations.<br>

<br>

<br>

<div align="center"><b>*</b><br>

</div>

<b><br>

</b>

(3) The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07">Unified

Linguistic Annotation Text Collection</a> consists of two datasets: 

the Language Understanding

Annotation Corpus (LDC2009T10) and REFLEX Entity Translation

Training/DevTest (LDC2009T11).  Most recent annotation efforts for language

have focused on

small pieces of the larger problem of semantic annotation rather than

producing a single unified representation. The Unified Linguistic

Annotation (ULA) project, sponsored by the <a

 href="http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0551615">National

Science Foundation</a>, seeks to integrate into one framework different

layers of annotation (e.g., semantics, discourse, temporal, opinions)

using various existing resources, including <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">PropBank</a>,

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23">NomBank</a>,

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank</a>,

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05">Penn

Discourse Treebank</a> and coreference and opinion annotations. The

project represents a concerted effort of researchers from several

institutions to develop a large word corpus with balanced and annotated

data. The Unified

Linguistic Annotation Text Collection is provided as a resource for

the ULA

effort. The two datasets are as follows:<br>

<br>

<ul>

  <li><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10">The

Language Understanding Annotation Corpus</a> (LDC2009T10). The Language

Understanding Annotation Corpus was developed at the <a

 href="http://web.jhu.edu/hltcoe">Johns Hopkins Center of Excellence in

Human Language Technology</a>.  It consists of over 9000 words of

text, in English (6949 words) and Arabic (2183 words), annotated for committed

belief, event and entity, coreference, dialog acts and temporal

relations. The materials were chosen from various sources to represent

"informal input," that is, text that contains colloquial forms. The

documents in the corpus include excerpts from newswire stories,

telephone conversation transcripts, emails, contracts and written

instructions.</li>

</ul>

<ul>

  <li><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T11">REFLEX

Entity Translation Training/DevTest</a> (LDC2009T11). REFLEX Entity

Translation Training/DevTest is the complete set of training data and

development test data for the <a

 href="http://www.nist.gov/speech/tests/ace/2007/">2007 REFLEX Entity

Translation evaluation</a> sponsored by the National Institute of

Standards and Technology (NIST). It contains approximately 67.5K words

of newswire and weblog text in total across English, Chinese and Arabic

(approximately 22.5K words in each language), translated into each of the

other two languages. The data is annotated for entities and TIMEX2

extents and normalization. </li>

</ul>

<br>

Researchers may

license this data by completing the <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User Agreement for Non-members</a>.  The agreement can be faxed to +1

215 573 2175 or scanned and emailed to this address.  Please indicate

on the license whether you are requesting the entire collection

(LDC2009T07) or just one dataset (LDC2009T10 or LDC2009T11).  The

collection is

being made available at no charge.<br>

<br>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>Additional

Free LDC Resources<br>

</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt;">LDC is pleased to

distribute the Unified Linguistic Annotation Text Collection

(LDC2009T07) corpora at no cost to

support the work of the ULA project. As mentioned above, to

license a copy of this data, non-members should

complete the <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User Agreement for Non-members</a> and fax to +1

215 573 2175 or scan and email to this address.  <span

 class="moz-txt-star">On the

heels of the release of </span>the ULA corpora, LDC would like to

highlight other resources which are

available at no cost.  Free grant-covered copies of the following <a

 href="http://www.talkbank.org/">Talkbank</a> databases can be licensed

from LDC:</p>

<ul type="disc">

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01">LDC2003V01
FORM2 Kinematic Gesture</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01">LDC2003L01
Grassfields Bantu Fieldwork: Dschang Lexicon</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02">LDC2003S02
Grassfields Bantu Fieldwork: Dschang Tone Paradigms</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16">LDC2001S16
Grassfields Bantu Fieldwork: Ngomba Tone Paradigms</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01">LDC2004L01
Klex: Finite-State Lexical Transducer for Korean</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03">LDC2004T03
Morphologically Annotated Korean Text</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06">LDC2003S06
Santa Barbara Corpus of Spoken American English Part II</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10">LDC2004S10
Santa Barbara Corpus of Spoken American English Part III</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25">LDC2005S25
Santa Barbara Corpus of Spoken American English Part IV</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15">LDC2003T15
SLX Corpus of Classic Sociolinguistic Interviews</a></li>

  <li class="MsoNormal" style=""><a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12">LDC2004S12
TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls</a></li>

</ul>


<p class="MsoNormal" style="margin-bottom: 12pt;">A US$30 shipping and

handling fee applies for data on disc.  Further information,

including

additional free datasets such as <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank

1.2</a>, and useful tools such as LDC's parallel text sentence aligner,

<a href="http://sourceforge.net/projects/champollion/">Champollion</a>,

can be

found in our <a href="http://www.ldc.upenn.edu/About/whatsnew.shtml">What's

New! What's Free! Archive</a>.</p>

<span class="moz-txt-star"><br>

</span>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

</div>


</body>

</html>