<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"><a name="top"></a><br>
<i>New Publications:</i><br>
<br>
LDC2010T02<br>
- <b><a href="#czech">Czech Broadcast News MDE Transcripts</a></b><br>
<br>
LDC2010T03<br>
- <b><a href="#gale">GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2</a></b><br>
<br>
LDC2010T01<br>
- <b><a href="#nist">NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations</a></b><br>
<br>
</div>
<hr size="2" width="100%"><br>
<br>
<p style="text-align: center;" align="center"><b>New Publications</b></p>
<p><a name="czech">(1)</a> <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T02">Czech
Broadcast News MDE Transcripts</a>
was
prepared by researchers at the University
of West Bohemia, Pilsen,
Czech Republic. It
consists of metadata extraction (MDE) annotations for the approximately
26
hours of transcribed broadcast news speech in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T01">Czech
Broadcast News Transcripts (LDC2004T01)</a>. The audio files
corresponding to
the transcripts in this corpus are contained in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S01">Czech
Broadcast News Speech (LDC2004S01)</a>. Czech Broadcast News MDE
Transcripts
joins LDC's other holdings of Czech broadcast data: <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02">Czech
Broadcast Conversation Speech (LDC2009S02)</a>, <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20">Czech
Broadcast Conversation MDE Transcripts (LDC2009T20)</a>, <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S89">Voice
of America (VOA) Czech Broadcast News Audio (LDC2000S89)</a> and <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T53">Voice
of America (VOA) Czech Broadcast News Transcripts (LDC2000T53)</a>.</p>
<p>The audio recordings were collected from three Czech radio stations
and two television stations between February 1, 2000 and April 22, 2000.
The broadcasts included both public and commercial subjects and were
presented in
various styles, ranging from a formal style to a colloquial style more
typical
of commercial broadcast companies that do not focus primarily on news.</p>
<p>The goal of MDE research is to take raw speech recognition output
and refine
it into forms that are of more use to humans and to downstream
automatic
processes. In simple terms, this means the creation of automatic
transcripts
that are maximally readable. This readability might be achieved in a
number of
ways: removing non-content words like filled pauses and discourse
markers from
the text; removing sections of disfluent speech; and creating
boundaries
between natural breakpoints in the flow of speech so that each sentence
or
other meaningful unit of speech might be presented on a separate line
within
the resulting transcript. Natural capitalization, punctuation,
standardized
spelling and sensible conventions for representing speaker turns and
identity
are further elements in the readable transcript.</p>
<p>The transcripts and annotations in this corpus are stored in two
formats: <a href="http://www.mde.zcu.cz/qan.html">QAn (Quick Annotator)</a>
and RTTM.
Character encoding in all files is ISO-8859-2.</p>
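<p>Because the files are encoded in ISO-8859-2 rather than UTF-8, they
generally need to be decoded explicitly before further processing. The
following is a minimal Python sketch and is not part of the corpus
documentation; the function name latin2_to_utf8 is hypothetical, and only
the encoding name is taken from the description above.</p>
<pre>
```python
def latin2_to_utf8(raw):
    """Re-encode ISO-8859-2 (Latin-2) bytes as UTF-8 bytes.

    Hypothetical helper for illustration; only the source encoding
    (ISO-8859-2) comes from the corpus description.
    """
    # Decode the Latin-2 bytes to a Python string, then encode as UTF-8.
    text = raw.decode("iso-8859-2")
    return text.encode("utf-8")


if __name__ == "__main__":
    # "Plze\u0148" is the Czech spelling of Pilsen, used here as sample input.
    sample = "Plze\u0148".encode("iso-8859-2")
    print(latin2_to_utf8(sample).decode("utf-8"))
```
</pre>
<p>When reading whole files, passing encoding="iso-8859-2" to Python's
built-in open() should have the same effect.</p>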
<br>
<p>[<a href="#top">
top </a>]
</p>
<p style="text-align: center;" align="center">*</p>
<p><a name="gale">(2)</a> <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T03">GALE
Phase 1 Chinese Newsgroup Parallel Text - Part 2</a> was prepared by
LDC and
contains 223,000 characters (98 files) of Chinese newsgroup text and
its
translation selected from twenty-one sources. Newsgroups consist of
posts to
electronic bulletin boards, Usenet newsgroups, discussion groups and
similar
forums. This release was used as training data in Phase 1 (year 1) of
the
DARPA-funded GALE program.</p>
<p>Preparing the source data involved four stages of work: data
scouting, data
harvesting, formatting and data selection.</p>
<p class="MsoNormal" style="">Data
scouting involved manually searching the web for suitable newsgroup
text. Data scouts
were assigned particular topics and genres along with a production
target in
order to focus their web search. Formal annotation guidelines and a
customized
annotation toolkit helped data scouts to manage the search process and
to track
progress.</p>
<p>Data scouts logged their decisions about potential text of interest
to a
database. A nightly process queried the annotation database and
harvested all
designated URLs. Whenever possible, the entire site was downloaded, not
just
the individual thread or post located by the data scout. Once the text
was
downloaded, its format was standardized so that the data could be more
easily
integrated into downstream annotation processes. Typically, a new
script was
required for each new domain name that was identified. After scripts
were run,
an optional manual process corrected any remaining formatting problems.<br>
<br>
The selected documents were then reviewed for content-suitability using
a
semi-automatic process. A statistical approach was used to rank a
document's
relevance to a set of already-selected documents labeled as "good."
An annotator then reviewed the list of relevance-ranked documents and
selected
those which were suitable for a particular annotation task or for
annotation in
general. These newly judged documents in turn provided additional input
for the
generation of new ranked lists.</p>
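<p>The description above does not say which statistical model was used
for relevance ranking. As a rough illustration only, the Python sketch
below ranks candidate documents by cosine similarity between their
bag-of-words counts and the pooled counts of documents already labeled
"good." The technique, the function names and the sample texts are all
assumptions, not LDC's actual process, and the whitespace tokenizer would
have to be replaced by word segmentation for Chinese text.</p>
<pre>
```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (a deliberately crude model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_relevance(good_docs, candidates):
    """Return (score, text) pairs, most similar to the "good" set first."""
    # Pool the word counts of all documents already judged "good".
    profile = Counter()
    for doc in good_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(bag_of_words(doc), profile), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Made-up sample texts, for illustration only.
    good = ["election results discussion thread", "forum post about election policy"]
    pool = ["buy cheap watches now", "thread discussing election policy results"]
    for score, doc in rank_by_relevance(good, pool):
        print(round(score, 3), doc)
```
</pre>
<p>In a loop like the one described above, documents that an annotator
accepts would be appended to the "good" set before the next ranked list
is generated.</p>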
<p class="MsoNormal" style="">Manual
sentence unit/segment (SU) annotation was also performed as part of
the
transcription task. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. After transcription and
SU
annotation, files were reformatted into a human-readable translation
format and
assigned to professional translators for careful translation.
Translators
followed LDC's GALE Translation guidelines, which describe the makeup of
the
translation team, the source data format, the translation data format,
best
practices for translating certain linguistic features and quality
control
procedures applied to completed translations.</p>
<p class="MsoNormal" style=""><br>
</p>
<p class="MsoNormal" style="">[<a href="#top">
top </a>]<br>
</p>
<p class="MsoNormal" style="text-align: center;" align="center">*</p>
<p><a name="nist">(3)</a> <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T01">NIST
Open Machine Translation 2008 Evaluation (MT08) Selected Reference and
System
Translations</a>. <a href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST
Open MT</a> is an evaluation series to support research in, and help
advance
the state of the art of, technologies that translate text between human
languages. Participants submit machine translation output of source
language
data to NIST (National Institute of Standards and Technology); the
output is
then evaluated with automatic and manual measures of quality against
high
quality human translations of the same source data. This program
supports the
growing interest in system combination approaches that generate
improved
translations from output of several different machine translation (MT)
systems.
MT system combination approaches require data sets composed of
high-quality
human reference translations and a variety of machine translations of
the same
text. The NIST Open Machine Translation 2008 Evaluation (MT08) Selected
Reference and System Translations set addresses this need.</p>
<p>The data in this release consists of the human reference
translations and
corresponding machine translations for the <a
href="http://www.itl.nist.gov/iad/mig/tests/mt/2008/">NIST Open MT08</a>
test
sets, which consist of newswire and web data in the four MT08 language
pairs: Arabic-to-English, Chinese-to-English, English-to-Chinese
(newswire only) and Urdu-to-English. Two documents per language pair
and genre
were removed at random from the test sets for release. For the machine
translations, only output from one submission per training condition
(Constrained and Unconstrained training, where available) per
participant is included.
See section 2 of the MT08 Evaluation Plan for a description of the
training
conditions. The resulting data set has the following characteristics:</p>
<ul type="disc">
<li class="MsoNormal" style="">Arabic-to-English: 120 documents with
1312 segments, output from 17 machine translation systems.</li>
<li class="MsoNormal" style="">Chinese-to-English: 105 documents with
1312 segments, output from 23 machine translation systems.</li>
<li class="MsoNormal" style="">English-to-Chinese: 127 documents with
1830 segments, output from 11 machine translation systems.</li>
<li class="MsoNormal" style="">Urdu-to-English: 128 documents with
1794 segments, output from 12 machine translation systems.</li>
</ul>
<p>The data is organized and annotated in such a way that subsets for
each
language pair and/or data genre and/or training condition can be
extracted and
used separately, depending on the user's needs.</p>
<br>
<p class="MsoNormal" style="">[<a href="#top">
top </a>]</p>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis</big></small></small></font><br>
<font face="Courier New, Courier, monospace"><small><small><big>Membership
Coordinator</big></small></small></font><br>
<br>
<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font><br>
</div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>