<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><b><br>

</b><i>New Publications:</i><b><br>

</b><br>

LDC2010T02<br>

- <b><a href="#czech">Czech
Broadcast News MDE Transcripts</a></b> -<br>

<br>

LDC2010T03<br>

- <b><a href="#gale">GALE

Phase 1 Chinese Newsgroup Parallel Text -

Part 2</a></b> -<br>

<br>

LDC2010T01<br>

- <b><a href="#nist">NIST

Open Machine Translation 2008 Evaluation

(MT08) Selected

Reference and System Translations</a></b> -<br>

<b><br>

</b></div>

<hr size="2" width="100%"><b><br>

</b><br>


<p style="text-align: center;" align="center"><b>New Publications</b><o:p></o:p></p>

<p><a name="czech">(1)</a> <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T02"><span>Czech
Broadcast News MDE Transcripts</span></a>
was

prepared by researchers at the University of West Bohemia, Pilsen,
Czech Republic. It

consists of metadata extraction (MDE) annotations for the approximately

26

hours of transcribed broadcast news speech in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T01">Czech

Broadcast News Transcripts (LDC2004T01)</a>. The audio files

corresponding to

the transcripts in this corpus are contained in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S01">Czech

Broadcast News Speech (LDC2004S01)</a>. Czech Broadcast News MDE

Transcripts

joins LDC's other holdings of Czech broadcast data: <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02">Czech

Broadcast Conversation Speech (LDC2009S02)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20">Czech

Broadcast Conversation MDE Transcripts (LDC2009T20)</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S89">Voice

of America (VOA) Czech Broadcast News Audio (LDC2000S89)</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T53">Voice

of America (VOA) Czech Broadcast News Transcripts (LDC2000T53)</a>. <o:p></o:p></p>

<p>The audio recordings were collected from three Czech radio stations
and two television stations between February 1, 2000 and April 22, 2000.
The broadcasts included both public and commercial subjects and were
presented in various styles, ranging from a formal style to a colloquial
style more typical of commercial broadcast companies that do not
primarily focus on news.</p>

<p>The goal of MDE research is to take raw speech recognition output

and refine

it into forms that are of more use to humans and to downstream

automatic

processes. In simple terms, this means the creation of automatic

transcripts

that are maximally readable. This readability might be achieved in a

number of

ways: removing non-content words like filled pauses and discourse

markers from

the text; removing sections of disfluent speech; and marking
boundaries at natural breakpoints in the flow of speech so that each sentence

or

other meaningful unit of speech might be presented on a separate line

within

the resulting transcript. Natural capitalization, punctuation,

standardized

spelling and sensible conventions for representing speaker turns and

identity

are further elements in the readable transcript. <o:p></o:p></p>
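<p>As a rough illustration of the cleanup steps described above, the
following Python sketch drops filled pauses and prints one sentence-like
unit per line. It is a toy under stated assumptions, not the method used
to build this corpus: the filler list, the function names and the
boundary representation are all hypothetical, and a real MDE system
would work from annotations rather than a fixed word list.</p>

```python
# Hypothetical filler inventory, for illustration only.
FILLED_PAUSES = {"uh", "um", "er"}


def finish(words):
    """Join words into a line with basic capitalization and a final period."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


def clean_transcript(tokens, boundaries):
    """Drop filled pauses and emit one sentence-like unit per line.

    tokens: list of lowercase word strings (raw recognizer output).
    boundaries: set of token indexes after which a unit ends.
    """
    lines, current = [], []
    for i, tok in enumerate(tokens):
        if tok not in FILLED_PAUSES:
            current.append(tok)
        # Close the current unit at a marked boundary, if it has any words.
        if i in boundaries and current:
            lines.append(finish(current))
            current = []
    if current:
        lines.append(finish(current))
    return lines
```

<p>Calling clean_transcript on a token list with a boundary set returns
the readable lines in order; disfluency removal and speaker-turn
formatting are left out to keep the sketch short.</p>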

<p>The transcripts and annotations in this corpus are stored in two

formats: <a href="http://www.mde.zcu.cz/qan.html">QAn (Quick Annotator)</a>
and RTTM (Rich Transcription Time Marked).

Character encoding in all files is ISO-8859-2.<o:p></o:p></p>
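<p>Because the files are encoded as ISO-8859-2 rather than UTF-8, they
need to be opened with an explicit encoding. The Python sketch below
shows one way to read such a file and, optionally, re-encode it as UTF-8.
The function names and file paths are hypothetical; only the encoding
name comes from the description above.</p>

```python
from pathlib import Path


def read_latin2(path):
    """Read a text file encoded as ISO-8859-2 and return a Unicode string."""
    return Path(path).read_text(encoding="iso-8859-2")


def convert_to_utf8(src, dst):
    """Write a UTF-8 copy of an ISO-8859-2 encoded file."""
    Path(dst).write_text(read_latin2(src), encoding="utf-8")
```

<p>Opening the same files with the wrong encoding would typically garble
Czech characters with diacritics, so it is worth converting once and
keeping the UTF-8 copies separate from the originals.</p>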

<br>

<p>[<a href="#top">

top </a>]

</p>

<p><br>

<o:p></o:p></p>

<p style="text-align: center;" align="center">*<o:p></o:p></p>

<p><a name="gale">(2)</a> <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T03">GALE

Phase 1 Chinese Newsgroup Parallel Text - Part 2</a> was prepared by

LDC and

contains 223,000 characters (98 files) of Chinese newsgroup text and

its

translation selected from twenty-one sources. Newsgroups consist of

posts to

electronic bulletin boards, Usenet newsgroups, discussion groups and

similar

forums. This release was used as training data in Phase 1 (year 1) of

the

DARPA-funded GALE program. <o:p></o:p></p>

<p>Preparing the source data involved four stages of work: data

scouting, data

harvesting, formatting and data selection.</p>

<p class="MsoNormal" style="">Data

scouting involved manually searching the web for suitable newsgroup

text. Data scouts

were assigned particular topics and genres along with a production

target in

order to focus their web search. Formal annotation guidelines and a

customized

annotation toolkit helped data scouts to manage the search process and

to track

progress. <o:p></o:p></p>

<p>Data scouts logged their decisions about potential text of interest

to a

database. A nightly process queried the annotation database and

harvested all

designated URLs. Whenever possible, the entire site was downloaded, not

just

the individual thread or post located by the data scout. Once the text

was

downloaded, its format was standardized so that the data could be more

easily

integrated into downstream annotation processes. Typically, a new

script was

required for each new domain name that was identified. After scripts

were run,

an optional manual process corrected any remaining formatting problems.</p>

<p>The selected documents were then reviewed for content suitability using

a

semi-automatic process. A statistical approach was used to rank a

document's

relevance to a set of already-selected documents labeled as "good."

An annotator then reviewed the list of relevance-ranked documents and

selected

those which were suitable for a particular annotation task or for

annotation in

general. These newly-judged documents in turn provided additional input

for the

generation of new ranked lists. <o:p></o:p></p>
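<p>The statistical approach itself is not specified here. Purely as an
illustration of relevance ranking against a set of "good" documents, the
Python sketch below scores each candidate by cosine similarity between
its word counts and the pooled word counts of the good set. The
similarity measure, the whitespace tokenization (Chinese text would need
a proper segmenter) and all function names are assumptions, not a
description of LDC's process.</p>

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts; whitespace tokenization is a simplification."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(good_docs, candidates):
    """Return (score, document) pairs, most similar to the good set first."""
    profile = Counter()
    for doc in good_docs:
        profile.update(vectorize(doc))
    scored = [(cosine(vectorize(c), profile), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

<p>In a loop like the one described above, documents an annotator
accepts from the ranked list would be appended to good_docs before the
next ranking pass.</p>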

<p class="MsoNormal" style="">Manual

sentence units/segments (SU) annotation was also performed as part of

the

transcription task. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. After transcription and

SU

annotation, files were reformatted into a human-readable translation

format and

assigned to professional translators for careful translation.

Translators

followed LDC's GALE Translation guidelines, which describe the makeup of

the

translation team, the source data format, the translation data format,

best

practices for translating certain linguistic features and quality

control

procedures applied to completed translations. <o:p></o:p></p>

<p class="MsoNormal" style=""><br>

</p>

<p class="MsoNormal" style="">[<a href="#top">

top </a>]<br>

<o:p></o:p></p>

<p class="MsoNormal" style="text-align: center;" align="center">*<o:p></o:p></p>

<p><a name="nist">(3)</a> <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T01">NIST

Open Machine Translation 2008 Evaluation (MT08) Selected Reference and

System

Translations</a>. <a href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST

Open MT</a> is an evaluation series to support research in, and help

advance

the state of the art of, technologies that translate text between human

languages. Participants submit machine translation output of source
language data to NIST (National Institute of Standards and Technology);
the output is then evaluated with automatic and manual measures of
quality against high-quality human translations of the same source data.
This program

supports the

growing interest in system combination approaches that generate

improved

translations from output of several different machine translation (MT)

systems.

MT system combination approaches require data sets composed of

high-quality

human reference translations and a variety of machine translations of

the same

text. The NIST Open Machine Translation 2008 Evaluation (MT08) Selected

Reference and System Translations set addresses this need. <o:p></o:p></p>

<p>The data in this release consists of the human reference

translations and

corresponding machine translations for the <a

 href="http://www.itl.nist.gov/iad/mig/tests/mt/2008/">NIST Open MT08</a>

test

sets, which consist of newswire and web data in the four MT08 language

pairs:  Arabic-to-English, Chinese-to-English, English-to-Chinese

(newswire only) and Urdu-to-English. Two documents per language pair

and genre

were removed at random from the test sets for release. For the machine

translations, only output from one submission per training condition

(Constrained and Unconstrained training, where available) per

participant is included.

See section 2 of the MT08 Evaluation Plan for a description of the

training

conditions. The resulting data set has the following characteristics: <o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">Arabic-to-English: 120 documents with

1312 segments, output from 17 machine translation systems.<o:p></o:p></li>

  <li class="MsoNormal" style="">Chinese-to-English: 105 documents with

1312 segments, output from 23 machine translation systems.<o:p></o:p></li>

  <li class="MsoNormal" style="">English-to-Chinese: 127 documents with

1830 segments, output from 11 machine translation systems.<o:p></o:p></li>

  <li class="MsoNormal" style="">Urdu-to-English: 128 documents with

1794 segments, output from 12 machine translation systems.<o:p></o:p></li>

</ul>

<p>The data is organized and annotated in such a way that subsets for

each

language pair and/or data genre and/or training condition can be

extracted and

used separately, depending on the user's needs.<o:p></o:p></p>

<br>

<p class="MsoNormal" style="">[<a href="#top">

top </a>]</p>

<hr size="2" width="100%">

<p class="MsoNormal" style=""><o:p></o:p></p>

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis</big></small></small></font><br>

<font face="Courier New, Courier, monospace"><small><small><big>Membership

Coordinator</big></small></small></font><br>

<br>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font><br>

</div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>