[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Jan 27 16:16:40 UTC 2010


New Publications:
LDC2010T02 - Czech Broadcast News MDE Transcripts

LDC2010T03 - GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2

LDC2010T01 - NIST Open Machine Translation 2008 Evaluation (MT08) Selected 
Reference and System Translations
------------------------------------------------------------------------

New Publications

(1) Czech Broadcast News MDE Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T02> 
was prepared by researchers at the University of West Bohemia, Pilsen, 
Czech Republic. It consists of metadata extraction (MDE) annotations for 
the approximately 26 hours of transcribed broadcast news speech in Czech 
Broadcast News Transcripts (LDC2004T01) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T01>. 
The audio files corresponding to the transcripts in this corpus are 
contained in Czech Broadcast News Speech (LDC2004S01) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S01>. 
Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech 
broadcast data: Czech Broadcast Conversation Speech (LDC2009S02) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>, 
Czech Broadcast Conversation MDE Transcripts (LDC2009T20) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>, 
Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S89> 
and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T53>.

The audio recordings were collected from February 1, 2000 through April 
22, 2000 from three Czech radio stations and two television stations. 
The broadcasts came from both public and commercial stations and were 
presented in a range of styles, from formal delivery to the more 
colloquial style typical of commercial broadcasters that do not 
primarily focus on news.

The goal of MDE research is to take raw speech recognition output and 
refine it into forms that are of more use to humans and to downstream 
automatic processes. In simple terms, this means the creation of 
automatic transcripts that are maximally readable. This readability 
might be achieved in a number of ways: removing non-content words like 
filled pauses and discourse markers from the text; removing sections of 
disfluent speech; and creating boundaries between natural breakpoints in 
the flow of speech so that each sentence or other meaningful unit of 
speech might be presented on a separate line within the resulting 
transcript. Natural capitalization, punctuation, standardized spelling 
and sensible conventions for representing speaker turns and identity are 
further elements in the readable transcript.
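The release does not include processing code; the sketch below is only an illustration of the kinds of cleanup steps described above (filler removal and segmentation at annotated boundaries). The filler list and boundary indices are hypothetical, and real MDE annotation uses much richer, language-specific categories.

```python
# Hypothetical, English-like filler/discourse-marker list; actual MDE
# annotation distinguishes many more edit and disfluency types.
FILLERS = {"uh", "um", "eh", "you know", "i mean"}

def clean_transcript(tokens):
    """Drop non-content filler tokens from a raw ASR token stream."""
    return [t for t in tokens if t.lower() not in FILLERS]

def segment(tokens, boundaries):
    """Split a token stream into SU-like units at annotated boundary indices."""
    units, start = [], 0
    for b in boundaries:
        units.append(tokens[start:b])
        start = b
    if start < len(tokens):
        units.append(tokens[start:])
    return units

raw = ["uh", "the", "markets", "fell", "today", "um",
       "analysts", "expect", "more"]
tokens = clean_transcript(raw)
# One annotated SU boundary after "today" (index 4 in the cleaned stream)
for unit in segment(tokens, [4]):
    print(" ".join(unit))
```

Each unit would then be printed on its own line of the readable transcript, with capitalization and punctuation restored.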

The transcripts and annotations in this corpus are stored in two 
formats: QAn (Quick Annotator) <http://www.mde.zcu.cz/qan.html>, and 
RTTM. Character encoding in all files is ISO-8859-2.
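As a rough guide to the RTTM side of the release, a minimal reader for RTTM-style records might look like the following. It assumes the common nine-field layout (type, file, channel, tbeg, tdur, ortho, stype, name, conf); consult the corpus documentation for the exact fields used, and remember that the files themselves are ISO-8859-2 encoded.

```python
# Minimal reader for RTTM-style records (NIST Rich Transcription Time Marked).
# Field layout assumed here: type file channel tbeg tdur ortho stype name conf.
# Open the corpus files with encoding="iso-8859-2" as noted in the release.
def parse_rttm_line(line):
    t, f, chnl, tbeg, tdur, ortho, stype, name, conf = line.split()[:9]
    return {
        "type": t, "file": f, "channel": chnl,
        "tbeg": float(tbeg), "tdur": float(tdur),
        "ortho": ortho, "stype": stype, "name": name,
        "conf": None if conf == "<NA>" else float(conf),
    }

rec = parse_rttm_line("SPEAKER cro1 1 12.34 5.60 <NA> <NA> spk01 <NA>")
print(rec["tbeg"], rec["name"])
```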





(2) GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T03> 
was prepared by LDC and contains 223,000 characters (98 files) of 
Chinese newsgroup text and its translation selected from twenty-one 
sources. Newsgroups consist of posts to electronic bulletin boards, 
Usenet newsgroups, discussion groups and similar forums. This release 
was used as training data in Phase 1 (year 1) of the DARPA-funded GALE 
program.

Preparing the source data involved four stages of work: data scouting, 
data harvesting, formatting and data selection.

Data scouting involved manually searching the web for suitable newsgroup 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest to a 
database. A nightly process queried the annotation database and 
harvested all designated URLs. Whenever possible, the entire site was 
downloaded, not just the individual thread or post located by the data 
scout. Once the text was downloaded, its format was standardized so that 
the data could be more easily integrated into downstream annotation 
processes. Typically, a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.

The selected documents were then reviewed for content-suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general. These newly-judged 
documents in turn provided additional input for the generation of new 
ranked lists.
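The announcement does not specify the statistical model used for relevance ranking. As one minimal sketch of the general idea, candidate documents can be ranked by cosine similarity of bag-of-words counts against a centroid of the documents already judged "good"; the example texts below are invented.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_candidates(good_docs, candidates):
    """Rank unjudged documents by similarity to the 'good' centroid."""
    centroid = Counter()
    for d in good_docs:
        centroid.update(bow(d))
    return sorted(candidates, key=lambda d: cosine(bow(d), centroid),
                  reverse=True)

good = ["stock markets rallied today", "markets react to trade news"]
cands = ["recipe for dumplings", "markets fall on trade fears"]
print(rank_candidates(good, cands)[0])
```

An annotator would review the top of such a ranked list, and each newly accepted document would be folded back into the "good" set before the next ranking pass.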

Manual sentence unit/segment (SU) annotation was also performed as 
part of the transcription task. Three types of sentence-ending SU were 
identified: statement SU, question SU, and incomplete SU. After 
transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features and quality 
control procedures applied to completed translations.




(3) NIST Open Machine Translation 2008 Evaluation (MT08) Selected 
Reference and System Translations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T01>. 
NIST Open MT <http://www.itl.nist.gov/iad/mig/tests/mt/> is an 
evaluation series to support research in, and help advance the state of 
the art of, technologies that translate text between human languages. 
Participants submit machine translation output of source language 
data to NIST (National Institute of Standards and Technology); the 
output is then evaluated with automatic and manual measures of quality 
against high quality human translations of the same source data. This 
program supports the growing interest in system combination approaches 
that generate improved translations from output of several different 
machine translation (MT) systems. MT system combination approaches 
require data sets composed of high-quality human reference translations 
and a variety of machine translations of the same text. The NIST Open 
Machine Translation 2008 Evaluation (MT08) Selected Reference and System 
Translations set addresses this need.

The data in this release consists of the human reference translations 
and corresponding machine translations for the NIST Open MT08 
<http://www.itl.nist.gov/iad/mig/tests/mt/2008/> test sets, which 
consist of newswire and web data in the four MT08 language pairs:  
Arabic-to-English, Chinese-to-English, English-to-Chinese (newswire 
only) and Urdu-to-English. Two documents per language pair and genre 
were removed at random from the test sets for release. For the machine 
translations, only output from one submission per training condition 
(Constrained and Unconstrained training, where available) per 
participant is included. See section 2 of the MT08 Evaluation Plan for a 
description of the training conditions. The resulting data set has the 
following characteristics:

    * Arabic-to-English: 120 documents with 1312 segments, output from
      17 machine translation systems.
    * Chinese-to-English: 105 documents with 1312 segments, output from
      23 machine translation systems.
    * English-to-Chinese: 127 documents with 1830 segments, output from
      11 machine translation systems.
    * Urdu-to-English: 128 documents with 1794 segments, output from 12
      machine translation systems.

The data is organized and annotated in such a way that subsets for each 
language pair and/or data genre and/or training condition can be 
extracted and used separately, depending on the user's needs.
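As a sketch of what such subset extraction could look like, the snippet below filters records by language pair, genre, and training condition. The field names and values are purely illustrative and are not the corpus's actual metadata schema.

```python
# Hypothetical metadata fields ("langpair", "genre", "condition") for
# illustration only; see the corpus documentation for the real annotation.
def select_subset(records, langpair=None, genre=None, condition=None):
    def keep(r):
        return ((langpair is None or r["langpair"] == langpair)
                and (genre is None or r["genre"] == genre)
                and (condition is None or r["condition"] == condition))
    return [r for r in records if keep(r)]

records = [
    {"id": 1, "langpair": "ara-eng", "genre": "nw", "condition": "Constrained"},
    {"id": 2, "langpair": "chi-eng", "genre": "wb", "condition": "Unconstrained"},
]
print(select_subset(records, langpair="ara-eng"))
```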



------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

