[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Jan 27 16:16:40 UTC 2010
New Publications:

LDC2010T02 - Czech Broadcast News MDE Transcripts
LDC2010T03 - GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2010T01 - NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations
------------------------------------------------------------------------
New Publications
(1) Czech Broadcast News MDE Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T02>
was prepared by researchers at the University of West Bohemia, Pilsen,
Czech Republic. It consists of metadata extraction (MDE) annotations for
the approximately 26 hours of transcribed broadcast news speech in Czech
Broadcast News Transcripts (LDC2004T01)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T01>.
The audio files corresponding to the transcripts in this corpus are
contained in Czech Broadcast News Speech (LDC2004S01)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S01>.
Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech
broadcast data: Czech Broadcast Conversation Speech (LDC2009S02)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>,
Czech Broadcast Conversation MDE Transcripts (LDC2009T20)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>,
Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S89>
and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T53>.
The audio recordings were collected from February 1, 2000 through April
22, 2000 from three Czech radio stations and two television stations.
The broadcasts included both public and commercial subjects and were
presented in a range of styles, from formal delivery to the colloquial
style more typical of commercial broadcasters that do not focus
primarily on news.
The goal of MDE research is to take raw speech recognition output and
refine it into forms that are of more use to humans and to downstream
automatic processes. In simple terms, this means the creation of
automatic transcripts that are maximally readable. This readability
might be achieved in a number of ways: removing non-content words like
filled pauses and discourse markers from the text; removing sections of
disfluent speech; and creating boundaries between natural breakpoints in
the flow of speech so that each sentence or other meaningful unit of
speech might be presented on a separate line within the resulting
transcript. Natural capitalization, punctuation, standardized spelling
and sensible conventions for representing speaker turns and identity are
further elements in the readable transcript.
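The cleanup steps described above can be illustrated with a toy sketch. This is not LDC's actual MDE pipeline; the filler inventory and the input sentence are invented for the example, and real SU boundaries come from annotation rather than punctuation.

```python
# Toy illustration of MDE-style cleanup: strip filled pauses, then put
# each sentence-like unit on its own line. The filler list is a
# hypothetical stand-in for a real filled-pause inventory.
import re

FILLERS = {"uh", "um", "er"}  # hypothetical filler inventory

def clean_transcript(raw):
    # Split at sentence-final punctuation, standing in for SU boundaries.
    units = re.split(r"(?<=[.?!])\s+", raw.strip())
    cleaned = []
    for unit in units:
        kept = [w for w in unit.split()
                if w.lower().strip(",.") not in FILLERS]
        if kept:
            cleaned.append(" ".join(kept))
    return "\n".join(cleaned)

raw = "uh the vote passed today. um lawmakers will meet again Friday."
print(clean_transcript(raw))
```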
The transcripts and annotations in this corpus are stored in two
formats: QAn (Quick Annotator) <http://www.mde.zcu.cz/qan.html>, and
RTTM. Character encoding in all files is ISO-8859-2.
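As a rough sketch of working with the RTTM files, the snippet below parses records assuming the nine-field NIST RTTM layout (type, file, channel, start time, duration, orthography, subtype, speaker, confidence); consult the corpus documentation for the exact field set used in this release. Note the ISO-8859-2 encoding mentioned above.

```python
# Minimal RTTM reader, assuming the nine-field NIST layout.
from io import StringIO

FIELDS = ["type", "file", "chnl", "tbeg", "tdur",
          "ortho", "stype", "name", "conf"]

def parse_rttm(stream):
    records = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith(";;"):  # skip blanks and comments
            continue
        records.append(dict(zip(FIELDS, line.split(None, len(FIELDS) - 1))))
    return records

# In real use, open the file with the corpus encoding:
#   open(path, encoding="iso-8859-2")
sample = StringIO("SU 20000201_1800 1 12.34 3.21 <NA> statement <NA> 0.9\n")
recs = parse_rttm(sample)
```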
(2) GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T03>
was prepared by LDC and contains 223,000 characters (98 files) of
Chinese newsgroup text and its translation selected from twenty-one
sources. Newsgroups consist of posts to electronic bulletin boards,
Usenet newsgroups, discussion groups and similar forums. This release
was used as training data in Phase 1 (year 1) of the DARPA-funded GALE
program.
Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup
text. Data scouts were assigned particular topics and genres along with
a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest to a
database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so that
the data could be more easily integrated into downstream annotation
processes. Typically, a new script was required for each new domain name
that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems.
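The nightly harvesting step might look something like the sketch below: pull the URLs that scouts flagged out of the annotation database and group them by domain, since each domain needed its own formatting script. The table and column names here are hypothetical, not LDC's actual schema.

```python
# Illustrative sketch of the nightly harvest: query flagged URLs and
# group them by domain so a per-domain formatting script can be chosen.
import sqlite3
from collections import defaultdict
from urllib.parse import urlparse

def harvest_targets(conn):
    """Return URLs flagged for harvest, grouped by domain name."""
    rows = conn.execute(
        "SELECT url FROM scouting_log WHERE decision = 'harvest'"
    ).fetchall()
    by_domain = defaultdict(list)
    for (url,) in rows:
        by_domain[urlparse(url).netloc].append(url)
    return dict(by_domain)

# Demo with an in-memory database standing in for the annotation DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scouting_log (url TEXT, decision TEXT)")
conn.executemany(
    "INSERT INTO scouting_log VALUES (?, ?)",
    [("http://forum.example.com/thread/1", "harvest"),
     ("http://forum.example.com/thread/2", "harvest"),
     ("http://other.example.org/post/9", "harvest"),
     ("http://spam.example.net/x", "reject")],
)
targets = harvest_targets(conn)
```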
The selected documents were then reviewed for content-suitability using
a semi-automatic process. A statistical approach was used to rank a
document's relevance to a set of already-selected documents labeled as
"good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular
annotation task or for annotation in general. These newly-judged
documents in turn provided additional input for the generation of new
ranked lists.
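One simple way to realize this kind of ranking is TF-IDF cosine similarity against a centroid of the already-selected "good" documents. The announcement does not specify GALE's actual statistical approach, so the sketch below is only one plausible instance; the texts are invented.

```python
# Rank candidate documents by TF-IDF cosine similarity to a centroid
# built from documents already judged "good". A simplified stand-in
# for the semi-automatic selection process, not the GALE procedure.
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(tokenized)
    return [{t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
            for tf in (Counter(toks) for toks in tokenized)]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(candidates, good):
    vecs = tfidf_vectors(good + candidates)
    centroid = Counter()
    for v in vecs[:len(good)]:
        centroid.update(v)
    scored = [(cosine(v, centroid), doc)
              for v, doc in zip(vecs[len(good):], candidates)]
    return sorted(scored, reverse=True)

good = ["election results vote parliament", "vote count election fraud"]
cands = ["parliament vote delayed again", "recipe for dumplings and soup"]
ranked = rank(cands, good)
```

An annotator would then review the top of `ranked`, and the newly accepted documents would feed back into `good` for the next pass.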
Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of sentence-final SUs were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features and quality
control procedures applied to completed translations.
(3) NIST Open Machine Translation 2008 Evaluation (MT08) Selected
Reference and System Translations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T01>.
NIST Open MT <http://www.itl.nist.gov/iad/mig/tests/mt/> is an
evaluation series to support research in, and help advance the state of
the art of, technologies that translate text between human languages.
Participants submit machine translation output of source language
data to NIST (National Institute of Standards and Technology); the
output is then evaluated with automatic and manual measures of quality
against high quality human translations of the same source data. This
program supports the growing interest in system combination approaches
that generate improved translations from output of several different
machine translation (MT) systems. MT system combination approaches
require data sets composed of high-quality human reference translations
and a variety of machine translations of the same text. The NIST Open
Machine Translation 2008 Evaluation (MT08) Selected Reference and System
Translations set addresses this need.
The data in this release consists of the human reference translations
and corresponding machine translations for the NIST Open MT08
<http://www.itl.nist.gov/iad/mig/tests/mt/2008/> test sets, which
consist of newswire and web data in the four MT08 language pairs:
Arabic-to-English, Chinese-to-English, English-to-Chinese (newswire
only) and Urdu-to-English. Two documents per language pair and genre
were removed at random from the test sets for release. For the machine
translations, only output from one submission per training condition
(Constrained and Unconstrained training, where available) per
participant is included. See section 2 of the MT08 Evaluation Plan for a
description of the training conditions. The resulting data set has the
following characteristics:
* Arabic-to-English: 120 documents with 1312 segments, output from
17 machine translation systems.
* Chinese-to-English: 105 documents with 1312 segments, output from
23 machine translation systems.
* English-to-Chinese: 127 documents with 1830 segments, output from
11 machine translation systems.
* Urdu-to-English: 128 documents with 1794 segments, output from 12
machine translation systems.
The data is organized and annotated in such a way that subsets for each
language pair and/or data genre and/or training condition can be
extracted and used separately, depending on the user's needs.
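A subset extraction of the kind described above might look like the sketch below. The record attributes mirror the dimensions named in this announcement (language pair, genre, training condition), but the metadata layout is hypothetical, not the corpus's actual file structure.

```python
# Filter a release down to one language pair and training condition.
# The records here are invented stand-ins for the corpus metadata.
records = [
    {"pair": "Arabic-to-English", "genre": "newswire",
     "condition": "Constrained", "system": "sys01"},
    {"pair": "Arabic-to-English", "genre": "web",
     "condition": "Unconstrained", "system": "sys02"},
    {"pair": "Urdu-to-English", "genre": "newswire",
     "condition": "Constrained", "system": "sys03"},
]

def subset(records, **criteria):
    """Return records matching every given attribute."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

arabic_constrained = subset(records, pair="Arabic-to-English",
                            condition="Constrained")
```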
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
More information about the Corpora mailing list