[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Mar 25 21:19:26 UTC 2009
LDC2009T05 - 2008 NIST Metrics for Machine Translation (MetricsMATR08)
Development Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05>
LDC2009T06 - GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06>
LDC2009T07 - Unified Linguistic Annotation Text Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07>
- Additional Free LDC Resources -
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications and highlight free LDC resources.
------------------------------------------------------------------------
*New Publications*
(1) 2008 NIST Metrics for Machine Translation (MetricsMATR08)
Development Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05>
contains data, reference translations, and software used for NIST
MetricsMATR <http://www.nist.gov/speech/tests/metricsmatr/>. NIST
MetricsMATR is a series of research challenge events for machine
translation (MT) metrology, promoting the development of innovative,
even revolutionary, MT metrics that correlate highly with human
assessments of MT quality. In this program, participants submit their
metrics to the National Institute of Standards and Technology (NIST)
<http://www.nist.gov>. NIST runs those metrics on certain held-back test
data for which it has human assessments measuring quality and then
calculates correlations between the automatic metric scores and the
human assessments.
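As a rough illustration of the correlation step described above, the sketch below computes a Pearson coefficient between automatic metric scores and human assessments. The choice of Pearson is an assumption (the announcement does not say which correlation statistics NIST uses), and the score values are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-segment values, invented for illustration only;
# they are not taken from the MetricsMATR08 data.
metric_scores = [0.31, 0.42, 0.18, 0.55, 0.47]
human_adequacy = [3, 4, 2, 6, 5]

r = pearson(metric_scores, human_adequacy)
print(f"Pearson r = {r:.3f}")
```

A metric that tracks human judgments closely would yield a coefficient near 1; a rank-based statistic could be substituted if only the ordering of segments matters.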
In the NIST Metrics for Machine Translation 2008 Evaluation
(MetricsMATR08) <http://www.nist.gov/speech/tests/metricsmatr/2008/>,
participants received as development data a subset of the materials used
in the NIST Open MT06 evaluation
<http://nist.gov/speech/tests/mt/2006/>, specifically, human reference
translations, system translations, and human assessments of adequacy and
preference. The source data comprised twenty-five Arabic-language
newswire documents with a total of 249 segments. The data in each
segment consisted of four human reference translations in English and
system translations from eight different MT06 machine translation
systems. In addition to the data and reference translations, this
release includes software tools for evaluation and reporting and
documentation describing how the human assessments were obtained and how
they are represented in the data. The evaluation plan
<http://www.nist.gov/speech/tests/metricsmatr/2008/doc/mm08_evalplan_v1.1.pdf>
contains further information and rules on the use of this data.
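The shape of the development data described above can be sketched as a small data structure. The class and field names below are hypothetical and say nothing about the actual file format of the release; only the counts (249 segments, four reference translations, eight system translations) come from the description.

```python
from dataclasses import dataclass, field

# Counts taken from the description of the development data.
NUM_SEGMENTS = 249
REFS_PER_SEGMENT = 4
SYSTEMS_PER_SEGMENT = 8


@dataclass
class Segment:
    """One source segment with its translations (hypothetical layout)."""
    segment_id: int
    references: list = field(default_factory=list)      # human reference translations
    system_outputs: dict = field(default_factory=dict)  # system name -> translation


def is_complete(seg):
    """True if the segment carries the expected number of translations."""
    return (len(seg.references) == REFS_PER_SEGMENT
            and len(seg.system_outputs) == SYSTEMS_PER_SEGMENT)


# Totals implied by the per-segment counts.
total_references = NUM_SEGMENTS * REFS_PER_SEGMENT
total_system_outputs = NUM_SEGMENTS * SYSTEMS_PER_SEGMENT
print(total_references, total_system_outputs)  # prints: 996 1992
```

Under this reading, the set holds 996 reference translations and 1992 system translations in all.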
The MetricsMATR program seeks to overcome several drawbacks of the
methods currently employed for the evaluation of MT technology.
Automatic metrics have not yet proved able to predict the usefulness and
reliability of MT technologies with confidence. Nor have automatic
metrics demonstrated that they are meaningful in target languages other
than English. Human assessments, however, are expensive, slow,
subjective and difficult to standardize. These problems, and the need to
overcome them through the development of improved automatic (or even
semi-automatic) metrics, have been a constant point of discussion at
past NIST MT evaluation events. MetricsMATR aims to provide a platform
to address these shortcomings.
***
(2) GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06>
contains transcripts and English translations of 24 hours of Chinese
broadcast conversation programming from China Central TV (CCTV), Phoenix
TV and Voice of America (VOA). It does not contain the audio files from
which the transcripts and translations were generated. This release,
along with other corpora, was used as training data in Phase 1 (year 1)
of the DARPA-funded GALE program. GALE Phase 1 Chinese Broadcast
Conversation Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02>
was released in January 2009.
A manual selection procedure was used to choose data appropriate for the
GALE program, namely, conversation (talk) programs focusing on current
events. Stories on topics such as sports, entertainment and business
were excluded from the data set.
The selected audio snippets were carefully transcribed by LDC annotators
and professional transcription agencies following LDC's Quick Rich
Transcription specification. Manual sentence unit/segment (SU)
annotation was also performed as part of the transcription task. Three
types of end-of-sentence SU were identified: statement SU, question SU,
and incomplete SU.
After transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names and
speech disfluencies) and quality control procedures applied to completed
translations.
***
(3) The Unified Linguistic Annotation Text Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07>
consists of two datasets: the Language Understanding Annotation Corpus
(LDC2009T10) and REFLEX Entity Translation Training/DevTest
(LDC2009T11). Most recent annotation efforts for language have focused
on small pieces of the larger problem of semantic annotation rather than
producing a single unified representation. The Unified Linguistic
Annotation (ULA) project, sponsored by the National Science Foundation
<http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0551615>, seeks
to integrate into one framework different layers of annotation (e.g.,
semantics, discourse, temporal, opinions) using various existing
resources, including PropBank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>,
NomBank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23>,
TimeBank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>,
Penn Discourse Treebank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05>
and coreference and opinion annotations. The project represents a
concerted effort of researchers from several institutions to develop a
large word corpus with balanced and annotated data. The Unified
Linguistic Annotation Text Collection is provided as a resource for the
ULA effort. It consists of two datasets:
* The Language Understanding Annotation Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10>
(LDC2009T10). The Language Understanding Annotation Corpus was
developed at the Johns Hopkins Center of Excellence in Human
Language Technology <http://web.jhu.edu/hltcoe>. It consists of
over 9000 words of text in English (6949 words) and Arabic (2183
words), annotated for committed belief, event and entity,
coreference, dialog acts and temporal relations. The materials
were chosen from various sources to represent "informal input,"
that is, text that contains colloquial forms. The documents in the
corpus include excerpts from newswire stories, telephone
conversation transcripts, emails, contracts and written instructions.
* REFLEX Entity Translation Training/DevTest
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T11>
(LDC2009T11). REFLEX Entity Translation Training/DevTest is the
complete set of training data and development test data for the
2007 REFLEX Entity Translation evaluation
<http://www.nist.gov/speech/tests/ace/2007/> sponsored by the
National Institute of Standards and Technology (NIST). It contains
approximately 67.5K words of newswire and weblog text for each of
English, Chinese and Arabic (or approximately 22.5K words in each
language) translated into each of the other two languages. The
data is annotated for entities and TIMEX2 extents and normalization.
Researchers may license this data by completing the LDC User Agreement
for Non-members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this address. Please indicate on the license whether you are requesting
the entire collection (LDC2009T07) or just one dataset (LDC2009T10 or
LDC2009T11). The collection is being made available at no charge.
*Additional Free LDC Resources*
LDC is pleased to distribute the Unified Linguistic Annotation Text
Collection (LDC2009T07) corpora at no cost to support the work of the
ULA project. As mentioned above, to license a copy of this data,
non-members should complete the LDC User Agreement for Non-members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>
and fax to +1 215 573 2175 or scan and email to this address. On the
heels of the release of the ULA corpora, LDC would like to highlight
other resources which are available at no cost. Free grant-covered
copies of the following TalkBank <http://www.talkbank.org/> databases
can be licensed from LDC:
* LDC2003V01 FORM2 Kinematic Gesture
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01>
* LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01>
* LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02>
* LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16>
* LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01>
* LDC2004T03 Morphologically Annotated Korean Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03>
* LDC2003S06 Santa Barbara Corpus of Spoken American English Part
II
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06>
* LDC2004S10 Santa Barbara Corpus of Spoken American English Part
III
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10>
* LDC2005S25 Santa Barbara Corpus of Spoken American English Part
IV
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>
* LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15>
* LDC2004S12 TalkBank Ethology Data: Field Recordings of Vervet
Monkey Calls
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12>
A US$30 shipping and handling fee applies for data on disc. Further
information, including additional free datasets such as TimeBank 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>,
and useful tools such as LDC's parallel text sentence aligner,
Champollion <http://sourceforge.net/projects/champollion/>, can be found
in our What's New! What's Free! Archive
<http://www.ldc.upenn.edu/About/whatsnew.shtml>.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu