[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri May 21 17:16:21 UTC 2010
In this newsletter:
- Coming Soon: LDC Data Scholarship Program!
New publications:
- LDC2010S03: 2003 NIST Speaker Recognition Evaluation
- LDC2010T09: ACE 2005 Mandarin SpatialML Annotations
- LDC2010T10: NIST 2002 Open Machine Translation (OpenMT) Evaluation
------------------------------------------------------------------------
*Coming Soon: LDC Data Scholarship Program!*
We are pleased to announce that the LDC Data Scholarship program is in
the works! This program will provide university students with access to
LDC data at no cost. Each year LDC distributes thousands of dollars'
worth of data at no cost or reduced cost to students who demonstrate a
need for data yet cannot secure funding. LDC will formalize this practice
through the newly created LDC Data Scholarship program.
Data scholarships will be offered each semester beginning with the fall
2010 semester (September - December 2010). Students will need to
complete an application, which will include a data use proposal and
letter of support from their faculty adviser. We anticipate that the
selection process will be highly competitive.
Stay tuned for further announcements in our newsletter and on our home page!
*New Publications*
(1) 2003 NIST Speaker Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S03>
was developed by researchers at NIST (National Institute of Standards
and Technology). It consists of just over 120 hours of English
conversational telephone speech used as training data and test data in
the 2003 Speaker Recognition Evaluation (SRE), along with evaluation
metadata and test set answer keys.
2003 NIST Speaker Recognition Evaluation is part of an ongoing series of
yearly evaluations conducted by NIST. These evaluations provide an
important contribution to the direction of research efforts and the
calibration of technical capabilities. They are intended to be of
interest to all researchers working on the general problem of
text-independent speaker recognition. To this end, the evaluation was designed
to be simple, to focus on core technology issues, to be fully supported,
and to be accessible to those wishing to participate.
This speaker recognition evaluation focused on the task of 1-speaker and
2-speaker detection, in the context of conversational telephone speech.
The original evaluation consisted of three parts: 1-speaker detection
"limited data", 2-speaker detection "limited data", and 1-speaker
detection "extended data". This corpus contains training and test data
and supporting metadata (including answer keys) for only the 1-speaker
"limited data" and 2-speaker "limited data" components of the original
evaluation. The 1-speaker "extended data" component of the original
evaluation (not included in this corpus) provided metadata only, to be
used in conjunction with data from Switchboard-2 Phase II (LDC99S79)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79>
and Switchboard-2 Phase III Audio (LDC2002S06)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S06>.
The metadata (resources and answer keys) for the 1-speaker "extended
data" component of the original 2003 SRE evaluation are available from
the NIST Speech Group website for the 2003 Speaker Recognition
Evaluation <http://www.itl.nist.gov/iad/mig/tests/sre/2003/index.html>.
The data in this corpus is a 120-hour subset of data first made
available to the public as Switchboard Cellular Part 2 Audio
(LDC2004S07)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S07>,
reorganized specifically for use in the 2003 NIST SRE.
(2) ACE 2005 Mandarin SpatialML Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T09>
was developed by researchers at The MITRE Corporation
<http://www.mitre.org/> (MITRE). ACE 2005 Mandarin SpatialML Annotations
applies SpatialML tags to a subset of the source Mandarin training data
in ACE 2005 Multilingual Training Corpus (LDC2006T06).
SpatialML is a mark-up language for representing spatial expressions in
natural language documents. SpatialML focuses on geography and
culturally relevant landmarks, rather than biology, cosmology, geology,
or other regions of the spatial language domain. The goal is to allow
for better integration of text collections with resources such as
databases that provide spatial information about a domain, including
gazetteers, physical feature databases and mapping services.
The SpatialML annotation scheme is intended to emulate earlier progress
on time expressions such as TIMEX2 <http://fofoca.mitre.org/>, TimeML
<http://www.timeml.org/site/index.html>, and the 2005 ACE guidelines
<http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05eval_official_results_20060110.html>.
The main SpatialML tag is the PLACE tag, which encodes information about
location. The central goal of SpatialML is to map location information
in text to data from gazetteers and other databases to the extent
possible by defining attributes in the PLACE tag. Therefore, semantic
attributes such as country abbreviations, country subdivision and
dependent area abbreviations (e.g., US states), and geo-coordinates are
used to help establish such a mapping. The SpatialML guidelines are
compatible with existing guidelines for spatial annotation and existing
corpora within the ACE research program.
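
As a rough illustration of how PLACE attributes can tie a mention in text
to gazetteer-style data, here is a minimal Python sketch. The sample
sentence, the attribute names (id, type, country, state, latLong) and
their values are assumptions made up for this example; they are not taken
from this release, so consult the included guidelines for the actual
attribute inventory.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated snippet. The attribute names and values below are
# illustrative assumptions, not copied from the corpus or its guidelines.
SAMPLE = (
    '<doc>She flew to '
    '<PLACE id="p1" type="PPL" country="US" state="MA" '
    'latLong="42.36 -71.06">Boston</PLACE> last week.</doc>'
)

def extract_places(xml_text):
    """Return one dict per PLACE element: the tagged text plus its attributes."""
    root = ET.fromstring(xml_text)
    return [{"text": el.text, **el.attrib} for el in root.iter("PLACE")]

places = extract_places(SAMPLE)
for place in places:
    print(place["text"], place.get("country"), place.get("state"))
```

Each resulting dict could then be matched against a gazetteer record
using the country, subdivision and coordinate fields.
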
This corpus consists of a 298-document subset of broadcast material from
the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been
tagged by a native Mandarin speaker according to version 2.3 of the
SpatialML annotation guidelines, which are included in the documentation
for this release.
(3) NIST 2002 Open Machine Translation (OpenMT) Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T10>
is a package containing source data, reference translations, and scoring
software used in the NIST 2002 OpenMT evaluation. It is designed to help
evaluate the effectiveness of machine translation systems. The package
was compiled, and the scoring software was developed, by researchers at
NIST, making use of newswire source data and reference translations
collected and developed by LDC.
The objective of the NIST OpenMT evaluation series is to support
research in, and help advance the state of the art of, machine
translation (MT) technologies -- technologies that translate text
between human languages. Input may include all forms of text. The goal
is for the output to be an adequate and fluent translation of the
original. Additional information about these evaluations may be found at
the NIST Open Machine Translation (OpenMT) Evaluation web site
<http://www.itl.nist.gov/iad/mig/tests/mt/>.
This evaluation kit includes a single Perl script that may be used to
produce a translation quality score for one or more MT systems. The
script works by comparing the system output translation with a set of
(expert) reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations that
match word sequences in the system output translation.
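
The general idea of matching word sequences against several references
can be sketched in a few lines of Python. This is not the Perl script
shipped in the kit, and it omits whatever weighting, length handling and
tokenization that script applies; the function names and the toy
sentences are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference that contains it most often.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Toy data, not drawn from the corpus.
system_output = "the cat sat on the mat"
reference_set = ["the cat is on the mat", "there is a cat on the mat"]
unigram_score = clipped_precision(system_output, reference_set, 1)
bigram_score = clipped_precision(system_output, reference_set, 2)
print(unigram_score, bigram_score)
```

Longer matching sequences reward fluent word order, while single-word
matches mostly reflect word choice, which is why scores of this kind are
usually computed for several values of n.
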
The Chinese-language source text included in this corpus is a
reorganization of data that was initially released to the public as
Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T17>.
The Chinese-language reference translations are a reorganized subset of
data from the same MTC corpus. The Arabic-language data (source text and
reference translations) is a reorganized subset of data that was
initially released to the public as Multiple-Translation Arabic (MTA)
Part 1 (LDC2003T18)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T18>.
All source data for this corpus is newswire text.
For each language, the test set consists of two files, a source and a
reference file. Each reference file contains four independent
translations of the data set. The evaluation year, source language, test
set, version of the data, and source vs. reference file are reflected in
the file name.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu