[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri May 21 17:16:21 UTC 2010


In this newsletter:

- Coming Soon: LDC Data Scholarship Program!

New publications:

LDC2010S03
- 2003 NIST Speaker Recognition Evaluation

LDC2010T09
- ACE 2005 Mandarin SpatialML Annotations

LDC2010T10
- NIST 2002 Open Machine Translation (OpenMT) Evaluation

------------------------------------------------------------------------


*Coming Soon: LDC Data Scholarship Program!*

We are pleased to announce that the LDC Data Scholarship program is in 
the works! This program will provide university students with access to 
LDC data at no cost. Each year LDC distributes thousands of dollars' 
worth of data at no cost or reduced cost to students who demonstrate a 
need for data yet cannot secure funding. LDC will formalize this practice 
through the newly created LDC Data Scholarship program.

Data scholarships will be offered each semester beginning with the fall 
2010 semester (September - December 2010). Students will need to 
complete an application, which will include a data use proposal and 
letter of support from their faculty adviser.  We anticipate that the 
selection process will be highly competitive.

Stay tuned for further announcements in our newsletter and on our home page!



*New Publications*


(1) 2003 NIST Speaker Recognition Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S03> 
was developed by researchers at NIST (National Institute of Standards 
and Technology). It consists of just over 120 hours of English 
conversational telephone speech used as training data and test data in 
the 2003 Speaker Recognition Evaluation (SRE), along with evaluation 
metadata and test set answer keys.

2003 NIST Speaker Recognition Evaluation is part of an ongoing series of 
yearly evaluations conducted by NIST. These evaluations provide an 
important contribution to the direction of research efforts and the 
calibration of technical capabilities. They are intended to be of 
interest to all researchers working on the general problem of 
text-independent speaker recognition. To this end, the evaluation was designed 
to be simple, to focus on core technology issues, to be fully supported, 
and to be accessible to those wishing to participate.

This speaker recognition evaluation focused on the task of 1-speaker and 
2-speaker detection, in the context of conversational telephone speech.  
The original evaluation consisted of three parts: 1-speaker detection 
"limited data", 2-speaker detection "limited data", and 1-speaker 
detection "extended data". This corpus contains training and test data 
and supporting metadata (including answer keys) for only the 1-speaker 
"limited data" and 2-speaker "limited data" components of the original 
evaluation. The 1-speaker "extended data" component of the original 
evaluation (not included in this corpus) provided metadata only, to be 
used in conjunction with data from Switchboard-2 Phase II (LDC99S79) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79> 
and Switchboard-2 Phase III Audio (LDC2002S06) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S06>. 
The metadata (resources and answer keys) for the 1-speaker "extended 
data" component of the original 2003 SRE evaluation are available from 
the NIST Speech Group website for the 2003 Speaker Recognition 
Evaluation <http://www.itl.nist.gov/iad/mig/tests/sre/2003/index.html>.

The data in this corpus is a 120-hour subset of data first made 
available to the public as Switchboard Cellular Part 2 Audio 
(LDC2004S07) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S07>, 
reorganized specifically for use in the 2003 NIST SRE.



(2)  ACE 2005 Mandarin SpatialML Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T09> 
was developed by researchers at The MITRE Corporation 
<http://www.mitre.org/> (MITRE). ACE 2005 Mandarin SpatialML Annotations 
applies SpatialML tags to a subset of the source Mandarin training data 
in ACE 2005 Multilingual Training Corpus (LDC2006T06).

SpatialML is a mark-up language for representing spatial expressions in 
natural language documents. SpatialML focuses on geography and 
culturally-relevant landmarks, rather than biology, cosmology, geology, 
or other regions of the spatial language domain. The goal is to allow 
for better integration of text collections with resources such as 
databases that provide spatial information about a domain, including 
gazetteers, physical feature databases and mapping services.

The SpatialML annotation scheme is intended to emulate earlier progress 
on time expressions such as TIMEX2 <http://fofoca.mitre.org/>, TimeML 
<http://www.timeml.org/site/index.html>, and the 2005 ACE guidelines 
<http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05eval_official_results_20060110.html>. 
The main SpatialML tag is the PLACE tag, which encodes information about 
location. The central goal of SpatialML is to map location information 
in text to data from gazetteers and other databases to the extent 
possible by defining attributes in the PLACE tag. Therefore, semantic 
attributes such as country abbreviations, country subdivision and 
dependent area abbreviations (e.g., US states), and geo-coordinates are 
used to help establish such a mapping. The SpatialML guidelines are 
compatible with existing guidelines for spatial annotation and existing 
corpora within the ACE research program.

This corpus consists of a 298-document subset of broadcast material from 
the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been 
tagged by a native Mandarin speaker according to version 2.3 of the 
SpatialML annotation guidelines, which are included in the documentation 
for this release.



(3)  NIST 2002 Open Machine Translation (OpenMT) Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T10> 
is a package containing source data, reference translations, and scoring 
software used in the NIST 2002 OpenMT evaluation. It is designed to help 
evaluate the effectiveness of machine translation systems. The package 
was compiled and scoring software was developed by researchers at NIST, 
making use of newswire source data and reference translations collected 
and developed by LDC.

The objective of the NIST OpenMT evaluation series is to support 
research in, and help advance the state of the art of, machine 
translation (MT) technologies -- technologies that translate text 
between human languages. Input may include all forms of text. The goal 
is for the output to be an adequate and fluent translation of the 
original. Additional information about these evaluations may be found at 
the NIST Open Machine Translation (OpenMT) Evaluation web site 
<http://www.itl.nist.gov/iad/mig/tests/mt/>.

This evaluation kit includes a single Perl script that may be used to 
produce a translation quality score for one or more MT systems. The 
script works by comparing the system output translation with a set of 
(expert) reference translations of the same source text. Comparison is 
based on finding sequences of words in the reference translations that 
match word sequences in the system output translation.
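
The word-sequence comparison described above can be illustrated with a 
minimal Python sketch. This is not the NIST scoring script; it only 
shows one common way to count matching word sequences (n-grams) between 
a system output and several reference translations. The function names, 
the whitespace tokenization, and the clipping rule are assumptions made 
for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in some reference.

    Assumption for this sketch: text is tokenized on whitespace, and each
    hypothesis n-gram is credited at most as many times as it appears in
    the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # For every n-gram, record its highest count in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())
```

For example, clipped_precision("the the the cat", ["the cat sat"], 1) 
credits "the" only once, because the reference contains it once, so two 
of the four output words count as matches.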

The Chinese-language source text included in this corpus is a 
reorganization of data that was initially released to the public as 
Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T17>. 
The Chinese-language reference translations are a reorganized subset of 
data from the same MTC corpus. The Arabic-language data (source text and 
reference translations) is a reorganized subset of data that was 
initially released to the public as Multiple-Translation Arabic (MTA) 
Part 1 (LDC2003T18) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T18>. 
All source data for this corpus is newswire text.

For each language, the test set consists of two files, a source and a 
reference file. Each reference file contains four independent 
translations of the data set. The evaluation year, source language, test 
set, version of the data, and source vs. reference file are reflected in 
the file name.




------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
