[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Aug 23 14:56:17 UTC 2011
- *Fall 2011 LDC Data Scholarship Program*
New publications:
- *2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set*
- *2008 NIST Speaker Recognition Evaluation Training Set Part 1*
- *Arabic Treebank: Part 2 v 3.1*
------------------------------------------------------------------------
*Fall 2011 LDC Data Scholarship Program*
Applications are now being accepted through September 15, 2011 for the
Fall 2011 LDC Data Scholarship program! The LDC Data Scholarship
program provides university students with access to LDC data at
no cost. During the previous two cycles of the program, LDC awarded
no-cost copies of LDC data valued at over US$25,000.
This program is open to students pursuing undergraduate or graduate
studies at an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) *Data Use Proposal*. Applicants must submit a proposal describing
their intended use of the data. The proposal must contain the
applicant's name, university, and field of study. The proposal should
state which data the student plans to use and contain a description of
their research project.
Applicants should consult the LDC Corpus Catalog
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data
distributed by LDC. Due to certain restrictions, a handful of LDC
corpora are available only to members of the Consortium. Applicants are
advised to select a maximum of one to two datasets; students may apply
for additional datasets during the following cycle once they have
completed processing of the initial datasets and have published or
presented work in a juried venue.
(2) *Letter of Support*. Applicants must submit one letter of support
from their thesis adviser or department chair. The letter must confirm
that the department or university lacks the funding to pay the full
Non-member Fee for the data and verify the student's need for data.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<http://www.ldc.upenn.edu/About/scholarships.html> page.
Students can email their applications to the LDC Data Scholarship
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent
by email from the same address.
The deadline for the Fall 2011 program cycle is September 15, 2011.
*New Publications*
(1) 2005 Spring NIST Rich Transcription (RT-05S) Conference Meeting
Evaluation Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S06>
was developed by LDC and the National Institute of Standards and
Technology (NIST). It contains approximately 78 hours of English meeting
speech, reference transcripts and other material used in the RT Spring
2005 evaluation
<http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html>. Rich
Transcription (RT) is broadly defined as a fusion of speech-to-text
(STT) technology and metadata extraction technologies that together
provide the basis for generating more usable transcriptions of
human-human speech in meetings.
RT-05S included the following tasks in the meeting domain:
    Speech-To-Text (STT) - convert spoken words into streams of text
    Speaker Diarization (SPKR) - find the segments of time within a
        meeting in which each meeting participant is talking
    Speech Activity Detection (SAD) - detect when someone in a meeting
        space is talking
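As an informal illustration of how the SPKR and SAD tasks relate, the sketch below merges per-speaker time segments into overall speech regions. It is not part of the evaluation tooling; the function name `speech_activity`, the speaker labels and the segment times are all hypothetical, and segments are assumed to be (start, end) pairs in seconds.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); assumed representation


def speech_activity(diarization: Dict[str, List[Segment]]) -> List[Segment]:
    """Merge per-speaker segments into regions where anyone is talking."""
    # Pool the segments of all speakers and sort them by start time.
    all_segments = sorted(seg for segs in diarization.values() for seg in segs)
    merged: List[Segment] = []
    for start, end in all_segments:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical diarization output for a two-person meeting excerpt.
example = {
    "speaker_A": [(0.0, 4.5), (10.0, 12.0)],
    "speaker_B": [(3.0, 6.0), (15.0, 18.5)],
}
print(speech_activity(example))
```

In this toy example the overlapping turns of the two speakers collapse into a single speech region, which is the distinction between the two tasks: SPKR asks who is talking when, while SAD only asks whether anyone is talking.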
Further information about the evaluation is available on the RT-05
Spring Evaluation Website
<http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html>.
The data in this release consists of portions of meeting speech
collected between 2001 and 2005 by the IDIAP Research Institute's
Augmented Multi-Party Interaction project (AMI), Martigny, Switzerland;
International Computer Science Institute (ICSI) at University of
California, Berkeley; Interactive Systems Laboratories (ISL) at Carnegie
Mellon University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic
Institute and State University (VT), Blacksburg, VA. Each meeting
excerpt contains a head-mic recording for each subject and one or more
distant microphone recordings.
Reference transcripts for the evaluation excerpts were prepared by LDC
according to its Meeting Recording Careful Transcription Guidelines.
Those specifications are designed to provide an accurate, verbatim
(word-for-word) transcription, time-aligned with the audio file and
including the identification of additional audio and speech signals with
special mark-up.
(2) 2008 NIST Speaker Recognition Evaluation Training Set Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S05>
was developed by LDC and the National Institute of Standards and
Technology (NIST). It contains 640 hours of multilingual telephone
speech and English interview speech along with transcripts and other
materials used as training data in the 2008 NIST Speaker Recognition
Evaluation (SRE)
<http://www.itl.nist.gov/iad/mig/tests/spk/2008/index.html>.
SRE is part of an ongoing series of evaluations conducted by NIST. These
evaluations are an important contribution to the direction of research
efforts and the calibration of technical capabilities. They are intended
to be of interest to all researchers working on the general problem of
text-independent speaker recognition.
The 2008 evaluation was distinguished from prior evaluations, in
particular those in 2005 and 2006, by including not only conversational
telephone speech data but also conversational speech data of comparable
duration recorded over a microphone channel in an interview scenario.
The speech data in this release was collected in 2007 by LDC at its
Human Subjects Data Collection Laboratories
<http://www.ldc.upenn.edu/About/facilities.shtml> in Philadelphia and by
the International Computer Science Institute
<http://www.icsi.berkeley.edu/> (ICSI) at the University of California,
Berkeley. This collection was part of the Mixer 5
<http://projects.ldc.upenn.edu/Mixer/> project, which was designed to
support the development of robust speaker recognition technology by
providing carefully collected and audited speech from a large pool of
speakers recorded simultaneously across numerous microphones and in
different communicative situations and/or in multiple languages. Mixer
participants were native English and bilingual English speakers. The
telephone speech in this corpus is predominantly English; all interview
segments are in English. Telephone speech represents approximately 565
hours of the data, whereas microphone speech represents the other 75 hours.
The telephone speech segments include excerpts of 8-12 seconds and of
5 minutes taken from longer original conversations. The interview
material includes short conversational interview segments of approximately
3 minutes from a longer interview session. English language transcripts
in .cfm format were produced using an automatic speech recognition (ASR)
system.
(3) Arabic Treebank: Part 2 (ATB2) v 3.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T09>
was developed at LDC. It consists of 501 newswire stories from Ummah
Press with part-of-speech (POS), morphology, gloss and syntactic
treebank annotation in accordance with the Penn Arabic Treebank (PATB)
Guidelines <http://projects.ldc.upenn.edu/ArabicTreebank/> developed in
2008 and 2009. This release represents a significant revision of LDC's
previous ATB2 publication: Arabic Treebank: Part 2 v 2.0 LDC2004T02
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T02>.
The ongoing PATB project supports research in Arabic-language natural
language processing and human language technology development. The
methodology and work leading to the release of this publication are
described in detail in the documentation accompanying this corpus and in
two research papers: Enhancing the Arabic Treebank: A Collaborative
Effort toward New Annotation Guidelines
<http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf> and
Consistent and Flexible Integration of Morphological Annotation in the
Arabic Treebank
<http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf>.
ATB2 v 3.1 contains a total of 144,199 source tokens before clitics are
split, and 169,319 tree tokens after clitics are separated for the
treebank annotation. Source texts were selected from Ummah Press news
archives covering the period from July 2001 through September 2002.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: 1 (215) 573-1275
University of Pennsylvania        Fax: 1 (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu