[Corpora-List] News from LDC

Tue Aug 23 14:56:17 UTC 2011

  - *Fall 2011 LDC Data Scholarship Program*

New publications:

  - *2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set*

  - *2008 NIST Speaker Recognition Evaluation Training Set Part 1*

  - *Arabic Treebank: Part 2 v 3.1*

------------------------------------------------------------------------

*Fall 2011 LDC Data Scholarship Program*

Applications are now being accepted through September 15, 2011 for the 
Fall 2011 LDC Data Scholarship program!  The LDC Data Scholarship 
program provides university students with access to LDC data at 
no cost.  During the previous two cycles of the program, LDC has awarded 
no-cost copies of LDC data valued at over US$25,000.

This program is open to students pursuing either undergraduate or 
graduate studies at an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) *Data Use Proposal*. Applicants must submit a proposal describing 
their intended use of the data. The proposal must contain the 
applicant's name, university, and field of study. The proposal should 
state which data the student plans to use and contain a description of 
their research project.

Applicants should consult the LDC Corpus Catalog 
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data 
distributed by LDC.  A handful of LDC corpora are available only to 
members of the Consortium.  Applicants are advised to select at most 
two datasets; students may apply for additional datasets during the 
following cycle once they have completed processing of the initial 
datasets and published or presented work in some juried venue.

(2) *Letter of Support*. Applicants must submit one letter of support 
from their thesis adviser or department chair. The letter must confirm 
that the department or university lacks the funding to pay the full 
Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<http://www.ldc.upenn.edu/About/scholarships.html> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Fall 2011 program cycle is September 15, 2011.

*New Publications*

(1) 2005 Spring NIST Rich Transcription (RT-05S) Conference Meeting 
Evaluation Set 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S06> 
was developed by LDC and the National Institute of Standards and 
Technology (NIST). It contains approximately 78 hours of English meeting 
speech, reference transcripts and other material used in the RT Spring 
2005 evaluation 
<http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html>. Rich 
Transcription (RT) is broadly defined as a fusion of speech-to-text 
(STT) technology and metadata extraction technologies that provides the 
basis for generating more usable transcriptions of human-human 
speech in meetings.

RT-05S included the following tasks in the meeting domain:

    Speech-To-Text (STT) - convert spoken words into streams of text

    Speaker Diarization (SPKR) - find the segments of time within a
    meeting in which each meeting participant is talking

    Speech Activity Detection (SAD) - detect when someone in a meeting
    space is talking
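
The relationship between the SPKR and SAD tasks can be illustrated with a 
minimal sketch. It assumes a hypothetical toy representation in which each 
speaker has a list of (start, end) times in seconds; the names 
`merge_segments`, `speech_activity`, and the sample data are illustrative 
only and are not part of the RT-05S tools or file formats. Merging the 
per-speaker segments yields the speaker-independent speech regions that a 
SAD system would be expected to find.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), hypothetical representation


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or touching time segments into a sorted list."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def speech_activity(diarization: Dict[str, List[Segment]]) -> List[Segment]:
    """Collapse per-speaker segments (SPKR-style) into speech regions (SAD-style)."""
    all_segments = [seg for segs in diarization.values() for seg in segs]
    return merge_segments(all_segments)


# Toy meeting excerpt with two speakers who briefly overlap (made-up data).
diarization = {
    "spk_A": [(0.0, 4.0), (10.0, 12.5)],
    "spk_B": [(3.5, 7.0)],
}

for start, end in speech_activity(diarization):
    print(f"speech from {start:.1f}s to {end:.1f}s")
```

In this sketch the diarization output keeps track of who is talking, while 
the merged result records only that someone is talking.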

Further information about the evaluation is available on the RT-05 
Spring Evaluation Website 
<http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html>.

The data in this release consists of portions of meeting speech 
collected between 2001 and 2005 by the IDIAP Research Institute's 
Augmented Multi-Party Interaction project (AMI), Martigny, Switzerland; 
International Computer Science Institute (ICSI) at University of 
California, Berkeley; Interactive Systems Laboratories (ISL) at Carnegie 
Mellon University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic 
Institute and State University (VT), Blacksburg, VA. Each meeting 
excerpt contains a head-mic recording for each subject and one or more 
distant microphone recordings.

Reference transcripts for the evaluation excerpts were prepared by LDC 
according to its Meeting Recording Careful Transcription Guidelines. 
Those specifications are designed to provide an accurate, verbatim 
(word-for-word) transcription, time-aligned with the audio file and 
including the identification of additional audio and speech signals with 
special mark-up.

(2) 2008 NIST Speaker Recognition Evaluation Training Set Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S05> 
was developed by LDC and the National Institute of Standards and 
Technology (NIST). It contains 640 hours of multilingual telephone 
speech and English interview speech along with transcripts and other 
materials used as training data in the 2008 NIST Speaker Recognition 
Evaluation (SRE) 
<http://www.itl.nist.gov/iad/mig/tests/spk/2008/index.html>.

SRE is part of an ongoing series of evaluations conducted by NIST. These 
evaluations are an important contribution to the direction of research 
efforts and the calibration of technical capabilities. They are intended 
to be of interest to all researchers working on the general problem of 
text-independent speaker recognition.

The 2008 evaluation was distinguished from prior evaluations, in 
particular those in 2005 and 2006, by including not only conversational 
telephone speech data but also conversational speech data of comparable 
duration recorded over a microphone channel involving an interview scenario.

The speech data in this release was collected in 2007 by LDC at its 
Human Subjects Data Collection Laboratories 
<http://www.ldc.upenn.edu/About/facilities.shtml> in Philadelphia and by 
the International Computer Science Institute 
<http://www.icsi.berkeley.edu/> (ICSI) at the University of California, 
Berkeley. This collection was part of the Mixer 5 
<http://projects.ldc.upenn.edu/Mixer/> project, which was designed to 
support the development of robust speaker recognition technology by 
providing carefully collected and audited speech from a large pool of 
speakers recorded simultaneously across numerous microphones and in 
different communicative situations and/or in multiple languages. Mixer 
participants were native English and bilingual English speakers. The 
telephone speech in this corpus is predominantly English; all interview 
segments are in English. Telephone speech represents approximately 565 
hours of the data, whereas microphone speech represents the other 75 hours.

The telephone speech segments include excerpts of 8-12 seconds and of 
5 minutes taken from longer original conversations. The interview 
material includes short conversational interview segments of approximately 
3 minutes from a longer interview session. English language transcripts 
in .cfm format were produced using an automatic speech recognition (ASR) 
system.

(3) Arabic Treebank: Part 2 (ATB2) v 3.1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T09> 
was developed at LDC. It consists of 501 newswire stories from Ummah 
Press with part-of-speech (POS), morphology, gloss and syntactic 
treebank annotation in accordance with the Penn Arabic Treebank (PATB) 
Guidelines <http://projects.ldc.upenn.edu/ArabicTreebank/> developed in 
2008 and 2009. This release represents a significant revision of LDC's 
previous ATB2 publication: Arabic Treebank: Part 2 v 2.0 LDC2004T02 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T02>.

The ongoing PATB project supports research in Arabic-language natural 
language processing and human language technology development. The 
methodology and work leading to the release of this publication are 
described in detail in the documentation accompanying this corpus and in 
two research papers: Enhancing the Arabic Treebank: A Collaborative 
Effort toward New Annotation Guidelines 
<http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf> and 
Consistent and Flexible Integration of Morphological Annotation in the 
Arabic Treebank 
<http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf>.

ATB2 v 3.1 contains a total of 144,199 source tokens before clitics are 
split, and 169,319 tree tokens after clitics are separated for the 
treebank annotation. Source texts were selected from Ummah Press news 
archives covering the period from July 2001 through September 2002.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                  ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
