<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"> </div>
<p class="MsoNormal" align="center"> - <b><a href="#data">Fall 2011
LDC Data Scholarship Program</a></b> -</p>
<p class="MsoNormal" align="center"><i> New publications:</i></p>
<div align="center"> </div>
<p class="MsoNormal" align="center">- <b><a href="#rt">2005 Spring
NIST Rich Transcription (RT-05S) Evaluation Set</a></b>
-</p>
<div align="center"> </div>
<p class="MsoNormal" align="center">- <b><a href="#sre">2008 NIST
Speaker Recognition Evaluation Training Set Part 1</a></b>
-</p>
<div align="center"> </div>
<p class="MsoNormal" align="center">- <b><a href="#atb">Arabic
Treebank: Part 2 v 3.1</a></b> -</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr width="100%" align="center" size="2"></div>
<p class="MsoNormal" align="center"> </p>
<div align="center"> </div>
<p class="MsoNormal" align="center"> <a name="data"></a><b>Fall 2011
LDC Data Scholarship Program</b></p>
<p class="MsoNormal"><br>
Applications are now being accepted through September 15, 2011 for
the Fall 2011 LDC Data Scholarship program! The LDC Data
Scholarship program provides university students with access to
LDC data at no cost. During the previous two cycles of the
program, LDC has awarded no-cost copies of LDC data valued at over
US$25,000. <br>
<br>
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research
agenda and a bona fide inability to pay. The selection process is
highly competitive. <br>
<br>
The application consists of two parts: <br>
<br>
(1) <b>Data Use Proposal</b>. Applicants must submit a proposal
describing their intended use of the data. The proposal must
contain the applicant's name, university, and field of study. The
proposal should state which data the student plans to use and
contain a description of their research project. <br>
<br>
Applicants should consult the <a
href="http://www.ldc.upenn.edu/Catalog/index.jsp">LDC Corpus
Catalog</a> for a complete list of data distributed by LDC. Due
to certain restrictions, a handful of LDC corpora are available
only to members of the Consortium. Applicants are advised to select
at most one or two datasets; students may apply for additional
datasets during the following cycle once they have finished
processing the initial datasets and have published or presented
work in a juried venue.<br>
<br>
(2) <b>Letter of Support</b>. Applicants must submit one letter
of support from their thesis adviser or department chair. The
letter must confirm that the department or university lacks the
funding to pay the full Non-member Fee for the data and verify the
student's need for data.<br>
<br>
For further information on application materials and program
rules, please visit the <a
href="http://www.ldc.upenn.edu/About/scholarships.html">LDC Data
Scholarship</a> page. <br>
<br>
Students can email their applications to the <a
href="mailto:datascholarships@ldc.upenn.edu">LDC Data
Scholarship program</a>. Decisions will be sent by email from
the same address.<br>
<br>
The deadline for the Fall 2011 program cycle is September 15,
2011. <br>
</p>
<br>
<p class="MsoNormal" align="center"><b>New Publications</b></p>
<p class="MsoNormal"><a name="rt"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S06">2005
Spring
NIST Rich Transcription (RT-05S) Conference Meeting Evaluation
Set</a> was developed by LDC and the National Institute of
Standards and Technology (NIST). It contains approximately 78
hours of English meeting speech, reference transcripts and other
material used in the <a
href="http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html">RT
Spring
2005 evaluation</a>. Rich Transcription (RT) is broadly defined
as a fusion of speech-to-text (STT) technology and metadata
extraction technologies providing the bases for the generation of
more usable transcriptions of human-human speech in meetings.</p>
<p class="MsoNormal">RT-05S included the following tasks in the
meeting domain: </p>
<blockquote>
<p class="MsoNormal">Speech-To-Text (STT) - convert spoken words
into streams of text </p>
<p class="MsoNormal">Speaker Diarization (SPKR) - find the
segments of time within a meeting in which each meeting
participant is talking </p>
<p class="MsoNormal">Speech Activity Detection (SAD) - detect when
someone in a meeting space is talking </p>
</blockquote>
<p class="MsoNormal">Further information about the evaluation is
available on the <a
href="http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html">RT-05
Spring
Evaluation Website</a>. </p>
<p class="MsoNormal">The data in this release consists of portions
of meeting speech collected between 2001 and 2005 by the IDIAP
Research Institute's Augmented Multi-Party Interaction project
(AMI), Martigny, Switzerland; International Computer Science
Institute (ICSI) at University of California, Berkeley;
Interactive Systems Laboratories (ISL) at Carnegie Mellon
University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic
Institute and State University (VT), Blacksburg, VA. Each meeting
excerpt contains a head-mic recording for each subject and one or
more distant microphone recordings.</p>
<p class="MsoNormal">Reference transcripts for the evaluation
excerpts were prepared by LDC according to its Meeting Recording
Careful Transcription Guidelines. Those specifications are
designed to provide an accurate, verbatim (word-for-word)
transcription, time-aligned with the audio file and including the
identification of additional audio and speech signals with special
mark-up.<br>
</p>
<br>
<p class="MsoNormal" align="center"> *</p>
<p class="MsoNormal"><a name="sre"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S05">2008
NIST
Speaker Recognition Evaluation Training Set Part 1</a> was
developed by LDC and the National Institute of Standards and
Technology (NIST). It contains 640 hours of multilingual telephone
speech and English interview speech along with transcripts and
other materials used as training data in the <a
href="http://www.itl.nist.gov/iad/mig/tests/spk/2008/index.html">2008
NIST Speaker Recognition Evaluation (SRE)</a>. </p>
<p class="MsoNormal">SRE is part of an ongoing series of evaluations
conducted by NIST. These evaluations are an important contribution
to the direction of research efforts and the calibration of
technical capabilities. They are intended to be of interest to all
researchers working on the general problem of text-independent
speaker recognition. </p>
<p class="MsoNormal">The 2008 evaluation was distinguished from
prior evaluations, in particular those in 2005 and 2006, by
including not only conversational telephone speech data but also
conversational speech data of comparable duration recorded over a
microphone channel involving an interview scenario.</p>
<p class="MsoNormal">The speech data in this release was collected
in 2007 by LDC at its <a
href="http://www.ldc.upenn.edu/About/facilities.shtml">Human
Subjects Data Collection Laboratories</a> in Philadelphia and by
the <a href="http://www.icsi.berkeley.edu/">International
Computer Science Institute</a> (ICSI) at the University of
California, Berkeley. This collection was part of the <a
href="http://projects.ldc.upenn.edu/Mixer/">Mixer 5</a> project,
which was designed to support the development of robust speaker
recognition technology by providing carefully collected and
audited speech from a large pool of speakers recorded
simultaneously across numerous microphones and in different
communicative situations and/or in multiple languages. Mixer
participants were native English and bilingual English speakers.
The telephone speech in this corpus is predominantly English; all
interview segments are in English. Telephone speech represents
approximately 565 hours of the data, whereas microphone speech
represents the other 75 hours.</p>
<p class="MsoNormal">The telephone speech segments include excerpts
of 8-12 seconds and of 5 minutes taken from longer original
conversations. The interview material includes short conversational
interview segments of approximately 3 minutes taken from a longer
interview session. English-language transcripts in .cfm format
were produced using an automatic speech recognition (ASR) system.</p>
<br>
<div align="center"> </div>
<p class="MsoNormal" align="center">*</p>
<p class="MsoNormal"><a name="atb"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T09">Arabic
Treebank:
Part 2 (ATB2) v 3.1</a> was developed at LDC. It consists of 501
newswire stories from Ummah Press with part-of-speech (POS),
morphology, gloss and syntactic treebank annotation in accordance
with the <a href="http://projects.ldc.upenn.edu/ArabicTreebank/">Penn
Arabic Treebank (PATB) Guidelines</a> developed in 2008 and
2009. This release represents a significant revision of LDC's
previous ATB2 publication: <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T02">Arabic
Treebank:
Part 2 v 2.0 LDC2004T02</a>. </p>
<p class="MsoNormal">The ongoing PATB project supports research in
Arabic-language natural language processing and human language
technology development. The methodology and work leading to the
release of this publication are described in detail in the
documentation accompanying this corpus and in two research papers:
<a
href="http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf">Enhancing
the
Arabic Treebank: A Collaborative Effort toward New Annotation
Guidelines</a> and <a
href="http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf">Consistent
and
Flexible Integration of Morphological Annotation in the Arabic
Treebank</a>. </p>
<p class="MsoNormal">ATB2 v 3.1 contains a total of 144,199 source
tokens before clitics are split, and 169,319 tree tokens after
clitics are separated for the treebank annotation. Source texts
were selected from Ummah Press news archives covering the period
from July 2001 through September 2002. </p>
<br>
<hr width="100%" size="2"><br>
<pre class="moz-signature" cols="72">Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>