<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#ffffff" text="#000000">


    <p class="MsoNormal" align="center"> - <b><a href="#data">Fall 2011

          LDC Data Scholarship Program</a></b>  -</p>


    <p class="MsoNormal" align="center"><i> New publications:</i></p>


    <p class="MsoNormal" align="center">-  <b><a href="#rt">2005 Spring

          NIST Rich Transcription (RT-05S) Evaluation Set</a></b> 

      -</p>


    <p class="MsoNormal" align="center">-  <b><a href="#sre">2008 NIST

          Speaker Recognition Evaluation Training Set Part 1</a></b> 

      -</p>


    <p class="MsoNormal" align="center">-  <b><a href="#atb">Arabic

          Treebank: Part 2 v 3.1</a></b>  -</p>

    <div class="MsoNormal" style="text-align: center;" align="center">

      <hr width="100%" align="center" size="2"></div>


    <p class="MsoNormal" align="center"> <a name="data"></a><b>Fall 2011

        LDC Data Scholarship Program</b></p>

    <p class="MsoNormal"><br>

      Applications are now being accepted through September 15, 2011 for

      the Fall 2011 LDC Data Scholarship program!  The LDC Data

      Scholarship program provides university students with access to

      LDC data at no cost. Over the previous two cycles of the

      program, LDC has awarded no-cost copies of LDC data valued at over

      US$25,000. <br>

      <br>

      This program is open to students pursuing undergraduate or

      graduate studies at an accredited college or university. LDC Data

      Scholarships are not restricted to any particular field of study;

      however, students must demonstrate a well-developed research

      agenda and a bona fide inability to pay. The selection process is

      highly competitive.  <br>

      <br>

      The application consists of two parts: <br>

      <br>

      (1)  <b>Data Use Proposal</b>. Applicants must submit a proposal

      describing their intended use of the data. The proposal must

      contain the applicant's name, university, and field of study. The

      proposal should state which data the student plans to use and

      contain a description of their research project.  <br>

      <br>

      Applicants should consult the <a

        href="http://www.ldc.upenn.edu/Catalog/index.jsp">LDC Corpus

        Catalog</a> for a complete list of data distributed by LDC.  Due

      to certain restrictions, a handful of LDC corpora are available

      only to members of the Consortium. Applicants are advised to select

      no more than two datasets; students may apply for additional

      datasets during the following cycle once they have completed

      processing of the initial datasets and published or presented work

      in a juried venue.<br>

      <br>

      (2) <b>Letter of Support</b>. Applicants must submit one letter

      of support from their thesis adviser or department chair. The

      letter must confirm that the department or university lacks the

      funding to pay the full Non-member Fee for the data and verify the

      student's need for data.<br>

      <br>

      For further information on application materials and program

      rules, please visit the <a

        href="http://www.ldc.upenn.edu/About/scholarships.html">LDC Data

        Scholarship</a> page.  <br>

      <br>

      Students can email their applications to the <a

        href="mailto:datascholarships@ldc.upenn.edu">LDC Data

        Scholarship program</a>. Decisions will be sent by email from

      the same address.<br>

      <br>

      The deadline for the Fall 2011 program cycle is September 15,

      2011. <br>

    </p>

    <br>

    <p class="MsoNormal" align="center"><b>New Publications</b></p>

    <p class="MsoNormal"><a name="rt"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S06">2005

Spring

        NIST Rich Transcription (RT-05S) Conference Meeting Evaluation

        Set</a> was developed by LDC and the National Institute of

      Standards and Technology (NIST). It contains approximately 78

      hours of English meeting speech, reference transcripts and other

      material used in the <a

        href="http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html">RT

Spring

        2005 evaluation</a>. Rich Transcription (RT) is broadly defined

      as a fusion of speech-to-text (STT) technology and metadata

      extraction technologies providing the bases for the generation of

      more usable transcriptions of human-human speech in meetings.</p>

    <p class="MsoNormal">RT-05S included the following tasks in the

      meeting domain: </p>

    <blockquote>

      <p class="MsoNormal">Speech-To-Text (STT) - convert spoken words

        into streams of text </p>

      <p class="MsoNormal">Speaker Diarization (SPKR) - find the

        segments of time within a meeting in which each meeting

        participant is talking </p>

      <p class="MsoNormal">Speech Activity Detection (SAD) - detect when

        someone in a meeting space is talking </p>

    </blockquote>

    <p class="MsoNormal">Further information about the evaluation is

      available on the <a

        href="http://www.itl.nist.gov/iad/mig/tests/rt/2005-spring/index.html">RT-05

Spring

        Evaluation Website</a>. </p>

    <p class="MsoNormal">The data in this release consists of portions

      of meeting speech collected between 2001 and 2005 by the IDIAP

      Research Institute's Augmented Multi-Party Interaction project

      (AMI), Martigny, Switzerland; International Computer Science

      Institute (ICSI) at University of California, Berkeley;

      Interactive Systems Laboratories (ISL) at Carnegie Mellon

      University (CMU), Pittsburgh, PA; NIST; and Virginia Polytechnic

      Institute and State University (VT), Blacksburg, VA. Each meeting

      excerpt contains a head-mic recording for each subject and one or

      more distant microphone recordings.</p>

    <p class="MsoNormal">Reference transcripts for the evaluation

      excerpts were prepared by LDC according to its Meeting Recording

      Careful Transcription Guidelines. Those specifications are

      designed to provide an accurate, verbatim (word-for-word)

      transcription, time-aligned with the audio file and including the

      identification of additional audio and speech signals with special

      mark-up.<br>

    </p>

    <br>

    <p class="MsoNormal" align="center"> *</p>

    <p class="MsoNormal"> (2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S05">2008

NIST

        Speaker Recognition Evaluation Training Set Part 1</a> was

      developed by LDC and the National Institute of Standards and

      Technology (NIST). It contains 640 hours of multilingual telephone

      speech and English interview speech along with transcripts and

      other materials used as training data in the <a

        href="http://www.itl.nist.gov/iad/mig/tests/spk/2008/index.html">2008

        NIST Speaker Recognition Evaluation (SRE)</a>. </p>

    <p class="MsoNormal">SRE is part of an ongoing series of evaluations

      conducted by NIST. These evaluations are an important contribution

      to the direction of research efforts and the calibration of

      technical capabilities. They are intended to be of interest to all

      researchers working on the general problem of text-independent

      speaker recognition. </p>

    <p class="MsoNormal">The 2008 evaluation was distinguished from

      prior evaluations, in particular those in 2005 and 2006, by

      including not only conversational telephone speech data but also

      conversational speech data of comparable duration recorded over a

      microphone channel involving an interview scenario.</p>

    <p class="MsoNormal">The speech data in this release was collected

      in 2007 by LDC at its <a

        href="http://www.ldc.upenn.edu/About/facilities.shtml">Human

        Subjects Data Collection Laboratories</a> in Philadelphia and by

      the <a href="http://www.icsi.berkeley.edu/">International

        Computer Science Institute</a> (ICSI) at the University of

      California, Berkeley. This collection was part of the <a

        href="http://projects.ldc.upenn.edu/Mixer/">Mixer 5</a> project,

      which was designed to support the development of robust speaker

      recognition technology by providing carefully collected and

      audited speech from a large pool of speakers recorded

      simultaneously across numerous microphones and in different

      communicative situations and/or in multiple languages. Mixer

      participants were native English and bilingual English speakers.

      The telephone speech in this corpus is predominately English; all

      interview segments are in English. Telephone speech represents

      approximately 565 hours of the data, whereas microphone speech

      represents the other 75 hours.</p>

    <p class="MsoNormal">The telephone speech segments include excerpts

      of 8-12 seconds and of 5 minutes taken from longer original

      conversations. The interview material includes short conversational

      interview segments of approximately 3 minutes taken from a longer

      interview session. English language transcripts in .cfm format

      were produced using an automatic speech recognition (ASR) system.</p>

    <br>


    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="atb"></a>(3) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T09">Arabic

Treebank:

        Part 2 (ATB2) v 3.1</a> was developed at LDC. It consists of 501

      newswire stories from Ummah Press with part-of-speech (POS),

      morphology, gloss and syntactic treebank annotation in accordance

      with the <a href="http://projects.ldc.upenn.edu/ArabicTreebank/">Penn

        Arabic Treebank (PATB) Guidelines</a> developed in 2008 and

      2009. This release represents a significant revision of LDC's

      previous ATB2 publication: <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T02">Arabic

Treebank:

        Part 2 v 2.0 LDC2004T02</a>. </p>

    <p class="MsoNormal">The ongoing PATB project supports research in

      Arabic-language natural language processing and human language

      technology development. The methodology and work leading to the

      release of this publication are described in detail in the

      documentation accompanying this corpus and in two research papers:

      <a

href="http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf">Enhancing

the

        Arabic Treebank: A Collaborative Effort toward New Annotation

        Guidelines</a> and <a

href="http://papers.ldc.upenn.edu/LREC2010/KulickBiesMaamouri-LREC2010.pdf">Consistent

and

        Flexible Integration of Morphological Annotation in the Arabic

        Treebank</a>. </p>

    <p class="MsoNormal">ATB2 v 3.1 contains a total of 144,199 source

      tokens before clitics are split, and 169,319 tree tokens after

      clitics are separated for the treebank annotation. Source texts

      were selected from Ummah Press news archives covering the period

      from July 2001 through September 2002. </p>

    <br>

    <hr width="100%" size="2"><br>

    <pre class="moz-signature" cols="72">Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>


  </body>

</html>