<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    <div class="moz-text-html" lang="x-western">

      <div class="moz-text-html" lang="x-western">

        <div align="center"> </div>

        <p class="MsoNormal" align="center"><a href="#scholar"><b>Fall
              2011 LDC Data Scholarship recipients <br>
            </b></a></p>

        <div align="center"> </div>

        <div align="center"><i>New publications:</i></div>

        <div align="center"> </div>

        <p class="MsoNormal" align="center">LDC2011S08 <br>

          <a href="#sre"><b>2008 NIST Speaker Recognition Evaluation

              Test Set </b></a><br>

        </p>

        <div align="center"> </div>

        <p class="MsoNormal" align="center">LDC2011T11 <br>

          <a href="#argig"><b>Arabic Gigaword Fifth Edition </b></a> </p>

        <div align="center"> </div>

        <p class="MsoNormal" align="center">LDC2011T12 <br>

          <a href="#sp"><b>Spanish Gigaword Third Edition</b></a></p>

        <div class="MsoNormal" style="text-align: center;"

          align="center">

          <hr width="100%" align="center" size="2"></div>

        <p class="MsoNormal" align="center"><br>
          <a name="scholar"></a><b>Fall 2011 LDC Data Scholarship
            recipients</b></p>

        <p class="MsoNormal">LDC is pleased to announce the student
          recipients of the Fall 2011 LDC Data Scholarship program!  The
          LDC Data Scholarship program provides university students with
          access to LDC data at no cost.  Data scholarships are offered
          twice a year to correspond to the Fall and Spring semesters. 
          Students are asked to complete an application consisting
          of a data use proposal and a letter of support from their
          academic adviser.   <br>

          <br>

          LDC received many strong applications from students attending

          universities across the globe.  We've reviewed all the

          applications, and after careful consideration, we have

          selected four scholarship recipients!   These students will

          receive no-cost copies of LDC data:</p>

        <blockquote>

          <p class="MsoNormal">Haris B C - Indian Institute of

            Technology Guwahati (India), Electronics &amp; Electrical

            Engineering.  Haris has been awarded a copy of 2005 NIST

            Speaker Recognition Evaluation Training Data (LDC2011S01)

            and 2005 NIST Speaker Recognition Evaluation Test Data

            (LDC2011S04) to evaluate the performance of a sparse

            representation speaker verification system. <br>

            <br>

            Friðjón Guðjohnsen - Reykjavik University (Iceland),

            Computer Science.  Friðjón has been awarded a copy of

            Treebank-3 (LDC99T42) to be used in the development of

            tagging methods to improve the accuracy of tagging Icelandic

            texts.<br>

            <br>

            Leili Javadpour - Louisiana State University (USA),

            Engineering Science.  Leili has been awarded a copy of BBN

            Pronoun Coreference and Entity Type Corpus (LDC2005T33) and

            Message Understanding Conference (MUC) 7 (LDC2001T02) for

            her work in pronominal anaphora resolution.<br>

            <br>

            Jad Makhlouta - American University of Beirut (Lebanon),

            Electrical and Computer Engineering.  Jad has been awarded a

            copy of LDC Standard Arabic Morphological Analyzer (SAMA)

            Version 3.1 (LDC2010L01) for his work in Arabic text mining.</p>

        </blockquote>

        <p class="MsoNormal"> Please join us in congratulating our
          student recipients!  Look for our upcoming
          announcements about the submission deadlines for the Spring
          2012 LDC Data Scholarship program. </p>

        <br>

        <p class="MsoNormal"> <br>

        </p>

        <div align="center"> <b>New publications</b></div>

        <p class="MsoNormal"><a name="sre"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S08">2008


            NIST Speaker Recognition Evaluation Test Set</a> was

          developed by LDC and NIST (National Institute of Standards and

          Technology). It contains 942 hours of multilingual telephone

          speech and English interview speech along with transcripts and

          other materials used as test data in the <a

            href="http://www.itl.nist.gov/iad/mig/tests/spk/2008/index.html">2008


            NIST Speaker Recognition Evaluation (SRE)</a>. </p>

        <p class="MsoNormal">NIST SRE is part of an ongoing series of
          evaluations conducted by NIST.  These evaluations are intended
          to be of interest to all researchers working on the general
          problem of text-independent speaker recognition. The 2008
          evaluation was distinguished from prior evaluations, in
          particular those in 2005 and 2006, by including not only
          conversational telephone speech data but also conversational
          speech data of comparable duration recorded over a microphone
          channel in an interview scenario.</p>

        <p class="MsoNormal">LDC previously released the 2008 NIST SRE

          Training Set in two parts as <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S05">LDC2011S05</a>

          and <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S07">LDC2011S07</a>.</p>

        <p class="MsoNormal">The speech data in this release was

          collected in 2007 by LDC at its <a

            href="http://www.ldc.upenn.edu/About/facilities.shtml">Human

            Subjects Data Collection Laboratories</a> in Philadelphia

          and by the <a href="http://www.icsi.berkeley.edu/">International


            Computer Science Institute</a> (ICSI) at the University of

          California, Berkeley. This collection was part of the <a

            href="http://projects.ldc.upenn.edu/Mixer/">Mixer 5</a>

          project, which was designed to support the development of

          robust speaker recognition technology by providing carefully

          collected and audited speech from a large pool of speakers

          recorded simultaneously across numerous microphones and in

          different communicative situations and/or in multiple

          languages. Mixer participants were native English speakers and
          bilingual English speakers. The telephone speech in this
          corpus is predominantly English but also includes speech in
          other languages. All interview segments are in English. Telephone

          speech represents approximately 368 hours of the data, whereas

          microphone speech represents the other 574 hours. </p>

        <p class="MsoNormal">English language transcripts in .cfm format

          were produced using an automatic speech recognition (ASR)

          system.</p>

        <p class="MsoNormal"><br>

          <br>

        </p>

        <p class="MsoNormal" align="center">*</p>

        <p class="MsoNormal"><a name="argig"></a>(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T11">Arabic


            Gigaword Fifth Edition</a> is a comprehensive archive of

          newswire text data that has been acquired from Arabic news

          sources over several years by LDC. Arabic Gigaword Fifth

          Edition includes all of the content of the fourth edition of

          Arabic Gigaword (<a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30">LDC2009T30</a>)

          plus new data covering the period from January 1, 2009 through

          December 31, 2010.</p>

        <p class="MsoNormal">Nine distinct sources of Arabic newswire

          are represented in this distribution:<br>

        </p>

        <blockquote>

          <p class="MsoNormal">Asharq Al-Awsat (aaw_arb)</p>

          <p class="MsoNormal">Agence France Presse (afp_arb)</p>

          <p class="MsoNormal">Al-Ahram (ahr_arb)</p>

          <p class="MsoNormal">Assabah (asb_arb)</p>

          <p class="MsoNormal">Al Hayat (hyt_arb)</p>

          <p class="MsoNormal">An Nahar (nhr_arb)</p>

          <p class="MsoNormal">Al-Quds Al-Arabi (qds_arb)</p>

          <p class="MsoNormal">Ummah Press (umh_arb)</p>

          <p class="MsoNormal">Xinhua News Agency (xin_arb)</p>

        </blockquote>

        <p class="MsoNormal">The seven-character codes shown above
          serve as both the names of the directories where the data
          files are found and the prefix that appears at the beginning
          of every file name. Each code consists of a
          three-character source name ID and the three-character
          language code ("arb") separated by an underscore ("_")
          character. The language code conforms to the <a
            href="http://www.sil.org/iso639-3/default.asp">ISO 639-3</a>
          standard.</p>
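        <p class="MsoNormal">As an illustration only, the naming
          convention described above could be split apart along these
          lines. The helper name split_prefix and the sample file name
          are assumptions made for this sketch and are not part of the
          corpus or its tools.</p>
<pre>
```python
import re

# Hypothetical helper, not shipped with the corpus: it splits a
# seven-character prefix such as "afp_arb" into the three-character
# source ID and the three-character language code.
PREFIX = re.compile(r"^([a-z]{3})_([a-z]{3})")


def split_prefix(name):
    """Return (source_id, language_code) taken from the start of a name."""
    match = PREFIX.match(name)
    if match is None:
        raise ValueError("no seven-character prefix in " + repr(name))
    return match.group(1), match.group(2)


# A directory name from the list above.
print(split_prefix("afp_arb"))
# A made-up file name; only the leading prefix is examined.
print(split_prefix("xin_arb_example"))
```
</pre>
        <p class="MsoNormal">Anchoring the pattern at the start of the
          name means the same function works for both directory names
          and file names, since the text says the prefix begins every
          file name.</p>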

        <p class="MsoNormal">In addition to adding new data, the

          following updates were made:</p>

        <blockquote>

          <p class="MsoNormal">Repeated documents in Asharq Al-Awsat

            data from 2008 were removed.</p>

          <p class="MsoNormal">Document formatting and docid duplication

            problems were corrected in Agence France Presse data.</p>

          <p class="MsoNormal">Significant duplication of content in

            2007-2008 An Nahar data was detected, and the duplicated

            documents were removed.</p>

        </blockquote>

        <br>

        <p class="MsoNormal"> <br>

        </p>

        <p class="MsoNormal" align="center">*</p>

        <p class="MsoNormal"><a name="sp"></a>(3) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T12">Spanish


            Gigaword Third Edition</a> was produced by LDC. It is a

          comprehensive archive of Spanish newswire text data that has

          been acquired over several years by LDC. Spanish Gigaword

          Third Edition includes all of the content of the second

          edition (<a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21">LDC2009T21</a>)

          and adds data collected from January 1, 2009 through December

          31, 2010.</p>

        <p class="MsoNormal">The three distinct international sources of

          Spanish newswire in this edition, and the time spans of

          collection covered for each, are as follows:</p>

        <blockquote>

          <p class="MsoNormal">Agence France-Presse, Spanish (afp_spa)

            May 1994 - Dec 2010</p>

          <p class="MsoNormal">Associated Press, Spanish (apw_spa) Nov

            1993 - Dec 2010</p>

          <p class="MsoNormal">Xinhua News Agency, Spanish (xin_spa) Sep

            2001 - Dec 2010</p>

        </blockquote>

        <p class="MsoNormal">The seven-character codes in the parentheses
          above consist of a three-character source name abbreviation
          and the three-character language code ("spa") separated by an
          underscore ("_") character. The language code
          conforms to LDC's internal convention based on the <a
            href="http://www.sil.org/iso639-3/default.asp">ISO 639-3</a>
          standard.</p>

        <p class="MsoNormal">All text data are presented in SGML/XML
          form, using a very simple, minimal markup structure. All text
          consists of printable ASCII, whitespace, and printable code
          points in the "Latin-1 Supplement" character table, as defined
          by both ISO-8859-1 and the Unicode Standard (ISO 10646), which
          covers the accented characters used in Spanish. The
          Latin-1 Supplement characters are encoded in UTF-8.</p>
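        <p class="MsoNormal">The character inventory described above can
          be checked with a short script such as the following sketch.
          The function names are hypothetical, and the exact range
          treated as printable in the Latin-1 Supplement block (U+00A1
          through U+00FF) is an assumption of this example, not
          something stated in the corpus documentation.</p>
<pre>
```python
# Assumed ranges for this sketch.
ASCII_PRINTABLE = range(0x20, 0x7F)      # space through tilde
LATIN1_SUPPLEMENT = range(0xA1, 0x100)   # U+00A1 through U+00FF (assumed)


def is_allowed(ch):
    """True for whitespace, printable ASCII, or Latin-1 Supplement."""
    code = ord(ch)
    return ch in "\t\n\r" or code in ASCII_PRINTABLE or code in LATIN1_SUPPLEMENT


def disallowed(text):
    """Return the sorted list of distinct characters outside the inventory."""
    return sorted({ch for ch in text if not is_allowed(ch)})


sample = "a\u00f1o"              # Spanish word with n-tilde (U+00F1)
print(sample.encode("utf-8"))    # the accented letter takes two bytes in UTF-8
print(disallowed(sample))        # nothing falls outside the inventory
print(disallowed("\u20ac"))      # the euro sign is outside Latin-1
```
</pre>
        <p class="MsoNormal">Text read from the files would first be
          decoded as UTF-8, after which a check like this operates on
          code points and not on raw bytes.</p>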

        <br>

        <hr width="100%" size="2"><br>

      </div>

      <br>

      <pre class="moz-signature" cols="72">Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>


</pre>

    </div>


  </body>

</html>