<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p class="MsoNormal"><b>-  <a href="#scholar">Fall 2014 Data

          Scholarship Program</a>  -</b><o:p></o:p></p>

    <p class="MsoNormal"><i>New publications:</i><o:p></o:p></p>

    <p class="MsoNormal"><b>-  <a href="#lre">2009 NIST Language

          Recognition Evaluation Test Set</a>  -</b><o:p></o:p></p>

    <p class="MsoNormal"><b>-  <a href="#gale">GALE Arabic-English Word

          Alignment Training Part 3 -- Web</a>  -</b><o:p></o:p></p>

    <p class="MsoNormal"><b>-  <a href="#g2">GALE Phase 2 Chinese

          Newswire Parallel Text Part 1</a>  -</b></p>


    <div class="MsoNormal" style="text-align:center" align="center">

      <hr size="2" width="100%" align="center"> </div>

    <p class="MsoNormal"><a name="scholar"></a><b>Fall 2014 Data

        Scholarship Program</b><o:p></o:p></p>

    <p class="MsoNormal">Applications are now being accepted through

      Monday, September 15, 2014, 11:59PM EST for the Fall 2014 LDC Data

      Scholarship program! The LDC Data Scholarship program provides

      university students with access to LDC data at no cost.<br>

      <br>

      This program is open to students pursuing both undergraduate and

      graduate studies in an accredited college or university. LDC Data

      Scholarships are not restricted to any particular field of study;

      however, students must demonstrate a well-developed research

      agenda and a bona fide inability to pay. The selection process is

      highly competitive.<br>

      <br>

      The application consists of two parts:<br>

      <br>

      (1) Data Use Proposal. Applicants must submit a proposal

      describing their intended use of the data. The proposal should

      state which data the student plans to use, explain how the data

      will benefit their research project, and describe the proposed

      methodology or algorithm.<br>

      <br>

      Applicants should consult the <a

        href="https://catalog.ldc.upenn.edu/"><span style="color:blue">LDC

          Catalog</span></a> for a complete list of data distributed by

      LDC. Due to certain restrictions, a handful of LDC corpora are

      available only to members of the Consortium. Applicants are

      advised to select no more than two databases.<br>

      <br>

      (2) Letter of Support. Applicants must submit one letter of

      support from their thesis adviser or department chair. The letter

      must confirm that the department or university lacks the funding

      to pay the full non-member fee for the data and verify the

      student's need for data.<br>

      <br>

      For further information on application materials and program

      rules, please visit the <a

href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships"><span

          style="color:blue">LDC Data Scholarship</span></a> page.<br>

    </p>

    <p class="MsoNormal"><br>

      <o:p></o:p></p>

    <p class="MsoNormal"><b>New publications<br>

      </b></p>

    <p class="MsoNormal"><a name="lre"></a>(1)<a

        href="https://catalog.ldc.upenn.edu/LDC2014S06"><span

          style="color:blue"> 2009 NIST Language Recognition Evaluation

          Test Set</span></a> contains approximately 215 hours of

      conversational telephone speech and radio broadcast conversation

      collected by LDC in the following 23 languages and dialects:

      Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari,

      English (American), English (Indian), Farsi, French, Georgian,

      Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian,

      Spanish, Turkish, Ukrainian, Urdu and Vietnamese.<o:p></o:p></p>

    <p class="MsoNormal">The goal of the <a

        href="http://www.itl.nist.gov/iad/"><span style="color:blue">NIST

          (National Institute of Standards and Technology)</span></a> <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/"><span

          style="color:blue">Language Recognition Evaluation (LRE)</span></a>

      is to establish the baseline of current performance capability for

      language recognition of conversational telephone speech and to lay

      the groundwork for further research efforts in the field. NIST

      conducted language recognition evaluations in <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/1996/"><span

          style="color:blue">1996</span></a>, <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/2003/"><span

          style="color:blue">2003</span></a>, <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/2005/"><span

          style="color:blue">2005</span></a> and <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/2007/"><span

          style="color:blue">2007</span></a>. The <a

        href="http://www.itl.nist.gov/iad/mig/tests/lre/2009/"><span

          style="color:blue">2009</span></a> evaluation increased the

      number of target languages. Most of the test data originated from

      multilingual Voice of America (VOA) radio broadcasts assessed as

      being of telephone bandwidth; the remainder is conversational

      telephone speech. Further information regarding this evaluation

      can be found in the evaluation plan, which is included in the

      documentation for this release.<o:p></o:p></p>

    <p class="MsoNormal">LDC released the prior LREs as:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">2003 NIST Language Recognition Evaluation (<a

          href="https://catalog.ldc.upenn.edu/LDC2006S31"><span

            style="color:blue">LDC2006S31</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">2005 NIST Language Recognition Evaluation (<a

          href="https://catalog.ldc.upenn.edu/LDC2008S05"><span

            style="color:blue">LDC2008S05</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">2007 NIST Language Recognition Evaluation

        Test Set (<a href="https://catalog.ldc.upenn.edu/LDC2009S04"><span

            style="color:blue">LDC2009S04</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">2007 NIST Language Recognition Evaluation

        Supplemental Training Set (<a

          href="https://catalog.ldc.upenn.edu/LDC2009S05"><span

            style="color:blue">LDC2009S05</span></a>)<o:p></o:p></p>

    </blockquote>

    <p class="MsoNormal">The VOA speech data was collected by LDC in

      2000 and 2001 and constitutes approximately 75% of the test set.

      The telephone speech was taken from LDC's Mixer 3 collection

      recorded between 2005 and 2007.<o:p></o:p></p>

    <p class="MsoNormal">All test speech segments are presented as a

      sampled data stream in standard 8-bit 8-kHz μ-law format. Each

      segment is stored separately in a single-channel SPHERE format

      file. The test segments contain three nominal durations of speech:

      3 seconds, 10 seconds and 30 seconds. Actual speech durations

      vary, but were constrained to be within the ranges of 2-4 seconds,

      7-13 seconds and 23-35 seconds, respectively. <o:p></o:p></p>
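    <p class="MsoNormal">For readers who want to work with the samples directly, the following is a minimal Python sketch of decoding 8-bit μ-law codes to linear samples. It assumes the standard ITU-T G.711 μ-law companding and that the SPHERE header has already been removed from the file; the function names are hypothetical and are not part of this release.</p>

```python
BIAS = 132  # 0x84, the bias used by G.711 mu-law companding (assumed here)


def ulaw_to_linear(code):
    """Decode one 8-bit mu-law code (0-255) to a signed linear sample."""
    inverted = 255 - code            # mu-law codes are stored bit-inverted
    negative = inverted // 128 == 1  # top bit is the sign
    exponent = (inverted // 16) % 8  # next three bits select the segment
    mantissa = inverted % 16         # low four bits are the step within it
    magnitude = (mantissa * 8 + BIAS) * 2 ** exponent - BIAS
    return -magnitude if negative else magnitude


def decode_segment(payload):
    """Decode raw mu-law bytes (header already removed) to linear samples."""
    return [ulaw_to_linear(b) for b in payload]


def duration_seconds(payload, sample_rate=8000):
    """One byte per sample at 8 kHz, so duration is byte count / rate."""
    return len(payload) / sample_rate


# Hypothetical payload: 24000 bytes of mu-law silence (code 0xFF).
payload = bytes([0xFF]) * 24000
samples = decode_segment(payload)
print(len(samples), duration_seconds(payload))
```

    <p class="MsoNormal">Because each sample occupies one byte at 8 kHz, the length of the audio in a file follows directly from the payload size. Note that this is the length of the file's audio, which may differ from the amount of speech it contains.</p>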

    <br>


    <p class="MsoNormal" align="center">*<o:p></o:p></p>

    <p class="MsoNormal"><a name="gale"></a>(2) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T14"><span

          style="color:blue">GALE Arabic-English Word Alignment Training

          Part 3 -- Web</span></a> was developed by LDC and contains

      217,158 tokens of word-aligned Arabic and English parallel text

      enriched with linguistic tags. This material was used as training

      data in the DARPA GALE (Global Autonomous Language Exploitation)

      program. <o:p></o:p></p>

    <p class="MsoNormal">Some approaches to statistical machine

      translation incorporate linguistic knowledge into word-aligned

      text to improve automatic word alignment and machine translation

      quality. This is accomplished with two annotation schemes:

      alignment and tagging. Alignment identifies minimum translation

      units and translation relations by using minimum-match and

      attachment annotation approaches. The tagging scheme defines a

      set of word tags and alignment link tags to describe these

      translation units and relations. Tagging adds contextual,

      syntactic and language-specific features to the alignment

      annotation.<o:p></o:p></p>

    <p class="MsoNormal">Other releases available in this series are:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">GALE Chinese-English Word Alignment and

        Tagging Training Part 1 -- Newswire and Web (<a

          href="http://catalog.ldc.upenn.edu/LDC2012T16"><span

            style="color:blue">LDC2012T16</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Chinese-English Word Alignment and

        Tagging Training Part 2 -- Newswire (<a

          href="http://catalog.ldc.upenn.edu/LDC2012T20"><span

            style="color:blue">LDC2012T20</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Chinese-English Word Alignment and

        Tagging Training Part 3 -- Web (<a

          href="http://catalog.ldc.upenn.edu/LDC2012T24"><span

            style="color:blue">LDC2012T24</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Chinese-English Word Alignment and

        Tagging Training Part 4 -- Web (<a

          href="http://catalog.ldc.upenn.edu/LDC2013T05"><span

            style="color:blue">LDC2013T05</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Chinese-English Word Alignment and

        Tagging -- Broadcast Training Part 1 (<a

          href="http://catalog.ldc.upenn.edu/LDC2013T23"><span

            style="color:blue">LDC2013T23</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Arabic-English Word Alignment Training

        Part 1 -- Newswire and Web (<a

          href="http://catalog.ldc.upenn.edu/LDC2014T05"><span

            style="color:blue">LDC2014T05</span></a>)<o:p></o:p></p>

      <p class="MsoNormal">GALE Arabic-English Word Alignment Training

        Part 2 -- Newswire (<a

          href="http://catalog.ldc.upenn.edu/LDC2014T10"><span

            style="color:blue">LDC2014T10</span></a>)<o:p></o:p></p>

    </blockquote>

    <p class="MsoNormal">This release consists of Arabic source web data

      collected by LDC. The distribution by genre, files, words,

      character tokens and segments appears below:<o:p></o:p></p>

    <table class="MsoNormalTable" style="mso-cellspacing:1.5pt;

      mso-yfti-tbllook:1184" border="1" cellpadding="0">

      <tbody>

        <tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Language<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Genre<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Files<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Words<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">CharTokens<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Segments<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Arabic<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">WB<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">2,449<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">154,144<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">217,158<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">7,332<o:p></o:p></p>

          </td>

        </tr>

      </tbody>

    </table>

    <p class="MsoNormal">Note that word count is based on the

      untokenized Arabic source, and token count is based on the

      tokenized Arabic source.<o:p></o:p></p>

    <p class="MsoNormal">The Arabic word alignment tasks consisted of

      the following components:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">Normalizing tokens as needed<o:p></o:p></p>

      <p class="MsoNormal">Identifying different types of links<o:p></o:p></p>

      <p class="MsoNormal">Identifying sentence segments not suitable

        for annotation<o:p></o:p></p>

      <p class="MsoNormal">Tagging unmatched words attached to other

        words or phrases<o:p></o:p></p>

    </blockquote>

    <br>


    <p class="MsoNormal" align="center">*<o:p></o:p></p>

    <p class="MsoNormal"><a name="g2"></a>(3) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T15"><span

          style="color:blue">GALE Phase 2 Chinese Newswire Parallel Text

          Part 1</span></a> was developed by LDC. Along with other

      corpora, the parallel text in this release comprised training data

      for Phase 2 of the DARPA GALE (Global Autonomous Language

      Exploitation) Program. This corpus contains 117,173 tokens of

      Chinese source text and corresponding English translations

      selected from newswire data collected by LDC in 2007 and

      transcribed by LDC or under its direction.<o:p></o:p></p>

    <p class="MsoNormal">This release includes 167 source-translation

      document pairs, comprising 117,173 tokens of translated data. Data

      is drawn from four distinct Chinese newswire sources: China News

      Service, Guangming Daily, People's Daily and People's Liberation

      Army Daily.<o:p></o:p></p>

    <p class="MsoNormal">The data was transcribed by LDC staff and/or

      transcription vendors under contract to LDC in accordance with

      Quick Rich Transcription guidelines developed by LDC. Transcribers

      indicated sentence boundaries in addition to transcribing the

      text. Data was manually selected for translation according to

      several criteria, including linguistic features, transcription

      features and topic features. The transcribed and segmented files

      were then reformatted into a human-readable translation format and

      assigned to translation vendors. Translators followed LDC's

      Chinese to English translation guidelines. Bilingual LDC staff

      performed quality control procedures on the completed

      translations.<o:p></o:p></p>

    <p class="MsoNormal">Source data and translations are distributed in

      TDF format. TDF files are tab-delimited files containing one

      segment of text along with meta information about that segment.

      Each field in the TDF file is described in TDF_format.text. All

      data are encoded in UTF-8.<o:p></o:p></p>
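    <p class="MsoNormal">As an illustration of reading such files, here is a minimal Python sketch that splits a UTF-8, tab-delimited stream into fields. It assumes one segment per line; the sample rows and the function name are hypothetical, and the actual columns are those documented in TDF_format.text.</p>

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical rows; real column order and meaning come from TDF_format.text.
sample = io.StringIO(
    'doc_001\t0\tFirst "quoted" segment.\n'
    "\n"
    "doc_001\t1\tSecond segment.\n"
)
rows = list(read_tab_delimited(sample))
print(len(rows), rows[0])
```

    <p class="MsoNormal">For a file on disk, the same function can be passed a handle opened with <code>open(path, encoding="utf-8", newline="")</code>.</p>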

    <br>

    <br>

    <hr size="2" width="100%"><br>

    <pre class="moz-signature" cols="72">-- 


Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>


  </body>

</html>