<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p class="MsoNormal"><a href="#scholar"><b>Spring 2014 LDC Data

          Scholarship Recipients</b><b><o:p></o:p></b></a></p>


    <p class="MsoNormal"><a href="#member"><b>2014 Publications Pipeline</b><b><o:p></o:p></b></a></p>


    <p class="MsoNormal"><i>New publications:</i><o:p></o:p></p>

    <p class="MsoNormal"><a href="#gale"><b>GALE Arabic-English Parallel

          Aligned Treebank -- Broadcast News Part 2</b><b><o:p></o:p></b></a></p>


    <p class="MsoNormal"><a href="#saud"><b>King Saud University Arabic

          Speech Database</b><b><o:p></o:p></b></a></p>


    <p class="MsoNormal"><a href="#openmt"><b>NIST 2012 Open Machine

          Translation (OpenMT) Progress Test Five Language Source</b></a></p>


    <div class="MsoNormal" style="text-align:center" align="center">

      <hr size="2" width="100%" align="center"></div>


    <p class="MsoNormal"><a name="scholar"></a><b>Spring 2014 LDC Data

        Scholarship Recipients</b><o:p></o:p></p>

    <p class="MsoNormal">LDC is pleased to announce the student

      recipients of the Spring 2014 <a

href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships">LDC

Data

        Scholarship program</a>!  This program provides university

      students with access to LDC data at no cost. Students were asked

      to complete an application which consisted of a proposal

      describing their intended use of the data, as well as a letter of

      support from their thesis adviser. We received many solid

      applications and have chosen two proposals to support.   The

      following students will receive no-cost copies of LDC data:<o:p></o:p></p>

    <ul>

      <li>Skye Anderson ~ Tulane University (USA), BA candidate,

        Linguistics.  Skye has been awarded a copy of LDC Standard

        Arabic Morphological Analyzer (SAMA) Version 3.1 for her work in

        author profiling.<br>

        <br>

        <o:p></o:p></li>

      <li>Hao Liu ~ University College London (UK), PhD candidate,

        Speech, Hearing and Phonetic Sciences.  Hao has been awarded a

        copy of Switchboard-1 Release 2 and NXT Switchboard Annotations

        for his work in prosody modeling.<br>

      </li>

    </ul>

    <p class="MsoNormal"><br>

      <o:p></o:p></p>

    <p class="MsoNormal"><a name="member"></a><b>2014 Publications Pipeline

      </b><o:p></o:p></p>

    <p class="MsoNormal">LDC's planned publications for this year

      will include:<o:p></o:p></p>

    <ul>

      <li>2009 NIST Language Recognition Evaluation ~  development data

        from VOA broadcast and CTS telephone speech in target and

        non-target languages. <br>

        <br>

      </li>

      <li>ETS Corpus of Non-Native Written English ~ contains 1100

        essays written for a college-entrance test sampled from eight

        prompts (i.e., topics) with

        score levels (low/medium/high) for each essay. <br>

        <br>

      </li>

      <li>GALE data ~ including Word Alignment, Broadcast Speech &amp;

        Transcripts, Parallel Text, and Parallel Aligned Treebanks in

        Arabic, Chinese, and English.<br>

        <br>

        <o:p></o:p></li>

      <li>Hispanic Accented English ~ contains approximately 30 hours of

        spontaneous speech and read utterances from non-native speakers

        of English with corresponding transcripts.<br>

        <br>

      </li>

      <li>Multi-Channel Wall Street Journal Audio-Visual Corpus

        (MC-WSJ-AV) ~  re-recording of parts of the WSJCAM0 using a

        number of microphones as well as three recording conditions

        resulting in 18-20 channels of audio per recording.<br>

        <br>

      </li>

      <li>TAC KBP Reference Knowledge Base ~  TAC KBP aims to develop and

        evaluate technologies for building and populating knowledge

        bases (KBs) about named entities from unstructured text.  KBP

        systems must either populate an existing reference KB, or else

        build a KB from scratch. The reference KB is based on a

        snapshot of English Wikipedia from October 2008 and

        contains a set of entities, each with a canonical name and title

        for the Wikipedia page, an entity type, an automatically parsed

        version of the data from the infobox in the entity's Wikipedia

        article, and a stripped version of the text of the Wikipedia article.

        <br>

        <br>

      </li>

      <li>USC-SFI MALACH Interviews and Transcripts Czech ~ developed by

        The University of Southern California's Shoah Foundation

        Institute (USC-SFI) and the University of West Bohemia as part

        of the MALACH (Multilingual Access to Large Spoken ArCHives)

        Project. It contains approximately 143 hours of interviews from

        420 interviewees along with transcripts and other documentation.

        <br>

      </li>

    </ul>

    Visit LDC's <a

      href="https://www.ldc.upenn.edu/language-resources/data/obtaining">Obtaining

      Data</a> page for information on membership and data licensing.<br>

    <p class="MsoNormal"><br>

      <o:p></o:p></p>

    <p class="MsoNormal"><b>New publications<br>

      </b></p>

    <p class="MsoNormal"><a name="gale"></a>(1) <a

        href="http://catalog.ldc.upenn.edu/LDC2014T03">GALE

        Arabic-English Parallel Aligned Treebank -- Broadcast News Part

        2</a> was developed by LDC and contains 141,058 tokens of

      word-aligned Arabic and English parallel text with treebank

      annotations. This material was used as training data in the DARPA

      GALE (Global Autonomous Language Exploitation) program.<o:p></o:p></p>

    <p class="MsoNormal">Parallel aligned treebanks are treebanks

      annotated with morphological and syntactic structures aligned at

      the sentence level and the sub-sentence level. Such data sets are

      useful for natural language processing and related fields,

      including automatic word alignment system training and evaluation,

      transfer-rule extraction, word sense disambiguation, translation

      lexicon extraction, and cultural heritage and cross-linguistic

      studies. With respect to machine translation system development,

      parallel aligned treebanks may improve system performance with

      enhanced syntactic parsers, better rules and knowledge about

      language pairs, and reduced word error rate.<o:p></o:p></p>

    <p class="MsoNormal">In this release, the source Arabic data was

      translated into English. Arabic and English treebank annotations

      were performed independently. The parallel texts were then word

      aligned. The material in this corpus corresponds to a portion of

      the Arabic treebanked data in Arabic Treebank - Broadcast News

      v1.0 (<a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07">LDC2012T07</a>).<o:p></o:p></p>

    <p class="MsoNormal">The source data consists of Arabic broadcast

      news programming collected by LDC in 2007 and 2008. All data is

      encoded as UTF-8. A count of files, words, tokens and segments is

      below.<o:p></o:p></p>

    <table class="MsoNormalTable" style="mso-cellspacing:1.5pt;

      mso-yfti-tbllook:1184" border="1" cellpadding="0">

      <tbody>

        <tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Language<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Files<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Words<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Tokens<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Segments<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Arabic<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">31<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">110,690<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">141,058<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">7,102<o:p></o:p></p>

          </td>

        </tr>

      </tbody>

    </table>

    <p class="MsoNormal">The purpose of the GALE word alignment task was

      to find correspondences between words, phrases or groups of words

      in a set of parallel texts. Arabic-English word alignment

      annotation consisted of the following tasks:<o:p></o:p></p>

    <ul>

      <li>Identifying different types of links: translated (correct or

        incorrect) and not translated (correct or incorrect)<br>

      </li>

      <li>Identifying sentence segments not suitable for annotation,

        e.g., blank segments, incorrectly-segmented segments, segments

        with foreign languages<br>

      </li>

      <li>Tagging unmatched words attached to other words or phrases<o:p></o:p></li>

    </ul>

    <br>

    <p class="MsoNormal"> <a name="saud"></a>(2) <a

        href="http://catalog.ldc.upenn.edu/LDC2014S02">King Saud

        University Arabic Speech Database</a> was developed by <a

        href="http://ksu.edu.sa/en/">King Saud University</a> and

      contains 590 hours of recorded Arabic speech from male and female

      speakers. The utterances include read and spontaneous speech. The

      recordings were conducted in varied environments representing

      quiet and noisy settings. <br>

    </p>

    <p class="MsoNormal">The corpus was designed principally for speaker

      recognition research. The speech sources are sentences, word

      lists, prose and question and answer sessions. Read speech text

      includes the following:</p>

    <ul>

      <li>Sets of sentences devised to cover allophones of each phoneme,

        phonetic balance, and differentiation of accents.</li>

      <li>Word lists developed to minimize missing phonemes and to

        represent nasals, fricatives, commonly used words, and numbers.</li>

      <li>Two paragraphs, one from the Quran and another from a book,

        selected because they included all letters of the alphabet and

        were easy to read.<br>

      </li>

    </ul>

    <p class="MsoNormal">Spontaneous speech was captured through

      question and answer sessions between participants and project team

      members. Speakers responded to questions on general topics such as

      the weather and food.<br>

    </p>

    <p class="MsoNormal">Each speaker was recorded in three different

      environments: a soundproof room, an office, and a cafeteria. The

      recordings were collected via microphone and mobile phone and

      averaged between 16 and 19 minutes. The data was verified for missing

      recordings, problems with the recording system or errors in the

      recording process.<br>

    </p>

    <br>

    <p class="MsoNormal"><o:p></o:p></p>

    <p class="MsoNormal"><a name="openmt"></a>(3) <a

        href="http://catalog.ldc.upenn.edu/LDC2014T02">NIST 2012 Open

        Machine Translation (OpenMT) Progress Test Five Language Source</a>

      was developed by <a href="http://nist.gov/itl/iad/mig/">NIST

        Multimodal Information Group</a>. This release contains the

      evaluation sets (source data and human reference translations),

      DTD, scoring software, and evaluation plan for the OpenMT 2012

      test for Arabic, Chinese, Dari, Farsi, and Korean to English on a

      parallel data set. The set is based on a subset of the

      Arabic-to-English and Chinese-to-English progress tests from the

      OpenMT 2008, 2009 and 2012 evaluations with new source data

      created by humans based on the English reference translation. The

      package was compiled, and scoring software was developed, at NIST,

      making use of newswire and web data and reference translations

      developed by the Linguistic Data Consortium and the <a

        href="http://www.dliflc.edu/">Defense Language Institute Foreign

        Language Center</a>.<o:p></o:p></p>

    <p class="MsoNormal">The objective of the OpenMT evaluation series

      is to support research in, and help advance the state of the art

      of, machine translation (MT) technologies -- technologies that

      translate text between human languages. Input may include all

      forms of text. The goal is for the output to be an adequate and

      fluent translation of the original. The 2012 task included the

      evaluation of five language pairs: Arabic-to-English,

      Chinese-to-English, Dari-to-English, Farsi-to-English and

      Korean-to-English in two source data styles. For general

      information about the NIST OpenMT evaluations, refer to the <a

        href="http://www.nist.gov/itl/iad/mig/openmt.cfm">NIST OpenMT

        website</a>.<o:p></o:p></p>

    <p class="MsoNormal">This evaluation kit includes a single Perl

      script (mteval-v13a.pl) that may be used to produce a translation

      quality score for one (or more) MT systems. The script works by

      comparing the system output translation with a set of (expert)

      reference translations of the same source text. Comparison is

      based on finding sequences of words in the reference translations

      that match word sequences in the system output translation.<o:p></o:p></p>
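    <p class="MsoNormal">The word-sequence matching described above can be illustrated with a minimal Python sketch. It is not the mteval-v13a.pl implementation and does not reproduce that script's tokenization or score formulas; the function names and the example sentences are hypothetical, chosen only to show how n-grams in a system output are matched against a set of reference translations.</p>

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-token sequences in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(system, references, n):
    """Fraction of system n-grams that also occur in some reference.

    Each system n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent (clipping).
    Tokenization here is plain whitespace splitting, an assumption
    made for illustration only.
    """
    sys_counts = ngrams(system.split(), n)
    total = sum(sys_counts.values())
    if total == 0:
        return 0.0
    matched = 0
    for gram, count in sys_counts.items():
        best_ref = max(ngrams(ref.split(), n)[gram] for ref in references)
        matched += min(count, best_ref)
    return matched / total


# Hypothetical example data, not taken from the evaluation set.
system_output = "the cat sat on a mat"
reference_set = ["the cat is on the mat", "a cat sat on the mat"]

unigram_score = clipped_precision(system_output, reference_set, 1)
bigram_score = clipped_precision(system_output, reference_set, 2)
```

    <p class="MsoNormal">In this example every word of the system output appears in at least one reference, but only three of its five two-word sequences do, so longer sequences give a stricter measure of agreement.</p>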

    <p class="MsoNormal">This release consists of 20 files, four for

      each of the five languages, presented in XML with an included DTD.

      The four files are source and reference data in the following two

      styles:<o:p></o:p></p>

    <ul>

      <li>English-true: an English-oriented translation, which requires

        that the text read well and not use any idiomatic expressions in

        the foreign language to convey meaning, unless absolutely

        necessary.<br>

      </li>

      <li>Foreign-true: a translation as close as possible to the

        foreign language, as if the text had originated in that

        language.<o:p></o:p></li>

    </ul>
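    <p class="MsoNormal">The file count follows from the combinations described above: five languages, each with source and reference data in two styles. A short Python sketch of that arithmetic is below; it enumerates the combinations only and makes no assumption about how the files in the release are actually named or organized.</p>

```python
from itertools import product

# Labels come from the release description; actual file names are not assumed.
languages = ["Arabic", "Chinese", "Dari", "Farsi", "Korean"]
roles = ["source", "reference"]
styles = ["English-true", "Foreign-true"]

# One entry per (language, role, style) combination.
layout = list(product(languages, roles, styles))
files_per_language = len(roles) * len(styles)

for language, role, style in layout:
    print(language, role, style)
```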

    <br>

    <hr class="msocomoff" size="1" width="33%" align="left">

    <hr size="2" width="100%">


    <pre class="moz-signature" cols="72">-- 

--

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a></pre>


  </body>

</html>