<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal" align="center"><b><a href="#scholar">Spring
2013 LDC Data Scholarship Recipients</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"><i>New publications:</i><o:p></o:p></p>
<p class="MsoNormal" align="center"> <b><a href="#gale1">GALE Phase
2 Arabic Broadcast Conversation Speech Part 1</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"> <b><a href="#gale2">GALE Phase
2 Arabic Broadcast Conversation Transcripts - Part 1</a></b><b><o:p></o:p></b></p>
<p class="MsoNormal" align="center"><b> </b><b><a href="#mt">NIST
2012 Open Machine Translation (OpenMT) Evaluation</a></b></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal" align="center"><a name="scholar"></a><b>Spring
2013 LDC Data Scholarship Recipients</b><o:p></o:p></p>
<p class="MsoNormal">LDC is pleased to announce the student
recipients of the Spring 2013 LDC Data Scholarship program! This
program provides university students with access to LDC data at
no cost. Students were asked to complete an application which
consisted of a proposal describing their intended use of the data,
as well as a letter of support from their thesis adviser. We
received many solid applications and have chosen three proposals
to support. The following students will receive no-cost copies
of LDC data: <o:p></o:p></p>
<blockquote>
<p class="MsoNormal">Salima Harrat - Ecole Supérieure
d’informatique (ESI) (Algeria). Salima has been awarded a copy
of <i>Arabic Treebank: Part 3</i> for her work in
diacritization restoration.<br>
<br>
Maulik C. Madhavi - Dhirubhai Ambani Institute of Information
and Communication Technology (DA-IICT), Gandhinagar (India).
Maulik has been awarded a copy of <i>Switchboard Cellular Part
1 Transcribed Audio and Transcripts</i> and <i>1997 HUB4
English Evaluation Speech and Transcripts</i> for his work in
spoken term detection.<br>
<br>
Shereen M. Oraby - Arab Academy for Science, Technology, and
Maritime Transport (Egypt). Shereen has been awarded a copy of
<i>Arabic Treebank: Part 1</i> for her work in subjectivity and
sentiment analysis. <o:p></o:p></p>
</blockquote>
<p class="MsoNormal">Please join us in congratulating our student
recipients! The next LDC Data Scholarship program is scheduled
for the Fall 2013 semester. <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div align="center"><b>New publications</b><o:p></o:p></div>
<p class="MsoNormal"><a name="gale1"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S02">GALE
Phase
2 Arabic Broadcast Conversation Speech Part 1</a> was developed
by LDC and comprises approximately 123 hours of Arabic
broadcast conversation speech collected in 2006 and 2007 by LDC as
part of the DARPA GALE (Global Autonomous Language Exploitation)
Program. Broadcast audio for the DARPA GALE program was collected
at LDC’s Philadelphia, PA USA facilities and at three remote
collection sites. The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours per
week of programming from more than 50 broadcast sources for a
total of over 30,000 hours of collected broadcast audio over the
life of the program.<o:p></o:p></p>
<p class="MsoNormal">LDC's local broadcast collection system is
highly automated, easily extensible and robust, and is capable of
collecting, processing and evaluating hundreds of hours of content
from several dozen sources per day. The broadcast material is
served to the system by a set of free-to-air (FTA) satellite
receivers, commercial direct satellite systems (DSS) such as
DirecTV, direct broadcast satellite (DBS) receivers, and cable
television (CATV) feeds. The mapping between receivers and
recorders is dynamic and modular; all signal routing is performed
under computer control, using a 256x64 A/V matrix switch. Programs
are recorded in a high bandwidth A/V format and are then processed
to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the
case of North American English) and to generate automatic speech
recognition (ASR) output. <o:p></o:p></p>
<p class="MsoNormal">The broadcast conversation recordings in this
release feature interviews, call-in programs and round table
discussions focusing principally on current events from several
sources. This release contains 143 audio files presented as
16000 Hz, single-channel, 16-bit PCM .wav files. Each file was audited by a
native Arabic speaker following Audit Procedure Specification
Version 2.0, which is included in this release. The broadcast
auditing process served three principal goals: as a check on the
operation of LDC's broadcast collection system equipment by
identifying failed, incomplete or faulty recordings; as an
indicator of broadcast schedule changes by identifying instances
when the incorrect program was recorded; and as a guide for data
selection by retaining information about a program's genre, data
type and topic.<br>
<br>
<o:p></o:p></p>
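<p class="MsoNormal">As a brief illustration of the audio format described
above, the following Python sketch checks that a file matches the stated
16000 Hz, single-channel, 16-bit PCM .wav layout; the file name is
hypothetical and not part of the release.</p>
<pre>
import wave

def check_gale_wav(path):
    """Verify the stated audio format and return the duration in seconds."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000   # 16 kHz sample rate
        assert w.getnchannels() == 1       # single channel
        assert w.getsampwidth() == 2       # 16-bit PCM = 2 bytes per sample
        return w.getnframes() / float(w.getframerate())

print(check_gale_wav("ARABIC_BC_EXAMPLE.wav"))  # hypothetical file name
</pre>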
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="gale2"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T04">GALE
Phase
2 Arabic Broadcast Conversation Transcripts - Part 1</a> was
developed by LDC and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2006
and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco)
during Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) program. The source broadcast conversation
recordings feature interviews, call-in programs and round table
discussions focusing principally on current events from several
sources.<o:p></o:p></p>
<p class="MsoNormal">The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 752,747 tokens. The transcripts were
created with the LDC-developed transcription tool, <a
href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">XTrans</a>,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings. <o:p></o:p></p>
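<p class="MsoNormal">For illustration only, here is a minimal Python sketch
of reading a UTF-8 tab-delimited (TDF) transcript and counting
whitespace-separated tokens. The file name and the index of the transcript
column are assumptions; consult the documentation in the release for the
actual column layout.</p>
<pre>
import csv

def count_tokens(tdf_path, text_column=7):
    """Sum whitespace-separated tokens in one tab-delimited column."""
    total = 0
    with open(tdf_path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip comment/header lines (if present) and short rows.
            if row and not row[0].startswith(";;") and len(row) > text_column:
                total += len(row[text_column].split())
    return total

print(count_tokens("example_transcript.tdf"))  # hypothetical file name
</pre>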
<p class="MsoNormal">The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR as part of the filename were
developed using QTR transcription. Files with QRTR in the filename
indicate QRTR transcription.<o:p></o:p></p>
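<p class="MsoNormal">A small sketch of that naming convention, using a
hypothetical directory and file extension, grouping transcript files by
whether QTR or QRTR appears in the file name.</p>
<pre>
import glob

qtr_files, qrtr_files = [], []
for path in glob.glob("transcripts/*.tdf"):   # hypothetical location
    name = path.upper()
    if "QRTR" in name:
        qrtr_files.append(path)   # quick rich transcription (QRTR)
    elif "QTR" in name:
        qtr_files.append(path)    # quick transcription (QTR)
</pre>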
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="mt"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T03">NIST
2012 Open Machine Translation (OpenMT) Evaluation</a> was
developed by the <a href="http://nist.gov/itl/iad/mig/">NIST
Multimodal Information Group</a>. This release contains source
data, reference translations and scoring software used in the NIST
2012 OpenMT evaluation, specifically, for the Chinese-to-English
language pair track. The package was compiled and scoring software
was developed at NIST, making use of Chinese newswire and web data
and reference translations collected and developed by LDC. The
objective of the OpenMT evaluation series is to support research
in, and help advance the state of the art of, machine translation
(MT) technologies -- technologies that translate text between
human languages. Input may include all forms of text. The goal is
for the output to be an adequate and fluent translation of the
original. <o:p></o:p></p>
<p class="MsoNormal">The 2012 task was to evaluate five language
pairs: Arabic-to-English, Chinese-to-English, Dari-to-English,
Farsi-to-English and Korean-to-English. This release consists of
the material used in the Chinese-to-English language pair track.
For more general information about the NIST OpenMT evaluations,
please refer to the <a
href="http://www.nist.gov/itl/iad/mig/openmt.cfm">NIST OpenMT
website</a>.<o:p></o:p></p>
<p class="MsoNormal">This evaluation kit includes a single Perl
script (mteval-v13a.pl) that may be used to produce a translation
quality score for one (or more) MT systems. The script works by
comparing the system output translation with a set of (expert)
reference translations of the same source text. Comparison is
based on finding sequences of words in the reference translations
that match word sequences in the system output translation.<o:p></o:p></p>
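<p class="MsoNormal">The sketch below is only a toy illustration of that
matching idea (counting system-output word sequences that also occur in a
reference translation); it is not a substitute for the included
mteval-v13a.pl script, which performs the actual scoring.</p>
<pre>
from collections import Counter

def ngram_matches(system, reference, n=2):
    """Count system n-grams that also appear in the reference (clipped)."""
    sys_tokens, ref_tokens = system.split(), reference.split()
    sys_ngrams = Counter(zip(*[sys_tokens[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    matched = sum(min(c, ref_ngrams[g]) for g, c in sys_ngrams.items())
    return matched, max(sum(sys_ngrams.values()), 1)

m, total = ngram_matches("the cat sat on the mat", "the cat is on the mat")
print(m / float(total))   # fraction of system bigrams found in the reference
</pre>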
<p class="MsoNormal">This release contains 222 documents with
corresponding source and reference files, the latter of which
contain four independent human reference translations of the
source data. The source data consists of Chinese newswire and
web data collected by LDC in 2011. A portion of the web data
concerned the topic of food and was treated as a restricted
domain. The table below displays statistics by source, genre,
documents, segments and source tokens.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="0" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Source</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Genre</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Documents</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Segments</b><b><o:p></o:p></b></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"><b>Source Tokens</b><b><o:p></o:p></b></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese General<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Newswire<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">45<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">400<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">18184<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese General<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">28<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">420<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">15181<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Chinese Restricted Domain<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Web Data<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">149<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2184<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">48422<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">The token counts for the Chinese data are
"character" counts, obtained by counting tokens that match the
Unicode-based regular expression "\w" using the Python "re"
module.<o:p></o:p></p>
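<p class="MsoNormal">For example, the count described above can be
reproduced in outline as follows; the sample string is illustrative only
(the escapes spell a four-character Chinese word), and each CJK character,
Latin letter and digit matched by \w counts as one token.</p>
<pre>
import re

def count_source_tokens(text):
    """Count Unicode word characters, one token per match of \\w."""
    return len(re.findall(r"\w", text, flags=re.UNICODE))

# "\u673a\u5668\u7ffb\u8bd1" is a four-character Chinese word, so the call
# below returns 4 + 10 + 4 = 18 tokens for the sample string.
print(count_source_tokens(u"\u673a\u5668\u7ffb\u8bd1 evaluation 2012"))
</pre>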
<p class="MsoNormal"><br>
</p>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
<br>
</body>
</html>