<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p class="MsoNormal" align="center"><i>New publications:</i><br>
      <br>
      <b>- <a href="#speech">GALE Phase 2 Chinese Broadcast
          Conversation Speech</a> -</b><br>
      <br>
      <b>- <a href="#transcripts">GALE Phase 2 Chinese
          Broadcast Conversation Transcripts</a> -</b><br>
      <br>
      <b>- <a href="#openmt">NIST 2008-2012 Open Machine
          Translation (OpenMT) Progress Test Sets</a> -</b><br>
      <br>
    </p>

    <div class="MsoNormal" style="text-align:center" align="center">

      <hr size="2" width="100%" align="center"> </div>

    <br>

    <p class="MsoNormal" align="center"><b>New publications</b><br>

    </p>

    <p class="MsoNormal"><br>

      <o:p></o:p></p>

    <p class="MsoNormal"><a name="speech"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013S04">GALE

Phase

        2 Chinese Broadcast Conversation Speech</a> (LDC2013S04) was

      developed by LDC and comprises approximately 120 hours of
      Chinese broadcast conversation speech collected in 2006 and 2007
      by LDC and the Hong Kong University of Science and Technology
      (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global
      Autonomous Language Exploitation) Program.</p>

    <p class="MsoNormal">Corresponding transcripts are released as GALE

      Phase 2 Chinese Broadcast Conversation Transcripts (<a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T08">LDC2013T08</a>).<o:p></o:p></p>

    <p class="MsoNormal">Broadcast audio for the GALE program was

      collected at the Philadelphia, PA USA facilities of LDC and at

      three remote collection sites: HKUST (Chinese); Medianet, Tunis,
      Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined

      local and outsourced broadcast collection supported GALE at a rate

      of approximately 300 hours per week of programming from more than

      50 broadcast sources for a total of over 30,000 hours of collected

      broadcast audio over the life of the program.<o:p></o:p></p>

    <p class="MsoNormal">The broadcast conversation recordings in this

      release feature interviews, call-in programs and roundtable

      discussions focusing principally on current events from the

      following sources: Anhui TV, a regional television station in

      Mainland China, Anhui Province; China Central TV (CCTV), a

      national and international broadcaster in Mainland China; Hubei

      TV, a regional broadcaster in Mainland China, Hubei Province; and

      Phoenix TV, a Hong Kong-based satellite television station. A

      table showing the number of programs and hours recorded from each

      source is contained in the readme file. <o:p></o:p></p>

    <p class="MsoNormal">This release contains 202 audio files presented

      in Waveform Audio File format (.wav), 16000 Hz single-channel

      16-bit PCM. Each file was audited by a native Chinese speaker

      following Audit Procedure Specification Version 2.0, which is

      included in this release. The broadcast auditing process served

      three principal goals: as a check on the operation of the

      broadcast collection system equipment by identifying failed,

      incomplete or faulty recordings; as an indicator of broadcast

      schedule changes by identifying instances when the incorrect

      program was recorded; and as a guide for data selection by

      retaining information about the genre, data type and topic of a

      program. <o:p></o:p></p>
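    <p class="MsoNormal">The stated audio format can be checked against
      a file's WAV header. The following is a minimal Python sketch,
      assuming the standard-library wave module can read the files. The
      helper names (matches_release_format, write_silence) and the file
      names are illustrative and are not part of the release; the demo
      writes short silent files as stand-in data.</p>

```python
import os
import tempfile
import wave

# Format stated in the release description: 16000 Hz, single channel, 16-bit PCM.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def matches_release_format(path):
    """Return True if the WAV header at 'path' matches the stated format."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def write_silence(path, rate_hz, channels, seconds=1):
    """Write a silent 16-bit PCM WAV file (stand-in data for the demo only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * channels * seconds)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.wav")  # hypothetical file name
        bad = os.path.join(tmp, "bad.wav")    # hypothetical file name
        write_silence(good, 16000, 1)
        write_silence(bad, 44100, 2)
        print(matches_release_format(good))
        print(matches_release_format(bad))
```

    <p class="MsoNormal">The check reads only the header, so it does not
      replace the manual audit described above.</p>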

    <p class="MsoNormal" align="center">*<o:p></o:p></p>

    <p class="MsoNormal"><br>

      <a name="transcripts"></a>(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T08">GALE

Phase

        2 Chinese Broadcast Conversation Transcripts</a> (LDC2013T08)

      was developed by LDC and contains transcriptions of approximately

      120 hours of Chinese broadcast conversation speech collected in

      2006 and 2007 by LDC and the Hong Kong University of Science and
      Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global

      Autonomous Language Exploitation) Program. <o:p></o:p></p>

    <p class="MsoNormal">Corresponding audio data is released as GALE

      Phase 2 Chinese Broadcast Conversation Speech (<a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S04">LDC2013S04</a>).<o:p></o:p></p>

    <p class="MsoNormal">The source broadcast conversation recordings

      feature interviews, call-in programs and roundtable discussions

      focusing principally on current events from the following sources:

      Anhui TV, a regional television station in Mainland China, Anhui

      Province; China Central TV (CCTV), a national and international

      broadcaster in Mainland China; Hubei TV, a regional broadcaster in

      Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based

      satellite television station.<o:p></o:p></p>

    <p class="MsoNormal">The transcript files are in plain-text,

      tab-delimited format (TDF) with UTF-8 encoding, and the

      transcribed data totals 1,523,373 tokens. The transcripts were

      created with the LDC-developed transcription tool, <a

        href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">XTrans</a>,

      a multi-platform, multilingual, multi-channel transcription tool

      that supports manual transcription and annotation of audio

      recordings. <o:p></o:p></p>
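    <p class="MsoNormal">Because the transcripts are plain-text,
      tab-delimited and UTF-8 encoded, they can be read with generic
      tools. The following is a minimal Python sketch using the
      standard-library csv module. The function name read_tab_delimited
      and the sample rows are invented for illustration; the actual
      column layout and any header or comment lines are defined in the
      corpus documentation and are not assumed here.</p>

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks in transcript text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


if __name__ == "__main__":
    # Invented sample; these are not real corpus rows or real column names.
    sample = "program_a\t0.00\t2.50\tfirst segment\nprogram_a\t2.50\t4.10\tsecond segment\n"
    # For a file on disk, open it with encoding="utf-8" and newline="".
    for fields in read_tab_delimited(io.StringIO(sample)):
        print(len(fields), fields[-1])
```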

    <p class="MsoNormal">The files in this corpus were transcribed by

      LDC staff and/or by transcription vendors under contract to LDC.

      Transcribers followed LDC’s quick transcription guidelines (QTR)

      and quick rich transcription specification (QRTR), both of which

      are included in the documentation with this release. QTR

      transcription consists of quick (near-)verbatim, time-aligned

      transcripts plus speaker identification with minimal additional

      mark-up. It does not include sentence unit annotation. QRTR

      annotation adds structural information such as topic boundaries

      and manual sentence unit annotation to the core components of a

      quick transcript. The filename identifies the specification used:
      names containing QTR indicate QTR transcription, and names containing QRTR

      indicate QRTR transcription.<o:p></o:p></p>
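    <p class="MsoNormal">The filename convention lends itself to a
      simple check. The following is a minimal Python sketch, assuming
      the marker appears somewhere in the filename and that case does
      not matter; the function name transcription_style and the example
      filenames are hypothetical.</p>

```python
def transcription_style(filename):
    """Return 'QRTR', 'QTR', or None based on the marker in a file name."""
    upper = filename.upper()
    # The two markers are tested independently; neither contains the other.
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return None


if __name__ == "__main__":
    # Hypothetical file names, not taken from the corpus.
    for name in ["example_program.qtr.tdf", "example_program.qrtr.tdf", "readme.txt"]:
        print(name, transcription_style(name))
```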

    <p class="MsoNormal"><br>

      <o:p></o:p></p>

    <p class="MsoNormal" align="center">*<o:p></o:p></p>

    <p class="MsoNormal"><a name="openmt"></a>(3) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T07">NIST

2008-2012

        Open Machine Translation (OpenMT) Progress Test Sets</a>

      (LDC2013T07) was developed by <a

        href="http://nist.gov/itl/iad/mig/">NIST Multimodal Information

        Group</a>. This release contains the evaluation sets (source

      data and human reference translations), DTD, scoring software, and

      evaluation plans for the Arabic-to-English and Chinese-to-English

      progress test sets for the NIST OpenMT 2008, 2009, and 2012

      evaluations. The test data remained unseen between evaluations and

      was reused unchanged each time. The package was compiled, and

      scoring software was developed, at NIST, making use of Chinese and

      Arabic newswire and web data and reference translations collected

      and developed by LDC. <o:p></o:p></p>

    <p class="MsoNormal">The objective of the OpenMT evaluation series

      is to support research in, and help advance the state of the art

      of, machine translation (MT) technologies -- technologies that

      translate text between human languages. Input may include all

      forms of text. The goal is for the output to be an adequate and

      fluent translation of the original. <o:p></o:p></p>

    <p class="MsoNormal">The MT evaluation series started in 2001 as

      part of the DARPA TIDES (Translingual Information Detection,

      Extraction and Summarization) program. Beginning with the 2006 evaluation, the

      evaluations have been driven and coordinated by NIST as NIST

      OpenMT. These evaluations provide an important contribution to the

      direction of research efforts and the calibration of technical

      capabilities in MT. The OpenMT evaluations are intended to be of

      interest to all researchers working on the general problem of

      automatic translation between human languages. To this end, they

      are designed to be simple, to focus on core technology issues and

      to be fully supported. For more general information about the NIST

      OpenMT evaluations, please refer to the <a

        href="http://www.nist.gov/itl/iad/mig/openmt.cfm">NIST OpenMT

        website</a>.<o:p></o:p></p>

    <p class="MsoNormal">This evaluation kit includes a single Perl

      script (mteval-v13a.pl) that may be used to produce a translation

      quality score for one or more MT systems. The script works by

      comparing the system output translation with a set of (expert)

      reference translations of the same source text. Comparison is

      based on finding sequences of words in the reference translations

      that match word sequences in the system output translation.<o:p></o:p></p>
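    <p class="MsoNormal">The idea of matching word sequences against
      several references can be illustrated with a short Python sketch.
      This is not the logic of mteval-v13a.pl: the script's tokenization
      and scoring formulas are not reproduced, and the clipping rule
      used here (crediting each word sequence at most as often as it
      appears in any single reference) is an assumption borrowed from
      common n-gram metrics. The function names and example sentences
      are invented.</p>

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-token sequences in 'tokens'."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(system, references, n):
    """Return (matched, total) n-gram counts for one system output.

    Each system n-gram is credited at most as many times as it occurs in
    the single reference where it is most frequent.
    """
    system_counts = ngrams(system, n)
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(
        min(count, max_ref_counts[gram]) for gram, count in system_counts.items()
    )
    return matched, sum(system_counts.values())


if __name__ == "__main__":
    # Invented example sentences, split on whitespace for simplicity.
    system = "the cat sat on the mat".split()
    references = [
        "the cat is on the mat".split(),
        "there is a cat on the mat".split(),
    ]
    print(clipped_matches(system, references, 1))  # prints (5, 6)
    print(clipped_matches(system, references, 2))  # prints (3, 5)
```

    <p class="MsoNormal">Dividing matched by total gives an n-gram
      precision for the chosen n; a full metric would combine several
      values of n and adjust for output length.</p>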

    <p class="MsoNormal">This release contains 2,748 documents with
      corresponding source and reference files, the latter of which
      contain four independent human reference translations of the
      source data. The source data consists of Arabic and Chinese
      newswire and web data collected by LDC in 2007. The table below
      shows the number of documents, segments and source tokens by
      source language and genre.</p>

    <table class="MsoNormalTable" style="mso-cellspacing:1.5pt;

      mso-yfti-tbllook:1184" border="0" cellpadding="0">

      <tbody>

        <tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Source<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Genre<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Documents<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Segments<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Source Tokens<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:1">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Arabic<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Newswire<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">84<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">784<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">20039<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:2">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Arabic<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Web Data<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">51<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">594<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">14793<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:3">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Chinese<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Newswire<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">82<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">688<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">26923<o:p></o:p></p>

          </td>

        </tr>

        <tr style="mso-yfti-irow:4;mso-yfti-lastrow:yes">

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Chinese<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">Web Data<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">40<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">682<o:p></o:p></p>

          </td>

          <td style="padding:.75pt .75pt .75pt .75pt">

            <p class="MsoNormal">19112<o:p></o:p></p>

          </td>

        </tr>

      </tbody>

    </table>

    <p class="MsoNormal"><o:p> </o:p><br>

    </p>

    <br>

    <div style="mso-element:comment-list"><br>

      <hr size="2" width="100%"></div>


    <pre class="moz-signature" cols="72">-- 


Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>