<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<p align="center">LDC2006T02<br>
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02"><b>Arabic
Gigaword Second Edition</b></a><br>
<b><br>
</b>LDC2006S01<b><br>
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01">CSLU:
Voices</a></b>
<br>
</p>
<p align="center">LDC2006T04<b><br>
</b><b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04">Multiple
Translation Chinese (MTC) Part 4</a><br>
</b></p>
<br>
<p align="center">The Linguistic Data
Consortium (LDC) is pleased to announce the
availability of three new publications.<br>
</p>
<hr size="2" width="100%"><br>
<p align="center"><b>New LDC Publications<br>
<br>
</b></p>
<p>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02">Arabic
Gigaword Second Edition</a> is a comprehensive archive of
newswire text data that has been acquired from Arabic news sources by
the Linguistic Data Consortium (LDC). Arabic Gigaword Second Edition
includes all of the content of the first edition of Arabic Gigaword
(LDC2003T12) as well as new data. </p>
<p>Arabic Gigaword contains five distinct sources of Arabic newswire: </p>
<p>
<table>
<tbody>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Agence France Presse</td>
<td colspan="20" align="left">(afp_arb; formerly afa)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Al Hayat News Agency</td>
<td colspan="20" align="left">(hyt_arb; formerly alh)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">An Nahar News Agency</td>
<td colspan="20" align="left">(nhr_arb; formerly ann)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Ummah Press</td>
<td colspan="20" align="left">(umh_arb)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Xinhua News Agency</td>
<td colspan="20" align="left">(xin_arb; formerly xia)</td>
</tr>
</tbody>
</table>
</p>
<p>The seven-character codes in parentheses above consist of a
three-character source ID and the three-character language code
("arb"), separated by an underscore ("_"). The language code
identifies Standard Arabic in the ISO 639-3 standard.
In the first edition of the Arabic Gigaword corpus, a simpler
three-character scheme identified both the source and
the language. The new convention distinguishes data sets by
source and language more naturally when a single newswire provider
distributes data in multiple languages. </p>
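<p>As a quick illustration (not part of the corpus documentation), the
ID convention above can be split mechanically. The helper below is
hypothetical; the code mapping only restates the codes listed in this
announcement:</p>

```python
# Illustrative sketch of the source_language ID convention described
# above, e.g. "afp_arb" = source "afp" + ISO 639-3 language "arb".
# FORMER_CODES restates the first-edition codes from this announcement
# (Ummah Press is new in the Second Edition, so it has no former code).
FORMER_CODES = {
    "afp_arb": "afa",
    "hyt_arb": "alh",
    "nhr_arb": "ann",
    "xin_arb": "xia",
}

def split_source_id(code):
    """Split a seven-character ID like 'afp_arb' into (source, language)."""
    source, language = code.split("_")
    return source, language

print(split_source_id("afp_arb"))  # ('afp', 'arb')
```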
<p>Ummah Press is a new source added to the Second Edition. The
following table shows the new data that appear for the first time in
the Second Edition. </p>
<p>
<table>
<tbody>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Agence France Presse</td>
<td colspan="30" align="left">2003.01-2004.12</td>
<td colspan="20" align="right">143,766 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Al Hayat News Agency</td>
<td colspan="30" align="left">2002.01-2003.12</td>
<td colspan="20" align="right">64,308 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">An Nahar News Agency</td>
<td colspan="30" align="left">2003.01-2004.01</td>
<td colspan="20" align="right">16,316 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Ummah Press</td>
<td colspan="30" align="left">2003.01-2004.12</td>
<td colspan="20" align="right">4,641 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Xinhua News Agency</td>
<td colspan="30" align="left">2003.06-2004.12</td>
<td colspan="20" align="right">106,236 documents</td>
</tr>
</tbody>
</table>
</p>
<p>There are 423 files, totaling approximately 1.4 GB in compressed form
(5,359 MB uncompressed; 1,591,983 K-words). <br>
</p>
<br>
<p align="center">*<br>
</p>
<p>(2) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01">CSLU:
Voices</a> corpus contains 12 speakers reading 50 phonetically
rich sentences. The recording procedure involved a "mimicking" approach
which resulted in a high degree of natural time-alignment between
different speakers. The acoustic wave and the concurrent laryngograph
signal were recorded for one "free" and two "mimicked" renditions of each
sentence. Pitch marks, calculated from the laryngograph signal, and
time marks, the output of a forced-alignment algorithm, have been added
to the corpus. <br>
</p>
<br>
<br>
<div align="center">*<br>
</div>
<p>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04">Multiple-Translation
Chinese (MTC) Part 4</a> supports the development of
automatic means for evaluating translation quality. The LDC was
sponsored to solicit four sets of human translations for a single set
of Chinese source materials. The LDC was also asked to produce
translations from various commercial off-the-shelf (COTS) systems,
including commercial Machine Translation (MT) systems as well as MT
systems available on the Internet. There are a total of five sets of
COTS outputs and six output sets from TIDES 2003 MT Evaluation
participants. </p>
<p>To see whether automatic evaluation metrics, such as BLEU, track
human assessment, the LDC also performed human assessment on one COTS
output and the six TIDES research systems. The corpus includes the
assessment results for one of the five COTS systems, the assessment
results for the six TIDES research systems, and the specifications used
for conducting the assessments. </p>
<p>Multiple-Translation Chinese (MTC) Part 4 contains two sources of
journalistic Chinese text:<br>
</p>
<p>- Xinhua News Agency: 50 news stories<br>
- AFP News Service: 50 news stories<br>
</p>
<p>There are 100 source files and 1,100 translation files. All source
data were drawn from LDC's January and February 2003 collections of
Xinhua and AFP Chinese newswire data. The Chinese data comprise
approximately 21K words, while the English translations comprise
396K words in total, with 16K unique words. <br>
</p>
<br>
<br>
<hr size="2" width="100%">
<div align="center"><br>
If you need further information, or would like to inquire about
membership in the LDC, please email <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215
573 1275.<br>
</div>
<div align="center">--------------------------------------------------------------------<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
<p><br>
</p>
</body>
</html>