<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<p align="center">LDC2006T02<br>
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02"><b>Arabic
Gigaword Second Edition</b></a><br>
<b><br>
</b>LDC2006S01<b><br>
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01">CSLU:
Voices</a></b>
<br>
</p>
<p align="center">LDC2006T04<b><br>
</b><b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04">Multiple
Translation Chinese (MTC) Part 4</a><br>
</b></p>
<br>
<p align="center">The Linguistic Data
Consortium (LDC) is pleased to announce the
availability of three new publications.<br>
</p>
<hr size="2" width="100%"><br>
<p align="center"><b>New LDC Publications<br>
<br>
</b></p>
<p>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02">Arabic
Gigaword Second Edition</a> is a comprehensive archive of
newswire text data that has been acquired from Arabic news sources by
the Linguistic Data Consortium (LDC). Arabic Gigaword Second Edition
includes all of the content of the first edition of Arabic Gigaword
(LDC2003T12) as well as new data. </p>
<p>Arabic Gigaword contains five distinct sources of Arabic newswire: </p>
<p>
<table>
<tbody>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Agence France Presse</td>
<td colspan="20" align="left">(afp_arb; formerly afa)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Al Hayat News Agency</td>
<td colspan="20" align="left">(hyt_arb; formerly alh)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">An Nahar News Agency</td>
<td colspan="20" align="left">(nhr_arb; formerly ann)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Ummah Press</td>
<td colspan="20" align="left">(umh_arb)</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="60" align="left">Xinhua News Agency</td>
<td colspan="20" align="left">(xin_arb; formerly xia)</td>
</tr>
</tbody>
</table>
</p>
<p>The seven-character codes in parentheses above consist of a
three-character source ID and the three-character language code
("arb"), separated by an underscore ("_"). The language code
identifies Standard Arabic in the ISO 639-3 standard.
In the first edition of the Arabic Gigaword corpus, a simpler
three-character scheme identified both the source and
the language. The new convention distinguishes data sets by
source and language more naturally when a single newswire provider
distributes data in multiple languages. </p>
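<p>As a quick illustration (not part of the corpus documentation), the
ID convention above can be split mechanically. The helper below is
hypothetical; the code mapping only restates the codes listed in this
announcement:</p>

```python
# Illustrative sketch of the source_language ID convention described
# above, e.g. "afp_arb" = source "afp" + ISO 639-3 language "arb".
# FORMER_CODES restates the first-edition codes from this announcement
# (Ummah Press is new in the Second Edition, so it has no former code).
FORMER_CODES = {
    "afp_arb": "afa",
    "hyt_arb": "alh",
    "nhr_arb": "ann",
    "xin_arb": "xia",
}

def split_source_id(code):
    """Split a seven-character ID like 'afp_arb' into (source, language)."""
    source, language = code.split("_")
    return source, language

print(split_source_id("afp_arb"))  # ('afp', 'arb')
```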
<p>Ummah Press is a new source added to the Second Edition. The
following table shows the new data that appear for the first time in
the Second Edition. </p>
<p>
<table>
<tbody>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Agence France Presse</td>
<td colspan="30" align="left">2003.01-2004.12</td>
<td colspan="20" align="right">143,766 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Al Hayat News Agency</td>
<td colspan="30" align="left">2002.01-2003.12</td>
<td colspan="20" align="right">64,308 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">An Nahar News Agency</td>
<td colspan="30" align="left">2003.01-2004.01</td>
<td colspan="20" align="right">16,316 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Ummah Press</td>
<td colspan="30" align="left">2003.01-2004.12</td>
<td colspan="20" align="right">4,641 documents</td>
</tr>
<tr>
<td colspan="20" align="left"><br>
</td>
<td colspan="30" align="left">Xinhua News Agency</td>
<td colspan="30" align="left">2003.06-2004.12</td>
<td colspan="20" align="right">106,236 documents</td>
</tr>
</tbody>
</table>
</p>
<p>There are 423 files, totaling approximately 1.4 GB in compressed form
(5,359 MB uncompressed; 1,591,983 K-words). <br>
</p>
<br>
<p align="center">*<br>
</p>
<p>(2) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01">CSLU:
Voices</a> corpus contains 12 speakers reading 50 phonetically
rich sentences. The recording procedure involved a "mimicking" approach
which resulted in a high degree of natural time-alignment between
different speakers. The acoustic wave and the concurrent laryngograph
signal were recorded for one "free" and two "mimicked" renditions of each
sentence. Pitch marks, calculated from the laryngograph signal, and
time marks, the output of a forced-alignment algorithm, have been added
to the corpus. <br>
</p>
<br>
<br>
<div align="center">*<br>
</div>
<p>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04">Multiple-Translation
Chinese (MTC) Part 4</a> supports the development of
automatic means for evaluating translation quality. The LDC was
sponsored to solicit four sets of human translations for a single set
of Chinese source materials. The LDC was also asked to produce
translations from various commercial off-the-shelf (COTS) systems,
including commercial Machine Translation (MT) systems as well as MT
systems available on the Internet. There are a total of five sets of
COTS outputs and six output sets from TIDES 2003 MT Evaluation
participants. </p>
<p>To see whether automatic evaluation metrics, such as BLEU, track
human assessment, the LDC also performed human assessment on one COTS
output and the six TIDES research systems. The corpus includes the
assessment results for one of the five COTS systems, the assessment
results for the six TIDES research systems, and the specifications used
for conducting the assessments. </p>
<p>Multiple-Translation Chinese (MTC) Part 4 contains two sources of
journalistic Chinese text:<br>
</p>
<p>- Xinhua News Agency: 50 news stories<br>
- AFP News Service: 50 news stories<br>
</p>
<p>There are 100 source files and 1,100 translation files. All source
data were drawn from LDC's January and February 2003 collections of
Xinhua and AFP Chinese newswire data. The Chinese data comprise
approximately 21K words, while the English translations comprise
396K words in total, with 16K unique words. <br>
</p>
<br>
<br>
<hr size="2" width="100%">
<div align="center"><br>
If you need further information, or would like to inquire about
membership in the LDC, please email <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215
573 1275.<br>
</div>
<div align="center">--------------------------------------------------------------------<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
<p><br>
</p>
</body>
</html>