<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal"><i>New </i><i>publications:</i></p>
<p class="MsoNormal"><b>- <a href="#ace">ACE 2007 Multilingual
Training Corpus</a> -<br>
</b></p>
<p class="MsoNormal"><b>- <a href="#galeae">GALE Arabic-English
Word Alignment -- Broadcast Training Part 1</a></b><b> -<br>
</b></p>
<p class="MsoNormal"><b>- <a href="#gale2">GALE Phase 2 Chinese
Newswire Parallel Text Part 2</a> -</b></p>
<hr size="2" width="100%">
<p class="MsoNormal"><b>New publications</b><br>
<br>
<a name="ace"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014T18">ACE 2007
Multilingual Training Corpus</a> was developed by LDC and
contains the complete set of Arabic and Spanish training data for
the <a href="http://www.itl.nist.gov/iad/mig/tests/ace/2007/">2007
Automatic Content Extraction</a> (ACE) technology evaluation:
Arabic and Spanish newswire data and Arabic weblogs, all
annotated for entities and temporal expressions. The objective of
the ACE program was to develop automatic content extraction
technology to support automatic processing of human language in
text form from a variety of sources including newswire, broadcast
programming and weblogs. In the 2007 evaluation, participants were
tested on system performance for the recognition of entities,
values, temporal expressions, relations, and events in Chinese and
English and for the recognition of entities and temporal
expressions in Arabic and Spanish. LDC's work in the ACE program
is described in more detail on the LDC <a
href="https://www.ldc.upenn.edu/collaborations/past-projects/ace">ACE
project</a> pages.<o:p></o:p></p>
<p class="MsoNormal">The Arabic data is composed of newswire (60%)
published from October through December 2000 and weblogs (40%)
published from November 2004 through February 2005. The
Spanish data set consists entirely of newswire material from
multiple sources published from January through April 2005. A document
pool was established for each language based on genre and epoch
requirements. Human reviewers then selected individual documents
from the pool that were suitable for ACE annotation, such as
documents that were representative of their genre and contained
targeted ACE entity types. One annotator completed the entity and
temporal expression (TIMEX2) markup in the first pass annotation.
A senior annotator reviewed this work in the second pass. TIMEX2
values were normalized by an annotator specifically trained for that
task.</p>
<p class="MsoNormal">The table below describes the amount of data
included in the current release and its annotation status. For each
language and data type, newswire (NW) or weblog (WL), word and file
counts are given at each of the three stages of annotation: first
pass annotation (1P), second pass annotation (2P), and TIMEX2
normalization and additional quality control (NORM).</p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">&nbsp;</p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt" colspan="3">
<p class="MsoNormal">Words</p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt" colspan="3">
<p class="MsoNormal">Files</p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"> <o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NORM<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NORM<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NW<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">58,015<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">58,015<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">58,015<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">257<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">257<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">257<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:4">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">WL<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">40,338<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">40,338<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">40,338<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">121<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">121<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">121<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:5">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Total<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">98,353<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">98,353<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">98,353<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">378<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">378<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">378<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:6">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Spanish<o:p></o:p></p>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
<td style="border:none;padding:.75pt .75pt .75pt .75pt"><br>
</td>
</tr>
<tr style="mso-yfti-irow:7">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">&nbsp;</p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt" colspan="3">
<p class="MsoNormal">Words</p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt" colspan="3">
<p class="MsoNormal">Files</p>
</td>
</tr>
<tr style="mso-yfti-irow:8">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"> <o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NORM<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2P<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NORM<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:9">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">NW<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:10;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Total<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">100,401<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">352<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">For a given document, there is a source .sgm
file together with the .ag.xml and .apf.xml annotation files in
each of the three directories "1p", "2p" and "timex2norm". In
other words, for each newswire story or weblog entry, the three
annotation directories each contain an identical copy of the
source text (SGML .sgm file) along with distinct versions of the
associated annotations (XML .ag.xml and .apf.xml files and plain
text .tab files). All files are encoded in UTF-8.</p>
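<p class="MsoNormal">As a rough illustration of the layout described
above, the following Python sketch checks that a document has its
source and annotation files in each of the three stage directories.
The directory names and file suffixes come from the description; the
assumption that the three directories sit directly under a single
root, the function name missing_files and the document ID DOC001 are
hypothetical and may not match the actual release.</p>
<pre>
```python
from pathlib import Path
import tempfile

# Stage directories and file suffixes named in the corpus description.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml")


def missing_files(root, doc_id):
    """List expected files for doc_id that are absent under root.

    Assumes (hypothetically) that the stage directories are direct
    children of root; the real release may nest them differently.
    """
    missing = []
    for stage in STAGES:
        for suffix in SUFFIXES:
            path = Path(root) / stage / (doc_id + suffix)
            if not path.is_file():
                missing.append(path.relative_to(root).as_posix())
    return missing


# Build a throwaway tree for a made-up document ID, then remove one file.
with tempfile.TemporaryDirectory() as root:
    for stage in STAGES:
        (Path(root) / stage).mkdir()
        for suffix in SUFFIXES:
            (Path(root) / stage / ("DOC001" + suffix)).touch()
    (Path(root) / "2p" / "DOC001.apf.xml").unlink()
    report = missing_files(root, "DOC001")

print(report)
```
</pre>
<p class="MsoNormal">The .tab files are left out of the check
because the description does not say whether every directory carries
one for every document.</p>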
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="galeae"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T19">GALE
Arabic-English Word Alignment -- Broadcast Training Part 1</a>
was developed by LDC and contains 267,257 tokens of word-aligned
Arabic and English parallel text enriched with linguistic tags.
This material was used as training data in the DARPA GALE (Global
Autonomous Language Exploitation) program.<o:p></o:p></p>
<p class="MsoNormal">Some approaches to statistical machine
translation incorporate linguistic knowledge into
word-aligned text as a means to improve automatic word alignment
and machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies
minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags to
describe these translation units and relations. Tagging adds
contextual, syntactic and language-specific features to the
alignment annotation.</p>
<p class="MsoNormal">This release consists of Arabic source
broadcast news (BN) and broadcast conversation (BC) data collected
by LDC from 2007 to 2009. The distribution by genre, files, words,
tokens and segments appears below:</p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Genre<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Tokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">BC<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">231<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">79,485<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">103,816<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">4,114<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">BN<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">92<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">131,789<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">163,441<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">7,227<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Totals<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal"> <o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">323<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">211,274<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">267,257<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">11,341<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">Note that word count is based on the
untokenized Arabic source, and token count is based on the
tokenized Arabic source.<o:p></o:p></p>
<p class="MsoNormal">The Arabic word alignment tasks consisted of
the following components:<o:p></o:p></p>
<ul>
<li>Normalizing tokenized tokens as needed</li>
<li>Identifying different types of links</li>
<li>Identifying sentence segments not suitable for annotation</li>
<li>Tagging unmatched words attached to other words or phrases<o:p></o:p></li>
</ul>
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="gale2"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014T20">GALE Phase 2
Chinese Newswire Parallel Text Part 2</a> was developed by LDC.
Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. This corpus contains
117,895 tokens of Chinese source text and corresponding English
translations selected from newswire data collected by LDC in 2007
and translated by LDC or under its direction.<o:p></o:p></p>
<p class="MsoNormal">This release includes 177 source-translation
document pairs, comprising 117,895 tokens of translated data. Data
is drawn from four distinct Chinese newswire sources: China News
Service, Guangming Daily, People's Daily and People's Liberation
Army Daily.<o:p></o:p></p>
<p class="MsoNormal">Data was manually selected for translation
according to several criteria, including linguistic features and
topic features. The files were formatted into a human-readable
translation format and assigned to translation vendors.
Translators followed LDC's Chinese to English translation
guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.<o:p></o:p></p>
<p class="MsoNormal">Source data and translations are distributed in
TDF format. TDF files are tab-delimited files in which each line
contains one segment of text along with metadata about that segment.
Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.</p>
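<p class="MsoNormal">For readers who want to load such files, here is
a minimal Python sketch of reading tab-delimited text with the
standard csv module. It is not LDC's own tooling. The two column
names in the sample (segment_id, text) are invented for illustration,
and treating the first line as a header row is an assumption; the
actual field list is the one documented in TDF_format.text.</p>
<pre>
```python
import csv
import io


def read_tab_delimited(stream):
    """Return (header, rows) from a tab-delimited text stream.

    The first non-empty line is treated as a header row. That is an
    assumption made for this sketch, not a statement about TDF.
    """
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]


# Hypothetical two-column sample; real TDF files define their own fields.
sample = "segment_id\ttext\n1\tfirst segment\n2\tsecond segment\n"
header, segments = read_tab_delimited(io.StringIO(sample))

print(header)
print(segments[1]["text"])
```
</pre>
<p class="MsoNormal">When reading a file from disk rather than an
in-memory string, opening it with encoding="utf-8" and newline=""
matches the UTF-8 encoding stated above and the csv module's
documented usage.</p>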
<br>
<hr size="2" width="100%"> <br>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>