<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal" align="center"><b><b></b></b><i>N</i><i>ew</i><i>
publication</i><i>s</i><i> </i> </p>
<p class="MsoNormal" align="center"><b>- </b><b><a href="#gale">GALE
Arabic-English Parallel Aligned Treebank -- Newswire</a></b><b>
-</b><b><br>
</b></p>
<p class="MsoNormal" align="center"><b>- </b><b><a href="#madcat">MADCAT
Phase 2 Training Set</a></b><b> -</b><b></b></p>
<hr size="2" width="100%">
<div align="center"><b>New publications</b><br>
</div>
<p class="MsoNormal"> <a name="gale"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T10">GALE
Arabic-English Parallel Aligned Treebank -- Newswire</a>
(LDC2013T10) was developed by LDC and contains 267,520 tokens of
word-aligned Arabic and English parallel text with treebank
annotations. This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program. Parallel
aligned treebanks are treebanks annotated with morphological and
syntactic structures aligned at the sentence level and the
sub-sentence level. Such data sets are useful for natural language
processing and related fields, including automatic word alignment
system training and evaluation, transfer-rule extraction, word
sense disambiguation, translation lexicon extraction and cultural
heritage and cross-linguistic studies. With respect to machine
translation system development, parallel aligned treebanks may
improve system performance with enhanced syntactic parsers, better
rules and knowledge about language pairs and reduced word error
rate.<o:p></o:p></p>
<p class="MsoNormal">In this release, the source Arabic data was
translated into English. Arabic and English treebank annotations
were performed independently. The parallel texts were then word
aligned. The material in this corpus corresponds to the Arabic
treebanked data appearing in Arabic Treebank: Part 3 v 3.2 (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T08">LDC2010T08</a>)
(ATB) and to the English treebanked data in English Translation
Treebank: An-Nahar Newswire (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T02">LDC2012T02</a>).<o:p></o:p></p>
<p class="MsoNormal">The source data consists of Arabic newswire
from the Lebanese publication An Nahar collected by LDC in 2002.
All data is encoded as UTF-8. A count of files, words, tokens and
segments is below.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Tokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">364<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">182,351<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">267,520<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">7,711<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><br>
Note: Word count is based on the untokenized Arabic source and
token count is based on the ATB-tokenized Arabic source.<o:p></o:p></p>
<p class="MsoNormal">The purpose of the GALE word alignment task was
to find correspondences between words, phrases or groups of words
in a set of parallel texts. Arabic-English word alignment
annotation consisted of the following tasks:<o:p></o:p></p>
<ul>
<li>Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)</li>
<li>Identifying sentence segments not suitable for annotation,
e.g., blank segments, incorrectly-segmented segments, segments
with foreign languages</li>
<li>Tagging unmatched words attached to other words or phrases</li>
</ul>
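<p class="MsoNormal">As a rough illustration only (the release
documentation defines the actual alignment file format), the
following Python sketch models a word-aligned segment carrying the
link types described above. The class names, field names and the
romanized tokens are assumptions made for this sketch, not the
corpus schema.</p>
<pre>
# Illustrative only: a minimal in-memory model of word-aligned parallel
# segments. The class names, field names and link-type labels below are
# assumptions for this sketch, not the corpus's actual file schema.
from dataclasses import dataclass, field
from typing import List, Tuple

LINK_TYPES = {
    "translated-correct",
    "translated-incorrect",
    "not-translated-correct",
    "not-translated-incorrect",
}

@dataclass
class WordLink:
    source_indices: Tuple[int, ...]   # Arabic token positions in the segment
    target_indices: Tuple[int, ...]   # English token positions in the segment
    link_type: str                    # one of LINK_TYPES

@dataclass
class AlignedSegment:
    source_tokens: List[str]          # ATB-tokenized Arabic (romanized here)
    target_tokens: List[str]          # English translation tokens
    links: List[WordLink] = field(default_factory=list)
    not_suitable: bool = False        # e.g. blank or foreign-language segment

    def add_link(self, src, tgt, link_type):
        if link_type not in LINK_TYPES:
            raise ValueError("unknown link type: " + link_type)
        self.links.append(WordLink(tuple(src), tuple(tgt), link_type))

# Toy usage with made-up romanized tokens: one single-word link and one
# many-to-many link covering a clitic split by ATB tokenization.
seg = AlignedSegment(["ktb", "Al+", "rjl"], ["the", "man", "wrote"])
seg.add_link([0], [2], "translated-correct")        # ktb aligns to "wrote"
seg.add_link([1, 2], [0, 1], "translated-correct")  # Al+ rjl aligns to "the man"
</pre>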
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="madcat"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T09">MADCAT
Phase
2
Training Set</a> (LDC2013T09) contains all training data created
by LDC to support Phase 2 of the DARPA MADCAT (Multilingual
Automatic Document Classification Analysis and
Translation)Program. The data in this release consists of
handwritten Arabic documents, scanned at high resolution and
annotated for the physical coordinates of each line and token.
Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers
integrated in a single MADCAT XML output. <o:p></o:p></p>
<p class="MsoNormal">The goal of the MADCAT program is to
automatically convert foreign text images into English
transcripts. MADCAT Phase 2 data was collected from Arabic source
documents in three genres: newswire, weblog and newsgroup text.
Arabic-speaking scribes copied documents by hand, following
specific instructions on writing style (fast, normal, careful),
writing implement (pen, pencil) and paper (lined, unlined). Prior
to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some
original source documents being broken into multiple pages for
handwriting. Each resulting handwritten page was assigned to up to
five independent scribes, using different writing conditions. <o:p></o:p></p>
<p class="MsoNormal">The handwritten, transcribed documents were
checked for quality and completeness, then each page was scanned
at a high resolution (600 dpi, greyscale) to create a digital
version of the handwritten document. The scanned images were then
annotated to indicate the physical coordinates of each line and
token. Explicit reading order was also labeled, along with any
errors produced by the scribes when copying the text. The
annotation results in GEDI XML output files (gedi.xml), which
include ground truth annotations and source transcripts.<o:p></o:p></p>
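<p class="MsoNormal">As one possible way to consume such line and
token coordinates (a sketch only, in Python), the fragment below
overlays zone bounding boxes from a gedi.xml file onto the
corresponding scanned page. The element and attribute names used
(DL_ZONE, col, row, width, height) are assumptions and should be
verified against the gedi.xml files and documentation in the
release.</p>
<pre>
# Sketch: overlay bounding boxes from a GEDI XML file onto its scanned
# TIFF page to eyeball the ground-truth coordinates. The element and
# attribute names (DL_ZONE, col, row, width, height) are assumptions;
# verify them against the gedi.xml files in the release.
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

def overlay_zones(gedi_path, tiff_path, out_path):
    page = Image.open(tiff_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    root = ET.parse(gedi_path).getroot()
    drawn = 0
    for zone in root.iter("DL_ZONE"):          # assumed zone element name
        try:
            x = int(zone.get("col"))           # assumed attribute names
            y = int(zone.get("row"))
            w = int(zone.get("width"))
            h = int(zone.get("height"))
        except (TypeError, ValueError):
            continue                           # zone without a simple box
        draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=3)
        drawn += 1
    page.save(out_path)
    return drawn

# Example call with placeholder file names:
# overlay_zones("page_0001.gedi.xml", "page_0001.tif", "page_0001_boxes.png")
</pre>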
<p class="MsoNormal">The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file with all required information. The
resulting madcat.xml file has these distinct components: (1) a
text layer that consists of the source text, tokenization and
sentence segmentation, (2) an
image layer that consists of bounding boxes, (3) a scribe
demographic layer that consists of scribe ID and partition
(train/test) and (4) a document metadata layer. <o:p></o:p></p>
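<p class="MsoNormal">The sketch below illustrates, in Python, how
the four layers of such a file might be pulled apart. Every element
name used (text_layer, segment, token, image_layer, zone, writer,
metadata) is a placeholder assumption for illustration; the actual
madcat.xml element names are defined in the corpus
documentation.</p>
<pre>
# Sketch: pull a few fields out of the four layers of a MADCAT XML file.
# Every tag name used here (text_layer, segment, token, image_layer, zone,
# writer, metadata) is a placeholder for illustration; substitute the
# element names defined in the corpus documentation.
import xml.etree.ElementTree as ET

def summarize_madcat(path):
    root = ET.parse(path).getroot()
    return {
        # (1) text layer: source text, tokenization, sentence segmentation
        "segments": len(root.findall(".//text_layer/segment")),
        "tokens": len(root.findall(".//text_layer/segment/token")),
        # (2) image layer: bounding boxes for lines and tokens
        "zones": len(root.findall(".//image_layer/zone")),
        # (3) scribe demographic layer: scribe ID and train/test partition
        "scribe_id": root.findtext(".//writer/id"),
        "partition": root.findtext(".//writer/partition"),
        # (4) document metadata layer
        "source_document": root.findtext(".//metadata/source_document"),
    }

# Example call with a placeholder path:
# print(summarize_madcat("page_0001.madcat.xml"))
</pre>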
<p class="MsoNormal">This release includes 27,814 annotation files
in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml)
along with their corresponding scanned image files in TIFF format.<o:p></o:p></p>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>