<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<p class="MsoNormal" align="center"><i>New publications</i></p>
<p class="MsoNormal" align="center"><b>- <a href="#gale">GALE
Chinese-English Word Alignment and Tagging Training Part 1 --
Newswire and Web</a></b><b> -</b></p>
<p class="MsoNormal" align="center"><b>- </b><a href="#madcat"><b>MADCAT
Phase 1 Training Set</b></a> <b>-</b></p>
<hr size="2" width="100%"><br>
<p class="MsoNormal" align="center"><b>New Publications<br>
</b></p>
<p class="MsoNormal"><a name="gale"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T16">GALE
Chinese-English
Word Alignment and Tagging Training Part 1 -- Newswire and Web</a>
was developed by LDC and contains 150,068 tokens of word-aligned
Chinese and English parallel text enriched with linguistic tags.
This material was used as training data in the <a
href="http://projects.ldc.upenn.edu/gale/index.html">DARPA GALE</a>
(Global Autonomous Language Exploitation) program. This release
consists of Chinese source newswire and web data (newsgroup, weblog)
collected by LDC in 2008.</p>
<p class="MsoNormal">Some approaches to statistical machine
translation incorporate linguistic knowledge into word-aligned
text as a means to improve automatic word alignment and machine
translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum
translation units and translation relations by using minimum-match
and attachment annotation approaches. The tagging scheme defines a
set of word tags and alignment link tags to describe these
translation units and relations. Tagging adds contextual, syntactic
and language-specific features to the alignment annotation.</p>
<p class="MsoNormal">The Chinese word alignment tasks consisted of
the following components:</p>
<p class="MsoNormal">- Identifying, aligning, and tagging 8 different
types of links</p>
<p class="MsoNormal">- Identifying, attaching, and tagging
local-level unmatched words</p>
<p class="MsoNormal">- Identifying and tagging
sentence/discourse-level unmatched words</p>
<p class="MsoNormal">- Identifying and tagging all instances of
Chinese <span style="font-family:'MS Gothic';mso-bidi-font-family:'MS Gothic'">的</span>
(DE) except when they were a part of a semantic link.</p>
<div align="center">*</div>
<p class="MsoNormal"><a name="madcat"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T15">MADCAT
Phase 1 Training Set</a> contains all training data created by
LDC to support Phase 1 of the DARPA MADCAT Program. The data in
this release consists of handwritten Arabic documents scanned at
high resolution and annotated for the physical coordinates of each
line and token. Digital transcripts and English translations of
each document are also provided, with the various content and
annotation layers integrated in a single MADCAT XML output.</p>
<p class="MsoNormal">The goal of the MADCAT program is to
automatically convert foreign-language text images into English
transcripts. MADCAT Phase 1 data was collected by LDC from Arabic
source documents in three genres: newswire, weblog and newsgroup
text. Arabic-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to
optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple
"pages" for handwriting. Each resulting handwritten page was
assigned to up to five independent scribes, using different
writing conditions.</p>
<p class="MsoNormal">The handwritten, transcribed documents were
checked for quality and completeness, and each page was then
scanned at high resolution (600 dpi, greyscale) to create a digital
version of the handwritten document. The scanned images were
annotated to indicate the physical coordinates of each line and
token. Explicit reading order was also labeled, along with any
errors produced by the scribes when copying the text.</p>
<p class="MsoNormal">The final step was to produce a unified data
format that takes multiple data streams and generates a single XML
output file containing all required information. The resulting
XML file has these distinct components: a text layer that consists
of the source text, tokenization and sentence segmentation; an
image layer that consists of bounding boxes; a scribe demographic
layer that consists of scribe ID and partition (train/test); and a
document metadata layer. This release includes 9,693 annotation
files in MADCAT XML format (.madcat.xml) along with their
corresponding scanned image files in TIFF format.</p>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>