[Corpora-List] New from LDC

Fri Sep 21 12:30:12 UTC 2012

New publications:

- GALE Chinese-English Word Alignment and Tagging Training Part 1 --
  Newswire and Web

- MADCAT Phase 1 Training Set

------------------------------------------------------------------------

New Publications

(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 -- 
Newswire and Web 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T16> 
was developed by LDC and contains 150,068 tokens of word-aligned Chinese 
and English parallel text enriched with linguistic tags. This material 
was used as training data in the DARPA GALE 
<http://projects.ldc.upenn.edu/gale/index.html> (Global Autonomous 
Language Exploitation) program.  This release consists of Chinese source 
newswire and web data (newsgroup, weblog) collected by LDC in 2008.

Some approaches to statistical machine translation incorporate
linguistic knowledge into word-aligned text as a means to improve
automatic word alignment and machine translation quality. This release
does so with two annotation schemes: alignment and tagging. Alignment
identifies minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging scheme
defines a set of word tags and alignment link tags that describe these
translation units and relations, adding contextual, syntactic and
language-specific features to the alignment annotation.

The Chinese word alignment tasks consisted of the following components:

- Identifying, aligning, and tagging 8 different types of links

- Identifying, attaching, and tagging local-level unmatched words

- Identifying and tagging sentence/discourse-level unmatched words

- Identifying and tagging all instances of Chinese 的 (DE) except when
  they were a part of a semantic link.
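
As a rough illustration of the alignment-plus-tagging idea described
above, the following Python sketch models a sentence pair as two token
lists plus tagged links, and derives the unmatched words on each side.
The class names, the tag string and the toy sentence pair are
assumptions made for this example; they are not the file format or the
tag inventory of the release.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlignmentLink:
    """One translation unit: source/target token positions plus a link tag."""
    source_indices: tuple
    target_indices: tuple
    link_tag: str  # hypothetical label, not the corpus's actual tag set


@dataclass
class AlignedSentence:
    source_tokens: list
    target_tokens: list
    links: list = field(default_factory=list)

    def unmatched_source(self):
        """Positions of source tokens that no link covers."""
        covered = {i for link in self.links for i in link.source_indices}
        return [i for i in range(len(self.source_tokens)) if i not in covered]

    def unmatched_target(self):
        """Positions of target tokens that no link covers."""
        covered = {i for link in self.links for i in link.target_indices}
        return [i for i in range(len(self.target_tokens)) if i not in covered]


# Toy pair invented for illustration only.
pair = AlignedSentence(
    source_tokens=["我", "的", "书"],
    target_tokens=["my", "book"],
    links=[
        AlignmentLink((0,), (0,), "semantic"),
        AlignmentLink((2,), (1,), "semantic"),
    ],
)

print(pair.unmatched_source())
print(pair.unmatched_target())
```

In this toy pair the DE token at position 1 is covered by no link, so
it is the kind of word that would then be attached or tagged
separately. How the release itself encodes such cases is documented
with the corpus, not here.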


(2) MADCAT Phase 1 Training Set 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T15> 
contains all training data created by LDC to support Phase 1 of the 
DARPA MADCAT Program. The data in this release consists of handwritten 
Arabic documents scanned at high resolution and annotated for the 
physical coordinates of each line and token. Digital transcripts and 
English translations of each document are also provided, with the 
various content and annotation layers integrated in a single MADCAT XML 
output.

The goal of the MADCAT program is to automatically convert foreign text 
images into English transcripts. MADCAT Phase 1 data was collected by 
LDC from Arabic source documents in three genres: newswire, weblog and 
newsgroup text. Arabic-speaking "scribes" copied documents by hand, 
following specific instructions on writing style (fast, normal, 
careful), writing implement (pen, pencil) and paper (lined, unlined). 
Prior to assignment, source documents were processed to optimize their 
appearance for the handwriting task, which resulted in some original 
source documents being broken into multiple "pages" for handwriting. 
Each resulting handwritten page was assigned to up to five independent 
scribes, using different writing conditions.

The handwritten, transcribed documents were checked for quality and 
completeness, then each page was scanned at a high resolution (600 dpi, 
greyscale) to create a digital version of the handwritten document. The 
scanned images were then annotated to indicate the physical coordinates 
of each line and token. Explicit reading order was also labeled, along 
with any errors produced by the scribes when copying the text.

The final step was to merge the multiple data streams into a unified
format: a single XML output file that contains all required
information. The resulting XML file has these distinct components: a
text layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a scribe
demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer. This release includes
9,693 annotation files in MADCAT XML format (.madcat.xml) along with
their corresponding scanned image files in TIFF format.
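
To make the layered structure more concrete, here is a minimal Python
sketch that reads token bounding boxes from a small XML fragment and
converts pixel measurements to millimetres at the 600 dpi scan
resolution mentioned above. Every element and attribute name in the
fragment (doc, scribe, line, token, source, x, y, width, height, order)
is invented for this example and is not the actual MADCAT XML schema;
consult the documentation shipped with the release for the real layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: names and values are illustrative assumptions only.
SAMPLE = """\
<doc id="example-page">
  <scribe id="scribe-01" partition="train"/>
  <line order="1">
    <token id="t1" x="1200" y="600" width="300" height="150">
      <source>كتاب</source>
    </token>
    <token id="t2" x="600" y="600" width="450" height="150">
      <source>جديد</source>
    </token>
  </line>
</doc>
"""

DPI = 600            # scan resolution stated in the announcement
MM_PER_INCH = 25.4


def px_to_mm(pixels, dpi=DPI):
    """Convert a length in pixels to millimetres for a given resolution."""
    return pixels * MM_PER_INCH / dpi


def read_tokens(xml_text):
    """Collect each token's id, text, line order and pixel bounding box."""
    root = ET.fromstring(xml_text)
    tokens = []
    for line in root.iter("line"):
        for tok in line.iter("token"):
            box = tuple(int(tok.get(k)) for k in ("x", "y", "width", "height"))
            tokens.append({
                "id": tok.get("id"),
                "text": tok.findtext("source"),
                "line_order": int(line.get("order")),
                "box_px": box,
            })
    return tokens


tokens = read_tokens(SAMPLE)
for tok in tokens:
    width_mm = px_to_mm(tok["box_px"][2])
    print(tok["id"], tok["text"], tok["box_px"], round(width_mm, 2))
```

Keeping the text, the pixel box and the reading order on one record
mirrors the idea of integrating the layers in a single file, but the
record shape here is only one possible choice.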

------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
