[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Sep 21 12:30:12 UTC 2012
New publications

- GALE Chinese-English Word Alignment and Tagging Training Part 1 --
  Newswire and Web
- MADCAT Phase 1 Training Set
------------------------------------------------------------------------
New Publications

(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 --
Newswire and Web
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T16>
was developed by LDC and contains 150,068 tokens of word-aligned Chinese
and English parallel text enriched with linguistic tags. This material
was used as training data in the DARPA GALE
<http://projects.ldc.upenn.edu/gale/index.html> (Global Autonomous
Language Exploitation) program. This release consists of Chinese source
newswire and web data (newsgroup, weblog) collected by LDC in 2008.
Some approaches to statistical machine translation incorporate
linguistic knowledge into word-aligned text as a means to
improve automatic word alignment and machine translation quality. This
release does so with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags to
describe these translation units and relations. Tagging adds contextual,
syntactic and language-specific features to the alignment annotation.
The Chinese word alignment tasks consisted of the following components:
- Identifying, aligning, and tagging 8 different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when
  they were a part of a semantic link.
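
As a rough illustration of the annotation described above, the following
Python sketch models an alignment link as a pair of token index groups
plus a link tag, and lists the tokens that no link covers. The class
name, field names, tag label and toy sentence pair are all hypothetical;
they are not taken from the corpus or its annotation guidelines.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class AlignmentLink:
    """One translation unit: source token positions linked to target positions."""
    source: Tuple[int, ...]   # 0-based positions in the Chinese token list
    target: Tuple[int, ...]   # 0-based positions in the English token list
    link_tag: str             # hypothetical label, not the corpus tag inventory


def unmatched_positions(n_tokens: int, links: List[AlignmentLink], side: str) -> List[int]:
    """Return token positions on one side ('source' or 'target') covered by no link."""
    covered: Set[int] = set()
    for link in links:
        covered.update(getattr(link, side))
    return [i for i in range(n_tokens) if i not in covered]


# Toy data invented for illustration only.
chinese = ["我", "的", "书"]
english = ["my", "book"]
links = [
    AlignmentLink(source=(0,), target=(0,), link_tag="EXAMPLE_LINK"),
    AlignmentLink(source=(2,), target=(1,), link_tag="EXAMPLE_LINK"),
]

# Position 1 (的) is covered by no link in this toy alignment.
print(unmatched_positions(len(chinese), links, "source"))
```

In the corpus itself, unmatched words are additionally attached and
tagged according to the annotation guidelines; the sketch only shows how
such words could be located.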
(2) MADCAT Phase 1 Training Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T15>
contains all training data created by LDC to support Phase 1 of the
DARPA MADCAT Program. The data in this release consists of handwritten
Arabic documents scanned at high resolution and annotated for the
physical coordinates of each line and token. Digital transcripts and
English translations of each document are also provided, with the
various content and annotation layers integrated in a single MADCAT XML
output.
The goal of the MADCAT program is to automatically convert foreign text
images into English transcripts. MADCAT Phase 1 data was collected by
LDC from Arabic source documents in three genres: newswire, weblog and
newsgroup text. Arabic-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined, unlined).
Prior to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some original
source documents being broken into multiple "pages" for handwriting.
Each resulting handwritten page was assigned to up to five independent
scribes, using different writing conditions.
The handwritten, transcribed documents were checked for quality and
completeness, then each page was scanned at a high resolution (600 dpi,
greyscale) to create a digital version of the handwritten document. The
scanned images were then annotated to indicate the physical coordinates
of each line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.
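
Because the scans are made at 600 dpi, a bounding box on the image can
be converted to a physical size on the page. The Python sketch below
shows the arithmetic, assuming the box coordinates are given in pixels;
the class, its field names and the sample box are hypothetical and do
not reflect the actual MADCAT XML schema.

```python
from dataclasses import dataclass
from typing import Tuple

DPI = 600            # scan resolution stated in the announcement
MM_PER_INCH = 25.4   # exact by definition of the inch


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical field names)."""
    left: int
    top: int
    right: int
    bottom: int


def pixels_to_mm(pixels: float, dpi: int = DPI) -> float:
    """Convert a pixel distance to millimetres at the given scan resolution."""
    return pixels / dpi * MM_PER_INCH


def box_size_mm(box: BoundingBox, dpi: int = DPI) -> Tuple[float, float]:
    """Return (width, height) of the box in millimetres."""
    return (pixels_to_mm(box.right - box.left, dpi),
            pixels_to_mm(box.bottom - box.top, dpi))


# Sample box invented for illustration: 600 px wide, 150 px tall.
token_box = BoundingBox(left=1200, top=600, right=1800, bottom=750)
width_mm, height_mm = box_size_mm(token_box)
print(f"{width_mm:.2f} mm x {height_mm:.2f} mm")
```

At 600 dpi, 600 pixels correspond to one inch, so the sample box is one
inch (25.4 mm) wide and a quarter inch (6.35 mm) tall.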
The final step was to produce a unified data format that takes multiple
data streams and generates a single XML output file containing all
required information. The resulting XML file has these distinct
components: a text layer that consists of the source text, tokenization
and sentence segmentation; an image layer that consists of bounding
boxes; a scribe demographic layer that consists of scribe ID and
partition (train/test); and a document metadata layer. This release
includes 9,693 annotation files in MADCAT XML format (.madcat.xml) along
with their corresponding scanned image files in TIFF format.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: 1 (215) 573-1275
University of Pennsylvania        Fax: 1 (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu