Arabic-L:LING:LDC MADCAT

Sat Sep 22 07:09:43 UTC 2012

------------------------------------------------------------------------
Arabic-L: Sat 22 Sep 2012
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC MADCAT

-------------------------Messages-----------------------------------
1)
Date: 22 Sep 2012
From:Linguistic Data Consortium ldc at ldc.upenn.edu
Subject:LDC MADCAT

MADCAT Phase 1 Training Set contains all training data created by LDC
to support Phase 1 of the DARPA MADCAT Program. The data in this
release consists of handwritten Arabic documents scanned at high
resolution and annotated for the physical coordinates of each line and
token. Digital transcripts and English translations of each document
are also provided, with the various content and annotation layers
integrated in a single MADCAT XML output.

The goal of the MADCAT program is to automatically convert
foreign-language text images into English transcripts. MADCAT Phase 1
data was collected by LDC from Arabic source documents in three genres:
newswire, weblog and newsgroup text. Arabic-speaking "scribes" copied
documents by hand, following specific instructions on writing style
(fast, normal, careful), writing implement (pen, pencil) and paper
(lined, unlined). Prior to assignment, source documents were processed
to optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple "pages"
for handwriting. Each resulting handwritten page was assigned to up to
five independent scribes, using different writing conditions.

The handwritten, transcribed documents were checked for quality and
completeness, then each page was scanned at a high resolution (600
dpi, greyscale) to create a digital version of the handwritten
document. The scanned images were then annotated to indicate the
physical coordinates of each line and token. Explicit reading order
was also labeled, along with any errors produced by the scribes when
copying the text.

The final step was to produce a unified data format that takes
multiple data streams and generates a single XML output file
containing all required information. The resulting XML file has four
distinct components: a text layer that consists of the source text,
tokenization and sentence segmentation; an image layer that consists of
bounding boxes; a scribe demographic layer that consists of scribe ID
and partition (train/test); and a document metadata layer. This
release includes 9693 annotation files in MADCAT XML format
(.madcat.xml) along with their corresponding scanned image files in
TIFF format.
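
As a rough illustration of how the annotation files and scanned images
might be matched up after unpacking the release, here is a minimal
Python sketch. It assumes that each image shares its base name with
its annotation file and uses a ".tif" extension; neither detail is
stated in the announcement, and the function name and directory
arguments are hypothetical. Only the ".madcat.xml" suffix comes from
the text above.

```python
from pathlib import Path

ANNOTATION_SUFFIX = ".madcat.xml"  # suffix given in the announcement
IMAGE_SUFFIX = ".tif"              # assumption: extension of the TIFF scans


def pair_annotations_with_images(annotation_dir, image_dir):
    """Match each annotation file to an image with the same base name.

    Returns (pairs, missing): a list of (xml_path, tif_path) tuples and
    a list of annotation files for which no image was found.
    """
    annotation_dir, image_dir = Path(annotation_dir), Path(image_dir)
    pairs, missing = [], []
    for xml_path in sorted(annotation_dir.glob("*" + ANNOTATION_SUFFIX)):
        # Strip the full two-part suffix, not just ".xml".
        stem = xml_path.name[: -len(ANNOTATION_SUFFIX)]
        tif_path = image_dir / (stem + IMAGE_SUFFIX)
        if tif_path.is_file():
            pairs.append((xml_path, tif_path))
        else:
            missing.append(xml_path)
    return pairs, missing
```

If the assumed naming holds, running this over the full release should
yield 9693 pairs and an empty "missing" list; anything else suggests
the directory layout or file extensions differ from the assumptions.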

--------------------------------------------------------------------------
End of Arabic-L: 22 Sep 2012