<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal" align="center"><b><b></b></b><i>N</i><i>ew</i><i>
publication</i><i>s</i><i> </i> </p>
<p class="MsoNormal" align="center"><b>- </b><b><a href="#gale">GALE
Arabic-English Parallel Aligned Treebank -- Newswire</a></b><b>
-</b><b><br>
</b></p>
<p class="MsoNormal" align="center"><b>- </b><b><a href="#madcat">MADCAT
Phase 2 Training Set</a></b><b> -</b><b></b></p>
<hr size="2" width="100%">
<div align="center"><b>New publications</b><br>
</div>
<p class="MsoNormal"> <a name="gale"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T10">GALE
Arabic-English Parallel Aligned Treebank -- Newswire</a>
(LDC2013T10) was developed by LDC and contains 267,520 tokens of
word-aligned Arabic and English parallel text with treebank
annotations. This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program. Parallel
aligned treebanks are treebanks annotated with morphological and
syntactic structures aligned at the sentence level and the
sub-sentence level. Such data sets are useful for natural language
processing and related fields, including automatic word alignment
system training and evaluation, transfer-rule extraction, word
sense disambiguation, translation lexicon extraction and cultural
heritage and cross-linguistic studies. With respect to machine
translation system development, parallel aligned treebanks may
improve system performance with enhanced syntactic parsers, better
rules and knowledge about language pairs and reduced word error
rate.<o:p></o:p></p>
<p class="MsoNormal">In this release, the source Arabic data was
translated into English. Arabic and English treebank annotations
were performed independently. The parallel texts were then word
aligned. The material in this corpus corresponds to the Arabic
treebanked data appearing in Arabic Treebank: Part 3 v 3.2 (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T08">LDC2010T08</a>)
(ATB) and to the English treebanked data in English Translation
Treebank: An-Nahar Newswire (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T02">LDC2012T02</a>).<o:p></o:p></p>
<p class="MsoNormal">The source data consists of Arabic newswire
from the Lebanese publication An Nahar collected by LDC in 2002.
All data is encoded as UTF-8. A count of files, words, tokens and
segments is below.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Tokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">364<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">182,351<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">267,520<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">7,711<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><br>
Note: Word count is based on the untokenized Arabic source and
token count is based on the ATB-tokenized Arabic source.<o:p></o:p></p>
<p class="MsoNormal">The purpose of the GALE word alignment task was
to find correspondences between words, phrases or groups of words
in a set of parallel texts. Arabic-English word alignment
annotation consisted of the following tasks:<o:p></o:p></p>
<ul>
<li>Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)</li>
<li>Identifying sentence segments not suitable for annotation,
e.g., blank segments, incorrectly-segmented segments, segments
with foreign languages</li>
<li>Tagging unmatched words attached to other words or phrases</li>
</ul>
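<p class="MsoNormal">As a rough illustration only (the release
documentation defines the actual alignment file format), the
following Python sketch models a word-aligned segment carrying the
link types described above. The class names, field names and the
romanized tokens are assumptions made for this sketch, not the
corpus schema.</p>
<pre>
# Illustrative only: a minimal in-memory model of word-aligned parallel
# segments. The class names, field names and link-type labels below are
# assumptions for this sketch, not the corpus's actual file schema.
from dataclasses import dataclass, field
from typing import List, Tuple

LINK_TYPES = {
    "translated-correct",
    "translated-incorrect",
    "not-translated-correct",
    "not-translated-incorrect",
}

@dataclass
class WordLink:
    source_indices: Tuple[int, ...]   # Arabic token positions in the segment
    target_indices: Tuple[int, ...]   # English token positions in the segment
    link_type: str                    # one of LINK_TYPES

@dataclass
class AlignedSegment:
    source_tokens: List[str]          # ATB-tokenized Arabic (romanized here)
    target_tokens: List[str]          # English translation tokens
    links: List[WordLink] = field(default_factory=list)
    not_suitable: bool = False        # e.g. blank or foreign-language segment

    def add_link(self, src, tgt, link_type):
        if link_type not in LINK_TYPES:
            raise ValueError("unknown link type: " + link_type)
        self.links.append(WordLink(tuple(src), tuple(tgt), link_type))

# Toy usage with made-up romanized tokens: one single-word link and one
# many-to-many link covering a clitic split by ATB tokenization.
seg = AlignedSegment(["ktb", "Al+", "rjl"], ["the", "man", "wrote"])
seg.add_link([0], [2], "translated-correct")        # ktb aligns to "wrote"
seg.add_link([1, 2], [0, 1], "translated-correct")  # Al+ rjl aligns to "the man"
</pre>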
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="madcat"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T09">MADCAT
Phase
2
Training Set</a> (LDC2013T09) contains all training data created
by LDC to support Phase 2 of the DARPA MADCAT (Multilingual
Automatic Document Classification Analysis and
Translation)Program. The data in this release consists of
handwritten Arabic documents, scanned at high resolution and
annotated for the physical coordinates of each line and token.
Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers
integrated in a single MADCAT XML output. <o:p></o:p></p>
<p class="MsoNormal">The goal of the MADCAT program is to
automatically convert foreign text images into English
transcripts. MADCAT Phase 2 data was collected from Arabic source
documents in three genres: newswire, weblog and newsgroup text.
Arabic-speaking scribes copied documents by hand, following
specific instructions on writing style (fast, normal, careful),
writing implement (pen, pencil) and paper (lined, unlined). Prior
to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some
original source documents being broken into multiple pages for
handwriting. Each resulting handwritten page was assigned to up to
five independent scribes, using different writing conditions. <o:p></o:p></p>
<p class="MsoNormal">The handwritten, transcribed documents were
checked for quality and completeness, then each page was scanned
at a high resolution (600 dpi, greyscale) to create a digital
version of the handwritten document. The scanned images were then
annotated to indicate the physical coordinates of each line and
token. Explicit reading order was also labeled, along with any
errors produced by the scribes when copying the text. The
annotation results in GEDI XML output files (gedi.xml), which
include ground truth annotations and source transcripts.<o:p></o:p></p>
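<p class="MsoNormal">As one possible way to consume such line and
token coordinates (a sketch only, in Python), the fragment below
overlays zone bounding boxes from a gedi.xml file onto the
corresponding scanned page. The element and attribute names used
(DL_ZONE, col, row, width, height) are assumptions and should be
verified against the gedi.xml files and documentation in the
release.</p>
<pre>
# Sketch: overlay bounding boxes from a GEDI XML file onto its scanned
# TIFF page to eyeball the ground-truth coordinates. The element and
# attribute names (DL_ZONE, col, row, width, height) are assumptions;
# verify them against the gedi.xml files in the release.
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

def overlay_zones(gedi_path, tiff_path, out_path):
    page = Image.open(tiff_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    root = ET.parse(gedi_path).getroot()
    drawn = 0
    for zone in root.iter("DL_ZONE"):          # assumed zone element name
        try:
            x = int(zone.get("col"))           # assumed attribute names
            y = int(zone.get("row"))
            w = int(zone.get("width"))
            h = int(zone.get("height"))
        except (TypeError, ValueError):
            continue                           # zone without a simple box
        draw.rectangle([x, y, x + w, y + h], outline=(255, 0, 0), width=3)
        drawn += 1
    page.save(out_path)
    return drawn

# Example call with placeholder file names:
# overlay_zones("page_0001.gedi.xml", "page_0001.tif", "page_0001_boxes.png")
</pre>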
<p class="MsoNormal">The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file with all required information. The
resulting madcat.xml file has these distinct components: (1) a
text layer that consists of the source text, tokenization and
sentence segmentation, (2) an
image layer that consists of bounding boxes, (3) a scribe
demographic layer that consists of scribe ID and partition
(train/test) and (4) a document metadata layer. <o:p></o:p></p>
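<p class="MsoNormal">The sketch below illustrates, in Python, how
the four layers of such a file might be pulled apart. Every element
name used (text_layer, segment, token, image_layer, zone, writer,
metadata) is a placeholder assumption for illustration; the actual
madcat.xml element names are defined in the corpus
documentation.</p>
<pre>
# Sketch: pull a few fields out of the four layers of a MADCAT XML file.
# Every tag name used here (text_layer, segment, token, image_layer, zone,
# writer, metadata) is a placeholder for illustration; substitute the
# element names defined in the corpus documentation.
import xml.etree.ElementTree as ET

def summarize_madcat(path):
    root = ET.parse(path).getroot()
    return {
        # (1) text layer: source text, tokenization, sentence segmentation
        "segments": len(root.findall(".//text_layer/segment")),
        "tokens": len(root.findall(".//text_layer/segment/token")),
        # (2) image layer: bounding boxes for lines and tokens
        "zones": len(root.findall(".//image_layer/zone")),
        # (3) scribe demographic layer: scribe ID and train/test partition
        "scribe_id": root.findtext(".//writer/id"),
        "partition": root.findtext(".//writer/partition"),
        # (4) document metadata layer
        "source_document": root.findtext(".//metadata/source_document"),
    }

# Example call with a placeholder path:
# print(summarize_madcat("page_0001.madcat.xml"))
</pre>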
<p class="MsoNormal">This release includes 27,814 annotation files
in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml)
along with their corresponding scanned image files in TIFF format.<o:p></o:p></p>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>