<html>

  <head>

    <meta http-equiv="content-type" content="text/html;

      charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <p class="MsoNormal" align="center"><i>New publications</i></p>

    <p class="MsoNormal" align="center"><b>-  <a href="#gale">GALE

          Chinese-English Word Alignment and Tagging Training Part 1 --

          Newswire and Web</a></b><b>  -</b></p>

    <p class="MsoNormal" align="center"><b>-  </b><a href="#madcat"><b>MADCAT

          Phase 1 Training Set</b></a>  <b>-</b></p>

    <hr size="2" width="100%"><br>

    <p class="MsoNormal" align="center"><b>New Publications<br>

      </b></p>

    <p class="MsoNormal"><a name="gale"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T16">GALE

Chinese-English

        Word Alignment and Tagging Training Part 1 -- Newswire and Web</a>

      was developed by LDC and contains 150,068 tokens of word-aligned

      Chinese and English parallel text enriched with linguistic tags.

      This material was used as training data in the <a

        href="http://projects.ldc.upenn.edu/gale/index.html">DARPA GALE</a>

      (Global Autonomous Language Exploitation) program. This release
      consists of Chinese source newswire and web data (newsgroup,
      weblog) collected by LDC in 2008.</p>

    <p class="MsoNormal">Some approaches to statistical machine
      translation incorporate linguistic knowledge into word-aligned
      text to improve automatic word alignment and machine translation
      quality. This is accomplished with two annotation schemes:
      alignment and tagging. Alignment identifies minimum translation
      units and translation relations by using minimum-match and
      attachment annotation approaches. The tagging scheme defines a
      set of word tags and alignment link tags to describe these
      translation units and relations. Tagging adds contextual,
      syntactic and language-specific features to the alignment
      annotation.</p>

    <p class="MsoNormal">The Chinese word alignment tasks consisted of
      the following components:</p>
    <p class="MsoNormal">- Identifying, aligning, and tagging 8 different
      types of links</p>
    <p class="MsoNormal">- Identifying, attaching, and tagging
      local-level unmatched words</p>
    <p class="MsoNormal">- Identifying and tagging
      sentence/discourse-level unmatched words</p>
    <p class="MsoNormal">- Identifying and tagging all instances of
      Chinese <span style="font-family:'MS Gothic';mso-bidi-font-family:'MS Gothic'">的</span>
      (DE) except when they were part of a semantic link.</p>

    <div align="center">*</div>

    <p class="MsoNormal"><a name="madcat"></a>(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T15">MADCAT

        Phase 1 Training Set</a> contains all training data created by

      LDC to support Phase 1 of the DARPA MADCAT Program. The data in

      this release consists of handwritten Arabic documents scanned at

      high resolution and annotated for the physical coordinates of each

      line and token. Digital transcripts and English translations of

      each document are also provided, with the various content and

      annotation layers integrated in a single MADCAT XML output.</p>

    <p class="MsoNormal">The goal of the MADCAT program is to
      automatically convert foreign text images into English
      transcripts. MADCAT Phase 1 data was collected by LDC from Arabic
      source documents in three genres: newswire, weblog and newsgroup
      text. Arabic-speaking "scribes" copied documents by hand,
      following specific instructions on writing style (fast, normal,
      careful), writing implement (pen, pencil) and paper (lined,
      unlined). Prior to assignment, source documents were processed to
      optimize their appearance for the handwriting task, which resulted
      in some original source documents being broken into multiple
      "pages" for handwriting. Each resulting page was assigned to up
      to five independent scribes, using different writing
      conditions.</p>

    <p class="MsoNormal">The handwritten, transcribed documents were
      checked for quality and completeness, then each page was scanned
      at a high resolution (600 dpi, greyscale) to create a digital
      version of the handwritten document. The scanned images were then
      annotated to indicate the physical coordinates of each line and
      token. Explicit reading order was also labeled, along with any
      errors produced by the scribes when copying the text.</p>

    <p class="MsoNormal">The final step was to produce a unified data
      format that takes multiple data streams and generates a single XML
      output file containing all required information. The resulting
      XML file has these distinct components: a text layer that
      consists of the source text, tokenization and sentence
      segmentation; an image layer that consists of bounding boxes; a
      scribe demographic layer that consists of scribe ID and partition
      (train/test); and a document metadata layer. This release includes
      9,693 annotation files in MADCAT XML format (.madcat.xml) along
      with their corresponding scanned image files in TIFF format.</p>


    <hr size="2" width="100%">

    <pre class="moz-signature" cols="72">-- 

--

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>