<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<span style="mso-bidi-font-weight:normal"><i>New publications:</i></span><b
style="mso-bidi-font-weight:normal"><br>
</b>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#amr">Abstract Meaning Representation (AMR) Annotation
Release 1.0</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#ets">ETS Corpus of Non-Native Written English</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#gale">GALE Phase 2 Chinese Broadcast News Parallel Text
Part 2</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#mad">MADCAT Chinese Pilot Training Set</a> -</b></p>
<hr size="2" width="100%"><b style="mso-bidi-font-weight:normal">New
publications</b><o:p></o:p>
<p class="MsoNormal"><a name="amr"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014T12">Abstract Meaning
Representation (AMR) Annotation Release 1.0</a> was developed by
LDC, <a href="http://www.sdl.com/products/automated-translation/">SDL/Language
Weaver, Inc.</a>, the University of Colorado's <a
href="http://clear.colorado.edu/start/index.html">Center for
Computational Language and Educational Research</a> <span
style="mso-spacerun:yes"> </span>and the <a
href="http://www.isi.edu/home">Information Sciences Institute</a>
at the University of Southern California. It contains a sembank
(semantic treebank) of over 13,000 English natural language
sentences from newswire, weblogs and web discussion forums.<o:p></o:p></p>
<p class="MsoNormal">AMR captures “who is doing what to whom” in a
sentence. Each sentence is paired with a graph that represents its
whole-sentence meaning in a tree-like structure. AMR uses PropBank
frames, non-core semantic roles, within-sentence coreference,
named entity annotation, modality, negation, questions,
quantities, and so on to represent the semantic structure of a
sentence largely independent of its syntax.<o:p></o:p></p>
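<p class="MsoNormal">As a rough illustration (the sentence and variable
names below are a standard textbook example, not drawn from this
release), AMRs are conventionally written in PENMAN notation, and the
role labels that carry the "who is doing what to whom" structure can be
pulled out with a few lines of Python:</p>
<pre>
# Illustrative sketch only: the AMR below is the common textbook example
# "The boy wants to go", not a sentence from the corpus.
import re

amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
"""

# Each ":ROLE" token introduces one labelled edge of the graph.
roles = re.findall(r":([A-Za-z0-9-]+)", amr)
print(roles)   # ['ARG0', 'ARG1', 'ARG0']
</pre>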
<p class="MsoNormal">The source data includes discussion forums
collected for the DARPA BOLT program, Wall Street Journal and
translated Xinhua news texts, various newswire data from NIST
OpenMT evaluations and weblog data used in the DARPA GALE program.
<o:p></o:p></p>
<br>
<p class="MsoNormal" style="text-align:center" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="ets"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T06">ETS Corpus of
Non-Native Written English</a> was developed by <a
href="https://www.ets.org/">Educational Testing Service</a> and
comprises 12,100 English essays written by speakers of 11
non-English native languages as part of an international test of
academic English proficiency, <a
href="http://www.ets.org/toefl/ibt/about">TOEFL</a> (Test of
English as a Foreign Language). The test includes reading,
writing, listening, and speaking sections and is delivered by
computer in a secure test center. This release contains 1,100
essays for each of the 11 native languages sampled from eight
topics with information about the score level (low/medium/high)
for each essay.<o:p></o:p></p>
<p class="MsoNormal">The corpus was developed with the specific task
of native language identification in mind, but is likely to
support tasks and studies in the educational domain, including
grammatical error detection and correction and automatic essay
scoring, in addition to a broad range of research studies in the
fields of natural language processing and corpus linguistics. For
the task of native language identification, the following division
is recommended: 82% as training data, 9% as development data and
9% as test data, split according to the file IDs accompanying the
data set.<o:p></o:p></p>
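<p class="MsoNormal">As a sketch only (the index file name and column
labels below are hypothetical placeholders, not the corpus's actual
layout), applying the recommended split amounts to grouping essay file
IDs by the partition label the accompanying metadata assigns:</p>
<pre>
# Hypothetical sketch: assumes an index CSV with columns "file_id" and
# "partition"; the real release documents its own file-ID lists.
import csv
from collections import defaultdict

def load_partitions(index_csv):
    """Map partition name (train/dev/test) to a list of essay file IDs."""
    parts = defaultdict(list)
    with open(index_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            parts[row["partition"]].append(row["file_id"])
    return parts

# Usage (path is illustrative):
# parts = load_partitions("index.csv")
# print({name: len(ids) for name, ids in parts.items()})
</pre>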
<p class="MsoNormal">The data is sampled from essays written in 2006
and 2007 by test takers whose native languages were Arabic,
Chinese, French, German, Hindi, Italian, Japanese, Korean,
Spanish, Telugu, and Turkish. Original raw files for 11,000 of the
12,100 tokenized files are included in this release along with
prompts (topics) for the essays and metadata about the test
takers’ proficiency level. The data is presented in UTF-8
formatted text files.<o:p></o:p></p>
<br>
<div align="center">*<o:p></o:p></div>
<p class="MsoNormal"><a name="gale"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014T11">GALE Phase 2
Chinese Broadcast News Parallel Text Part 2</a> was developed
by LDC. Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source text and corresponding English
translations selected from broadcast news (BN) data collected by
LDC between 2005 and 2007 and transcribed by LDC or under its
direction.<o:p></o:p></p>
<p class="MsoNormal">This release includes 30 source-translation
document pairs, comprising 206,737 characters of translated
material. Data is drawn from 12 distinct Chinese BN programs
broadcast by China Central TV, a national and international
broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster
based in the United States; and Phoenix TV, a Hong Kong-based
satellite television station. The broadcast news recordings in
this release focus principally on current events.<o:p></o:p></p>
<p class="MsoNormal">The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.<o:p></o:p></p>
<br>
<p class="MsoNormal" style="text-align:center" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="mad"></a>(4) <a
href="https://catalog.ldc.upenn.edu/LDC2014T13">MADCAT
(Multilingual Automatic Document Classification Analysis and
Translation) Chinese Pilot Training Set</a> contains all
training data created by LDC to support a Chinese pilot collection
in the DARPA MADCAT Program. The data in this release consists of
handwritten Chinese documents, scanned at high resolution and
annotated for the physical coordinates of each line and token.
Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers
integrated in a single MADCAT XML output.<o:p></o:p></p>
<p class="MsoNormal">The goal of the MADCAT program was to
automatically convert foreign text images into English
transcripts. MADCAT Chinese pilot data was collected from Chinese
source documents in three genres: newswire, weblog and newsgroup
text. Chinese-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to
optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple
"pages" for handwriting. Each resulting handwritten page was
assigned to up to five independent scribes, using different
writing conditions.<o:p></o:p></p>
<p class="MsoNormal">The handwritten, transcribed documents were
next checked for quality and completeness, then each page was
scanned at a high resolution (600 dpi, greyscale) to create a
digital version of the handwritten document. The scanned images
were then annotated to indicate the physical coordinates of each
line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.<o:p></o:p></p>
<p class="MsoNormal">The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file which contains all required information.
The resulting madcat.xml file contains distinct components: a text
layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a
scribe demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer.<o:p></o:p></p>
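<p class="MsoNormal">As a minimal, schema-agnostic sketch (no element
names from the released DTD are assumed here), the layers present in a
madcat.xml file can be surveyed by tallying its element tags with the
Python standard library:</p>
<pre>
# Sketch only: tallies element tags in one .madcat.xml file without
# assuming any particular schema; pass a file path on the command line.
import sys
from collections import Counter
import xml.etree.ElementTree as ET

def tag_histogram(path):
    """Return a Counter of element tags found in the XML file at path."""
    root = ET.parse(path).getroot()
    return Counter(elem.tag for elem in root.iter())

if __name__ == "__main__":
    for tag, n in tag_histogram(sys.argv[1]).most_common():
        print(n, tag)
</pre>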
<p class="MsoNormal">This release includes 22,284 annotation files
in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml)
along with their corresponding scanned image files in TIFF format.
The annotation results in GEDI XML files include ground truth
annotations and source transcripts.<o:p></o:p></p>
<br>
<hr size="2" width="100%"> <br>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>