<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<span style="mso-bidi-font-weight:normal"><i>New publications:</i></span><b
style="mso-bidi-font-weight:normal"><br>
</b>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#amr">Abstract Meaning Representation (AMR) Annotation
Release 1.0</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#ets">ETS Corpus of Non-Native Written English</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#gale">GALE Phase 2 Chinese Broadcast News Parallel Text
Part 2</a> -<br>
</b></p>
<p class="MsoNormal"><b style="mso-bidi-font-weight:normal">- <a
href="#mad">MADCAT Chinese Pilot Training Set</a> -</b></p>
<hr size="2" width="100%"><b style="mso-bidi-font-weight:normal">New
publications</b><o:p></o:p>
<p class="MsoNormal"><a name="amr"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014T12">Abstract Meaning
Representation (AMR) Annotation Release 1.0</a> was developed by
LDC, <a href="http://www.sdl.com/products/automated-translation/">SDL/Language
Weaver, Inc.</a>, the University of Colorado's <a
href="http://clear.colorado.edu/start/index.html">Center for
Computational Language and Educational Research</a> <span
style="mso-spacerun:yes"> </span>and the <a
href="http://www.isi.edu/home">Information Sciences Institute</a>
at the University of Southern California. It contains a sembank
(semantic treebank) of over 13,000 English natural language
sentences from newswire, weblogs and web discussion forums.<o:p></o:p></p>
<p class="MsoNormal">AMR captures “who is doing what to whom” in a
sentence. Each sentence is paired with a graph that represents its
whole-sentence meaning in a tree-like structure. AMR uses PropBank
frames, non-core semantic roles, within-sentence coreference,
named entity annotation, modality, negation, questions,
quantities, and so on to represent the semantic structure of a
sentence largely independent of its syntax.<o:p></o:p></p>
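<p class="MsoNormal">As a rough illustration (the sentence and variable
names below are a standard textbook example, not drawn from this
release), AMRs are conventionally written in PENMAN notation, and the
role labels that carry the "who is doing what to whom" structure can be
pulled out with a few lines of Python:</p>
<pre>
# Illustrative sketch only: the AMR below is the common textbook example
# "The boy wants to go", not a sentence from the corpus.
import re

amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
"""

# Each ":ROLE" token introduces one labelled edge of the graph.
roles = re.findall(r":([A-Za-z0-9-]+)", amr)
print(roles)   # ['ARG0', 'ARG1', 'ARG0']
</pre>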
<p class="MsoNormal">The source data includes discussion forums
collected for the DARPA BOLT program, Wall Street Journal and
translated Xinhua news texts, various newswire data from NIST
OpenMT evaluations and weblog data used in the DARPA GALE program.
<o:p></o:p></p>
<br>
<p class="MsoNormal" style="text-align:center" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="ets"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T06">ETS Corpus of
Non-Native Written English</a> was developed by <a
href="https://www.ets.org/">Educational Testing Service</a> and
comprises 12,100 English essays written by speakers of 11
non-English native languages as part of an international test of
academic English proficiency, <a
href="http://www.ets.org/toefl/ibt/about">TOEFL</a> (Test of
English as a Foreign Language). The test includes reading,
writing, listening, and speaking sections and is delivered by
computer in a secure test center. This release contains 1,100
essays for each of the 11 native languages sampled from eight
topics with information about the score level (low/medium/high)
for each essay.<o:p></o:p></p>
<p class="MsoNormal">The corpus was developed with the specific task
of native language identification in mind, but is likely to
support tasks and studies in the educational domain, including
grammatical error detection and correction and automatic essay
scoring, in addition to a broad range of research studies in the
fields of natural language processing and corpus linguistics. For
the task of native language identification, the following division
is recommended: 82% as training data, 9% as development data and
9% as test data, split according to the file IDs accompanying the
data set.<o:p></o:p></p>
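<p class="MsoNormal">As a sketch only (the index file name and column
labels below are hypothetical placeholders, not the corpus's actual
layout), applying the recommended split amounts to grouping essay file
IDs by the partition label the accompanying metadata assigns:</p>
<pre>
# Hypothetical sketch: assumes an index CSV with columns "file_id" and
# "partition"; the real release documents its own file-ID lists.
import csv
from collections import defaultdict

def load_partitions(index_csv):
    """Map partition name (train/dev/test) to a list of essay file IDs."""
    parts = defaultdict(list)
    with open(index_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            parts[row["partition"]].append(row["file_id"])
    return parts

# Usage (path is illustrative):
# parts = load_partitions("index.csv")
# print({name: len(ids) for name, ids in parts.items()})
</pre>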
<p class="MsoNormal">The data is sampled from essays written in 2006
and 2007 by test takers whose native languages were Arabic,
Chinese, French, German, Hindi, Italian, Japanese, Korean,
Spanish, Telugu, and Turkish. Original raw files for 11,000 of the
12,100 tokenized files are included in this release along with
prompts (topics) for the essays and metadata about the test
takers’ proficiency level. The data is presented in UTF-8
formatted text files.<o:p></o:p></p>
<br>
<div align="center">*<o:p></o:p></div>
<p class="MsoNormal"><a name="gale"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014T11">GALE Phase 2
Chinese Broadcast News Parallel Text Part 2</a> was developed
by LDC. Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source text and corresponding English
translations selected from broadcast news (BN) data collected by
LDC between 2005 and 2007 and transcribed by LDC or under its
direction.<o:p></o:p></p>
<p class="MsoNormal">This release includes 30 source-translation
document pairs, comprising 206,737 characters of translated
material. Data is drawn from 12 distinct Chinese BN programs
broadcast by China Central TV, a national and international
broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster
based in the United States; and Phoenix TV, a Hong Kong-based
satellite television station. The broadcast news recordings in
this release focus principally on current events.<o:p></o:p></p>
<p class="MsoNormal">The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.<o:p></o:p></p>
<br>
<p class="MsoNormal" style="text-align:center" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="mad"></a>(4) <a
href="https://catalog.ldc.upenn.edu/LDC2014T13">MADCAT
(Multilingual Automatic Document Classification Analysis and
Translation) Chinese Pilot Training Set</a> contains all
training data created by LDC to support a Chinese pilot collection
in the DARPA MADCAT Program. The data in this release consists of
handwritten Chinese documents, scanned at high resolution and
annotated for the physical coordinates of each line and token.
Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers
integrated in a single MADCAT XML output.<o:p></o:p></p>
<p class="MsoNormal">The goal of the MADCAT program was to
automatically convert foreign text images into English
transcripts. MADCAT Chinese pilot data was collected from Chinese
source documents in three genres: newswire, weblog and newsgroup
text. Chinese-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to
optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple
"pages" for handwriting. Each resulting handwritten page was
assigned to up to five independent scribes, using different
writing conditions.<o:p></o:p></p>
<p class="MsoNormal">The handwritten, transcribed documents were
next checked for quality and completeness, then each page was
scanned at a high resolution (600 dpi, greyscale) to create a
digital version of the handwritten document. The scanned images
were then annotated to indicate the physical coordinates of each
line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.<o:p></o:p></p>
<p class="MsoNormal">The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file which contains all required information.
The resulting madcat.xml file contains distinct components: a text
layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a
scribe demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer.<o:p></o:p></p>
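<p class="MsoNormal">As a minimal, schema-agnostic sketch (no element
names from the released DTD are assumed here), the layers present in a
madcat.xml file can be surveyed by tallying its element tags with the
Python standard library:</p>
<pre>
# Sketch only: tallies element tags in one .madcat.xml file without
# assuming any particular schema; pass a file path on the command line.
import sys
from collections import Counter
import xml.etree.ElementTree as ET

def tag_histogram(path):
    """Return a Counter of element tags found in the XML file at path."""
    root = ET.parse(path).getroot()
    return Counter(elem.tag for elem in root.iter())

if __name__ == "__main__":
    for tag, n in tag_histogram(sys.argv[1]).most_common():
        print(n, tag)
</pre>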
<p class="MsoNormal">This release includes 22,284 annotation files
in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml)
along with their corresponding scanned image files in TIFF format.
The annotation results in GEDI XML files include ground truth
annotations and source transcripts.<o:p></o:p></p>
<br>
<hr size="2" width="100%"> <br>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>