[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Jun 25 20:55:23 UTC 2014
New publications:

- Abstract Meaning Representation (AMR) Annotation Release 1.0
- ETS Corpus of Non-Native Written English
- GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
- MADCAT Chinese Pilot Training Set
------------------------------------------------------------------------
New publications
(1) Abstract Meaning Representation (AMR) Annotation Release 1.0
<https://catalog.ldc.upenn.edu/LDC2014T12> was developed by LDC,
SDL/Language Weaver, Inc.
<http://www.sdl.com/products/automated-translation/>, the University of
Colorado's Center for Computational Language and Educational Research
<http://clear.colorado.edu/start/index.html> and the Information
Sciences Institute <http://www.isi.edu/home> at the University of
Southern California. It contains a sembank (semantic treebank) of over
13,000 English natural language sentences from newswire, weblogs and web
discussion forums.
AMR captures "who is doing what to whom" in a sentence. Each sentence is
paired with a rooted, directed graph that represents its whole-sentence
meaning. AMR utilizes PropBank frames, non-core semantic roles,
within-sentence coreference, named entity annotation, modality,
negation, questions, quantities, and so on to represent the semantic
structure of a sentence largely independent of its syntax.
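As an illustration, AMR graphs are conventionally written as nested variable/concept structures and can equivalently be viewed as relation triples. The sketch below encodes the classic textbook sentence "The boy wants to go" (an illustrative example, not necessarily drawn from this corpus); note how the variable for "boy" is reused as the agent of both frames, which is why AMRs are graphs rather than strict trees.

```python
# A minimal sketch of an AMR graph as relation triples, for the
# sentence "The boy wants to go." Variable names (w, b, g) and the
# PropBank frames follow standard AMR conventions.

# instance triples: variable -> concept
instances = {
    "w": "want-01",
    "b": "boy",
    "g": "go-01",
}

# relation triples: (source variable, role, target variable)
relations = [
    ("w", ":ARG0", "b"),   # the boy is the wanter
    ("w", ":ARG1", "g"),   # the going event is what is wanted
    ("g", ":ARG0", "b"),   # the boy is also the goer (reentrancy)
]

def arguments_of(var):
    """Return the (role, concept) pairs attached to a variable."""
    return [(role, instances[tgt])
            for src, role, tgt in relations if src == var]

print(arguments_of("w"))  # -> [(':ARG0', 'boy'), (':ARG1', 'go-01')]
```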
The source data includes discussion forums collected for the DARPA BOLT
program, Wall Street Journal and translated Xinhua news texts, various
newswire data from NIST OpenMT evaluations and weblog data used in the
DARPA GALE program.
(2) ETS Corpus of Non-Native Written English
<https://catalog.ldc.upenn.edu/LDC2014T06> was developed by Educational
Testing Service <https://www.ets.org/> and comprises 12,100
English essays written by speakers of 11 non-English native languages as
part of an international test of academic English proficiency, TOEFL
<http://www.ets.org/toefl/ibt/about> (Test of English as a Foreign
Language). The test includes reading, writing, listening, and speaking
sections and is delivered by computer in a secure test center. This
release contains 1,100 essays for each of the 11 native languages
sampled from eight topics with information about the score level
(low/medium/high) for each essay.
The corpus was developed with the specific task of native language
identification in mind, but is likely to support tasks and studies in
the educational domain, including grammatical error detection and
correction and automatic essay scoring, in addition to a broad range of
research studies in the fields of natural language processing and corpus
linguistics. For the task of native language identification, the
following division is recommended: 82% as training data, 9% as
development data and 9% as test data, split according to the file IDs
accompanying the data set.
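The partition sizes implied by the recommended split can be worked out directly; the sketch below computes them from the stated percentages (the actual assignment follows the file IDs shipped with the corpus, not these counts alone).

```python
# Partition sizes implied by the recommended 82/9/9 split of the
# 12,100 essays (1,100 per native language).

total = 12_100
train = round(total * 0.82)   # training data
dev   = round(total * 0.09)   # development data
test  = round(total * 0.09)   # test data

assert train + dev + test == total
print(train, dev, test)  # 9922 1089 1089
```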
The data is sampled from essays written in 2006 and 2007 by test takers
whose native languages were Arabic, Chinese, French, German, Hindi,
Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Original raw
files for 11,000 of the 12,100 tokenized files are included in this
release along with prompts (topics) for the essays and metadata about
the test takers' proficiency level. The data is presented in UTF-8
formatted text files.
(3) GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
<https://catalog.ldc.upenn.edu/LDC2014T11> was developed by LDC. Along
with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and
corresponding English translations selected from broadcast news (BN)
data collected by LDC between 2005 and 2007 and transcribed by LDC or
under its direction.
This release includes 30 source-translation document pairs, comprising
206,737 characters of translated material. Data is drawn from 12
distinct Chinese BN programs broadcast by China Central TV, a national
and international broadcaster in Mainland China; New Tang Dynasty TV, a
broadcaster based in the United States; and Phoenix TV, a Hong
Kong-based satellite television station. The broadcast news recordings in
this release focus principally on current events.
The data was transcribed by LDC staff and/or transcription vendors under
contract to LDC in accordance with Quick Rich Transcription guidelines
developed by LDC. Transcribers indicated sentence boundaries in addition
to transcribing the text. Data was manually selected for translation
according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented
files were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Chinese to
English translation guidelines. Bilingual LDC staff performed quality
control procedures on the completed translations.
(4) MADCAT (Multilingual Automatic Document Classification Analysis and
Translation) Chinese Pilot Training Set
<https://catalog.ldc.upenn.edu/LDC2014T13> contains all training data
created by LDC to support a Chinese pilot collection in the DARPA MADCAT
Program. The data in this release consists of handwritten Chinese
documents, scanned at high resolution and annotated for the physical
coordinates of each line and token. Digital transcripts and English
translations of each document are also provided, with the various
content and annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program was to automatically convert foreign text
images into English transcripts. MADCAT Chinese pilot data was collected
from Chinese source documents in three genres: newswire, weblog and
newsgroup text. Chinese speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined, unlined).
Prior to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some original
source documents being broken into multiple "pages" for handwriting.
Each resulting handwritten page was assigned to up to five independent
scribes, using different writing conditions.
The handwritten, transcribed documents were next checked for quality and
completeness, then each page was scanned at a high resolution (600 dpi,
greyscale) to create a digital version of the handwritten document. The
scanned images were then annotated to indicate the physical coordinates
of each line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.
The final step was to produce a unified data format that takes multiple
data streams and generates a single MADCAT XML output file which
contains all required information. The resulting madcat.xml file
contains distinct components: a text layer that consists of the source
text, tokenization and sentence segmentation; an image layer that
consists of bounding boxes; a scribe demographic layer that consists of
scribe ID and partition (train/test); and a document metadata layer.
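A file layered this way can be consumed by joining the text and image layers on a shared segment identifier. The sketch below shows the idea with Python's standard `xml.etree.ElementTree`; the element and attribute names in the sample fragment are hypothetical placeholders, since the release's own DTD defines the actual madcat.xml schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical madcat.xml-style fragment. Element and attribute names
# here are illustrative placeholders, NOT the release's actual schema.
SAMPLE = """
<madcat>
  <source>
    <segment id="s1" text="example line"/>
  </source>
  <image>
    <zone segment="s1" x1="10" y1="20" x2="310" y2="60"/>
  </image>
  <scribe id="c042" partition="train"/>
</madcat>
"""

root = ET.fromstring(SAMPLE)

# join the text layer with its bounding boxes via the shared segment id
boxes = {z.get("segment"): (int(z.get("x1")), int(z.get("y1")),
                            int(z.get("x2")), int(z.get("y2")))
         for z in root.iter("zone")}
for seg in root.iter("segment"):
    print(seg.get("id"), boxes.get(seg.get("id")))
```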
This release includes 22,284 annotation files in both GEDI XML and
MADCAT XML formats (gedi.xml and madcat.xml) along with their
corresponding scanned image files in TIFF format. The annotation results
in GEDI XML files include ground truth annotations and source transcripts.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu