[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Jun 25 20:55:23 UTC 2014


New publications:

- Abstract Meaning Representation (AMR) Annotation Release 1.0
- ETS Corpus of Non-Native Written English
- GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
- MADCAT Chinese Pilot Training Set

------------------------------------------------------------------------
New publications

(1) Abstract Meaning Representation (AMR) Annotation Release 1.0 
<https://catalog.ldc.upenn.edu/LDC2014T12> was developed by LDC, 
SDL/Language Weaver, Inc. 
<http://www.sdl.com/products/automated-translation/>, the University of 
Colorado's Center for Computational Language and Educational Research 
<http://clear.colorado.edu/start/index.html> and the Information 
Sciences Institute <http://www.isi.edu/home> at the University of 
Southern California. It contains a sembank (semantic treebank) of over 
13,000 English natural language sentences from newswire, weblogs and web 
discussion forums.

AMR captures "who is doing what to whom" in a sentence. Each sentence is 
paired with a graph that represents its whole-sentence meaning in a tree 
structure. AMR uses PropBank frames, non-core semantic roles, 
within-sentence coreference, named entity annotation, modality, 
negation, questions, quantities, and so on to represent the semantic 
structure of a sentence largely independent of its syntax.
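
To illustrate the notation, the canonical example from the AMR 
literature annotates "The boy wants to go" as the graph below. The short 
Python parser is a toy sketch (not part of the release or its tooling) 
that reads the PENMAN-style text into a nested structure; note how the 
variable b is re-used to mark within-sentence coreference:

    import re

    AMR = """
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (g / go-01
                :ARG0 b))
    """

    def parse(tokens):
        """Recursively parse one (var / concept :role value ...) node."""
        assert tokens.pop(0) == "("
        var = tokens.pop(0)
        assert tokens.pop(0) == "/"
        node = {"var": var, "concept": tokens.pop(0), "roles": {}}
        while tokens[0] != ")":
            role = tokens.pop(0)          # e.g. ":ARG0"
            if tokens[0] == "(":
                node["roles"][role] = parse(tokens)   # nested node
            else:
                # a bare variable re-used here marks coreference
                node["roles"][role] = tokens.pop(0)
        tokens.pop(0)                     # consume ")"
        return node

    tokens = re.findall(r"[()/]|[^\s()/]+", AMR)
    graph = parse(tokens)
    print(graph["concept"])               # want-01 (a PropBank frame)
    print(graph["roles"][":ARG0"])        # the boy node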

The source data includes discussion forums collected for the DARPA BOLT 
program, Wall Street Journal and translated Xinhua news texts, various 
newswire data from NIST OpenMT evaluations, and weblog data used in the 
DARPA GALE program.



(2) ETS Corpus of Non-Native Written English 
<https://catalog.ldc.upenn.edu/LDC2014T06> was developed by Educational 
Testing Service <https://www.ets.org/> and comprises 12,100 English 
essays written by native speakers of 11 languages other than English as 
part of an international test of academic English proficiency, TOEFL 
<http://www.ets.org/toefl/ibt/about> (Test of English as a Foreign 
Language). The test includes reading, writing, listening, and speaking 
sections and is delivered by computer in a secure test center. This 
release contains 1,100 essays for each of the 11 native languages, 
sampled from eight topics, with information about the score level 
(low/medium/high) for each essay.

The corpus was developed with the specific task of native language 
identification in mind, but is likely to support tasks and studies in 
the educational domain, including grammatical error detection and 
correction and automatic essay scoring, as well as a broad range of 
research in natural language processing and corpus linguistics. For the 
task of native language identification, the following division is 
recommended: 82% as training data, 9% as development data and 9% as 
test data, split according to the file IDs accompanying the data set.
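
As a hedged sketch of applying that partition, assume a hypothetical 
index.csv whose columns give each essay's file ID, native language, 
score level and partition (the actual file and column names are 
documented in the release):

    import csv
    from collections import defaultdict

    splits = defaultdict(list)  # partition -> list of (file_id, L1, score)

    with open("index.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Hypothetical columns: file_id, native_language,
            # score_level, split
            splits[row["split"]].append(
                (row["file_id"], row["native_language"], row["score_level"])
            )

    for part in ("train", "dev", "test"):
        # With 12,100 essays, expect roughly 9,900 / 1,100 / 1,100 files.
        print(part, len(splits[part]))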

The data is sampled from essays written in 2006 and 2007 by test takers 
whose native languages were Arabic, Chinese, French, German, Hindi, 
Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Original raw 
files for 11,000 of the 12,100 tokenized files are included in this 
release along with prompts (topics) for the essays and metadata about 
the test takers' proficiency level. The data is presented in UTF-8 
formatted text files.



(3) GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 
<https://catalog.ldc.upenn.edu/LDC2014T11> was developed by LDC. Along 
with other corpora, the parallel text in this release comprised training 
data for Phase 2 of the DARPA GALE (Global Autonomous Language 
Exploitation) Program. This corpus contains Chinese source text and 
corresponding English translations selected from broadcast news (BN) 
data collected by LDC between 2005 and 2007 and transcribed by LDC or 
under its direction.

This release includes 30 source-translation document pairs, comprising 
206,737 characters of translated material. Data is drawn from 12 
distinct Chinese BN programs broadcast by China Central TV, a national 
and international broadcaster in Mainland China; New Tang Dynasty TV, a 
broadcaster based in the United States; and Phoenix TV, a Hong 
Kong-based satellite television station. The broadcast news recordings in 
this release focus principally on current events.

The data was transcribed by LDC staff and/or transcription vendors under 
contract to LDC in accordance with Quick Rich Transcription guidelines 
developed by LDC. Transcribers indicated sentence boundaries in addition 
to transcribing the text. Data was manually selected for translation 
according to several criteria, including linguistic features, 
transcription features and topic features. The transcribed and segmented 
files were then reformatted into a human-readable translation format and 
assigned to translation vendors. Translators followed LDC's Chinese to 
English translation guidelines. Bilingual LDC staff performed quality 
control procedures on the completed translations.
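
A minimal sketch of pairing each Chinese source document with its 
English translation by shared document ID; the directory layout, file 
naming and one-segment-per-line format below are assumptions for 
illustration, not the documented release structure:

    from pathlib import Path

    src_dir = Path("data/source")       # hypothetical: Chinese transcripts
    tgt_dir = Path("data/translation")  # hypothetical: English translations

    for src in sorted(src_dir.iterdir()):
        tgt = tgt_dir / src.name        # pair files by shared document name
        if not tgt.exists():
            continue
        zh = src.read_text(encoding="utf-8").splitlines()
        en = tgt.read_text(encoding="utf-8").splitlines()
        # Assumes one translation segment per line in both files; consult
        # the release documentation for the actual file format.
        for zh_seg, en_seg in zip(zh, en):
            print(zh_seg, "=>", en_seg)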



(4) MADCAT (Multilingual Automatic Document Classification Analysis and 
Translation) Chinese Pilot Training Set 
<https://catalog.ldc.upenn.edu/LDC2014T13> contains all training data 
created by LDC to support a Chinese pilot collection in the DARPA MADCAT 
Program. The data in this release consists of handwritten Chinese 
documents, scanned at high resolution and annotated for the physical 
coordinates of each line and token. Digital transcripts and English 
translations of each document are also provided, with the various 
content and annotation layers integrated in a single MADCAT XML output.

The goal of the MADCAT program was to automatically convert foreign text 
images into English transcripts. MADCAT Chinese pilot data was collected 
from Chinese source documents in three genres: newswire, weblog and 
newsgroup text. Chinese-speaking "scribes" copied documents by hand, 
following specific instructions on writing style (fast, normal, 
careful), writing implement (pen, pencil) and paper (lined, unlined). 
Prior to assignment, source documents were processed to optimize their 
appearance for the handwriting task, which resulted in some original 
source documents being broken into multiple "pages" for handwriting. 
Each resulting handwritten page was assigned to up to five independent 
scribes, using different writing conditions.

The handwritten, transcribed documents were next checked for quality and 
completeness, then each page was scanned at a high resolution (600 dpi, 
greyscale) to create a digital version of the handwritten document. The 
scanned images were then annotated to indicate the physical coordinates 
of each line and token. Explicit reading order was also labeled, along 
with any errors produced by the scribes when copying the text.

The final step was to produce a unified data format that takes multiple 
data streams and generates a single MADCAT XML output file which 
contains all required information. The resulting madcat.xml file 
contains distinct components: a text layer that consists of the source 
text, tokenization and sentence segmentation; an image layer that 
consists of bounding boxes; a scribe demographic layer that consists of 
scribe ID and partition (train/test); and a document metadata layer.
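
A hedged sketch of walking those layers in one madcat.xml file with 
Python's standard library; the element and attribute names used here 
("zone", "point", "token") are assumptions for illustration only, as the 
actual schema is documented in the release:

    import xml.etree.ElementTree as ET

    root = ET.parse("example.madcat.xml").getroot()

    # Image layer (hypothetical names): zones with pixel coordinates.
    for zone in root.iter("zone"):
        coords = [(pt.get("x"), pt.get("y")) for pt in zone.iter("point")]
        print(zone.get("type"), coords)   # e.g. a line or token bounding box

    # Text layer (hypothetical names): source tokens in reading order.
    tokens = [tok.text for tok in root.iter("token")]
    print(" ".join(t for t in tokens if t))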

This release includes 22,284 annotation files in both GEDI XML and 
MADCAT XML formats (gedi.xml and madcat.xml) along with their 
corresponding scanned image files in TIFF format. The annotation results 
in the GEDI XML files include ground truth annotations and source 
transcripts.


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
