[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Sep 25 16:11:54 UTC 2014


New publications:

- ACE 2007 Multilingual Training Corpus
- GALE Arabic-English Word Alignment -- Broadcast Training Part 1
- GALE Phase 2 Chinese Newswire Parallel Text Part 2

------------------------------------------------------------------------

*New publications*

(1) ACE 2007 Multilingual Training Corpus 
<https://catalog.ldc.upenn.edu/LDC2014T18> was developed by LDC and 
contains the complete set of Arabic and Spanish training data for the 
2007 Automatic Content Extraction 
<http://www.itl.nist.gov/iad/mig/tests/ace/2007/> (ACE) technology 
evaluation, specifically, Arabic and Spanish newswire data and Arabic 
weblogs annotated for entities and temporal expressions. The objective 
of the ACE program was to develop automatic content extraction 
technology to support automatic processing of human language in text 
form from a variety of sources including newswire, broadcast programming 
and weblogs. In the 2007 evaluation, participants were tested on system 
performance for the recognition of entities, values, temporal 
expressions, relations, and events in Chinese and English and for the 
recognition of entities and temporal expressions in Arabic and Spanish. 
LDC's work in the ACE program is described in more detail on the LDC ACE 
project <https://www.ldc.upenn.edu/collaborations/past-projects/ace> pages.

The Arabic data is composed of newswire (60%) published in October 
2000-December 2000 and weblogs (40%) published during the period 
November 2004-February 2005. The Spanish data set consists entirely of 
newswire material from multiple sources published in January 2005-April 
2005. A document pool was established for each language based on genre 
and epoch requirements. Humans reviewed the pool to select individual 
documents suitable for ACE annotation, such as documents that were 
representative of their genre and contained targeted ACE entity types. 
One annotator completed the entity and temporal expression (TIMEX2) 
markup in the first pass annotation. This work was reviewed in the 
second pass by a senior annotator. TIMEX2 values were normalized by an 
annotator specifically trained for that task.

The table below describes the amount of data included in the current 
release and its annotation status. Corpus content for each language and 
data type is represented in the three stages of annotation: first pass 
annotation (1P), second pass annotation (2P) and TIMEX2 normalization 
and additional quality control (NORM).

Arabic
                   Words                       Files
           1P       2P       NORM       1P     2P     NORM
NW         58,015   58,015   58,015     257    257    257
WL         40,338   40,338   40,338     121    121    121
Total      98,353   98,353   98,353     378    378    378

Spanish
                   Words                       Files
           1P       2P       NORM       1P     2P     NORM
NW         100,401  100,401  100,401    352    352    352
Total      100,401  100,401  100,401    352    352    352

For a given document, there is a source .sgm file together with the 
.ag.xml and .apf.xml annotation files in each of the three directories 
"1p", "2p" and "timex2norm". In other words, for each newswire story or 
weblog entry, the three annotation directories each contain an identical 
copy of the source text (SGML .sgm file) along with distinct versions of 
the associated annotations (XML .ag.xml and .apf.xml files and plain text 
.tab files). All files are presented in UTF-8.
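As a rough illustration of the layout described above, the following 
Python sketch collects the files belonging to one document across the 
three annotation directories. The directory names and file suffixes come 
from the description above; the function name, the idea that the three 
directories sit directly under a single per-language root, and the 
assumption that each file is named as the document ID plus a suffix are 
hypothetical and should be checked against the corpus documentation.

```python
from pathlib import Path

# Directory names and file suffixes as given in the announcement.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def files_for_document(lang_root, doc_id):
    """Return {stage: {suffix: Path}} for the files of one document.

    Assumption (not confirmed by the announcement): the stage directories
    sit directly under lang_root and every file is named doc_id + suffix.
    Only files that actually exist on disk are included.
    """
    root = Path(lang_root)
    found = {}
    for stage in STAGES:
        per_stage = {}
        for suffix in SUFFIXES:
            candidate = root / stage / (doc_id + suffix)
            if candidate.is_file():
                per_stage[suffix] = candidate
        found[stage] = per_stage
    return found
```

Comparing the .apf.xml paths returned for "1p", "2p" and "timex2norm" 
would be one way to see how the annotation of a single document changed 
between passes.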



(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 1 
<https://catalog.ldc.upenn.edu/LDC2014T19> was developed by LDC and 
contains 267,257 tokens of word aligned Arabic and English parallel text 
enriched with linguistic tags. This material was used as training data 
in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the 
incorporation of linguistic knowledge in word aligned text as a means to 
improve automatic word alignment and machine translation quality. This 
is accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. A set of 
word tags and alignment link tags are designed in the tagging scheme to 
describe these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast 
conversation data collected by LDC from 2007 to 2009. The distribution by 
genre, words, tokens and segments appears below:

Language   Genre   Files   Words     Tokens    Segments
Arabic     BC      231     79,485    103,816   4,114
Arabic     BN      92      131,789   163,441   7,227
Totals             323     211,274   267,257   11,341

Note that word count is based on the untokenized Arabic source, and 
token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:

  * Normalizing tokenized tokens as needed
  * Identifying different types of links
  * Identifying sentence segments not suitable for annotation
  * Tagging unmatched words attached to other words or phrases



(3) GALE Phase 2 Chinese Newswire Parallel Text Part 2 
<https://catalog.ldc.upenn.edu/LDC2014T20> was developed by LDC. Along 
with other corpora, the parallel text in this release comprised training 
data for Phase 2 of the DARPA GALE (Global Autonomous Language 
Exploitation) Program. This corpus contains 117,895 tokens of Chinese 
source text and corresponding English translations selected from 
newswire data collected by LDC in 2007 and translated by LDC or under 
its direction.

This release includes 177 source-translation document pairs, comprising 
117,895 tokens of translated data. Data is drawn from four distinct 
Chinese newswire sources: China News Service, Guangming Daily, People's 
Daily and People's Liberation Army Daily.

Data was manually selected for translation according to several 
criteria, including linguistic features and topic features. The files 
were formatted into a human-readable translation format and assigned to 
translation vendors. Translators followed LDC's Chinese to English 
translation guidelines. Bilingual LDC staff performed quality control 
procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files 
are tab-delimited files containing one segment of text along with meta 
information about that segment. Each field in the TDF file is described 
in TDF_format.text. All data are encoded in UTF-8.
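Since TDF files are tab-delimited and UTF-8 encoded, they can probably 
be read with standard tools. The Python sketch below splits each line 
into its fields. The function name is hypothetical, and the sketch 
assumes one segment per line; it deliberately does not name the columns 
or treat any line as a header, since the field definitions are given 
only in TDF_format.text.

```python
import csv


def read_tdf_rows(path):
    """Yield each non-empty line of a tab-delimited file as a list of fields.

    Quote characters are treated as ordinary text (QUOTE_NONE), on the
    assumption that segment text may contain unbalanced quotation marks.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:  # skip blank lines
                yield row
```

Which column holds the segment text, and whether comment or header lines 
need to be skipped, should be taken from TDF_format.text rather than 
from this sketch.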


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
