[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Sep 25 16:11:54 UTC 2014
New publications:
- ACE 2007 Multilingual Training Corpus
- GALE Arabic-English Word Alignment -- Broadcast Training Part 1
- GALE Phase 2 Chinese Newswire Parallel Text Part 2
------------------------------------------------------------------------
*New publications*
(1) ACE 2007 Multilingual Training Corpus
<https://catalog.ldc.upenn.edu/LDC2014T18> was developed by LDC and
contains the complete set of Arabic and Spanish training data for the
2007 Automatic Content Extraction
<http://www.itl.nist.gov/iad/mig/tests/ace/2007/> (ACE) technology
evaluation, specifically, Arabic and Spanish newswire data and Arabic
weblogs annotated for entities and temporal expressions. The objective
of the ACE program was to develop automatic content extraction
technology to support automatic processing of human language in text
form from a variety of sources including newswire, broadcast programming
and weblogs. In the 2007 evaluation, participants were tested on system
performance for the recognition of entities, values, temporal
expressions, relations, and events in Chinese and English and for the
recognition of entities and temporal expressions in Arabic and Spanish.
LDC's work in the ACE program is described in more detail on the LDC ACE
project <https://www.ldc.upenn.edu/collaborations/past-projects/ace> pages.
The Arabic data is composed of newswire (60%) published in October
2000-December 2000 and weblogs (40%) published during the period
November 2004-February 2005. The Spanish data set consists entirely of
newswire material from multiple sources published in January 2005-April
2005. A document pool was established for each language based on genre
and epoch requirements. Humans reviewed the pool to select individual
documents suitable for ACE annotation, such as documents that were
representative of their genre and contained targeted ACE entity types.
One annotator completed the entity and temporal expression (TIMEX2)
markup in the first pass annotation. This work was reviewed in the
second pass by a senior annotator. TIMEX2 values were normalized by an
annotator specifically trained for that task.
The table below describes the amount of data included in the current
release and its annotation status. Corpus content for each language and
data type is represented in the three stages of annotation: first pass
annotation (1P), second pass annotation (2P) and TIMEX2 normalization
and additional quality control (NORM).
Arabic (NW = newswire, WL = weblogs)

                  Words                        Files
          1P       2P       NORM       1P     2P     NORM
NW        58,015   58,015   58,015     257    257    257
WL        40,338   40,338   40,338     121    121    121
Total     98,353   98,353   98,353     378    378    378

Spanish (NW = newswire)

                  Words                        Files
          1P       2P       NORM       1P     2P     NORM
NW        100,401  100,401  100,401    352    352    352
Total     100,401  100,401  100,401    352    352    352

For each newswire story or weblog entry, each of the three annotation
directories "1p", "2p" and "timex2norm" contains an identical copy of
the source text (SGML .sgm file) along with that stage's version of the
associated annotations (XML .ag.xml and .apf.xml files and plain text
.tab files). All files are encoded in UTF-8.
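
As a rough illustration of the layout described above, the following
Python sketch builds the file paths one might expect for a single
document. The root directory name, the document ID and the assumption
that files sit directly inside each annotation directory are all
hypothetical; the actual release may nest files differently (for example
by language or genre).

```python
from pathlib import Path

# Annotation stage directories named in the announcement.
STAGE_DIRS = ("1p", "2p", "timex2norm")
# File extensions named in the announcement.
EXTENSIONS = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def expected_files(root, doc_id):
    """Return {stage: [paths]} for one document.

    Assumption (not confirmed by the announcement): files sit directly
    inside each stage directory under `root`.
    """
    root = Path(root)
    return {
        stage: [root / stage / f"{doc_id}{ext}" for ext in EXTENSIONS]
        for stage in STAGE_DIRS
    }


if __name__ == "__main__":
    # "ace2007" and "EXAMPLE_DOC_001" are placeholder names.
    for stage, paths in expected_files("ace2007", "EXAMPLE_DOC_001").items():
        print(stage, [p.name for p in paths])
```

Comparing the .sgm copies across the three directories would be one way
to confirm that the source text is identical at every stage.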
(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 1
<https://catalog.ldc.upenn.edu/LDC2014T19> was developed by LDC and
contains 267,257 tokens of word aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data
in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the
incorporation of linguistic knowledge in word aligned text as a means to
improve automatic word alignment and machine translation quality. This
is accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags to describe
these translation units and relations. Tagging adds contextual,
syntactic and language-specific features to the alignment annotation.
This release consists of Arabic source broadcast news (BN) and broadcast
conversation (BC) data collected by LDC from 2007 to 2009. The
distribution by genre, words, tokens and segments appears below:

Language   Genre   Files   Words     Tokens    Segments
Arabic     BC      231     79,485    103,816   4,114
Arabic     BN      92      131,789   163,441   7,227
Totals             323     211,274   267,257   11,341

Note that word count is based on the untokenized Arabic source, and
token count is based on the tokenized Arabic source.
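
The totals in the table above can be checked with a few lines of Python.
This is only a sanity-check sketch: the figures are copied from this
announcement, and the tokens-per-word ratio it prints is an illustrative
derived quantity, not a statistic reported by LDC.

```python
# Per-genre figures copied from the table in this announcement.
rows = {
    "BC": {"files": 231, "words": 79_485, "tokens": 103_816, "segments": 4_114},
    "BN": {"files": 92, "words": 131_789, "tokens": 163_441, "segments": 7_227},
}
# Totals row as reported in the announcement.
reported_totals = {
    "files": 323, "words": 211_274, "tokens": 267_257, "segments": 11_341,
}


def column_totals(rows):
    """Sum each column across all genres."""
    totals = {}
    for counts in rows.values():
        for key, value in counts.items():
            totals[key] = totals.get(key, 0) + value
    return totals


computed = column_totals(rows)
# Fails loudly if the per-genre rows do not add up to the reported totals.
assert computed == reported_totals

for genre, counts in rows.items():
    # Ratio of tokenized to untokenized Arabic source counts.
    ratio = counts["tokens"] / counts["words"]
    print(f"{genre}: {ratio:.2f} tokens per word")
```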
The Arabic word alignment tasks consisted of the following components:
* Normalizing tokenized tokens as needed
* Identifying different types of links
* Identifying sentence segments not suitable for annotation
* Tagging unmatched words attached to other words or phrases
(3) GALE Phase 2 Chinese Newswire Parallel Text Part 2
<https://catalog.ldc.upenn.edu/LDC2014T20> was developed by LDC. Along
with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains 117,895 tokens of Chinese
source text and corresponding English translations selected from
newswire data collected by LDC in 2007 and translated by LDC or under
its direction.
This release includes 177 source-translation document pairs, comprising
117,895 tokens of translated data. Data is drawn from four distinct
Chinese newswire sources: China News Service, Guangming Daily, People's
Daily and People's Liberation Army Daily.
Data was manually selected for translation according to several
criteria, including linguistic features and topic features. The files
were formatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Chinese to English
translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.
Source data and translations are distributed in TDF format. TDF files
are tab-delimited; each line contains one segment of text along with
metadata about that segment. Each field in the TDF file is described
in TDF_format.text. All data are encoded in UTF-8.
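
For readers who want to inspect the data programmatically, here is a
minimal Python sketch for reading a tab-delimited file of this kind. It
deliberately assigns no names to the columns, since the field layout is
documented in TDF_format.text and is not reproduced in this
announcement; header or comment lines, if present, are not treated
specially. The function name read_tdf and the sample content are
illustrative assumptions.

```python
import csv
import io


def read_tdf(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of fields.

    Quote characters are treated as ordinary text (csv.QUOTE_NONE), which
    is a cautious choice for natural-language segments.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields:  # csv yields [] for blank lines; skip them
            yield fields


if __name__ == "__main__":
    # Made-up two-line sample; real files follow TDF_format.text.
    sample = io.StringIO("file1\tsegment one\nfile1\tsegment two\n")
    for fields in read_tdf(sample):
        print(len(fields), fields)
```

For a file on disk, a stream opened with
open(path, encoding="utf-8", newline="") can be passed in the same way.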
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: 1 (215) 573-1275
University of Pennsylvania      Fax: 1 (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu