[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri May 29 14:45:54 UTC 2009
LDC2009T12 - 2008 CoNLL Shared Task Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12>
LDC2009T13 - English Gigaword Fourth Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13>
LDC2009T09 - GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09>

The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
------------------------------------------------------------------------
New Publications
(1) 2008 CoNLL Shared Task Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12>
contains the trial corpus, training corpus, development and test data
for the 2008 CoNLL (Conference on Computational Natural Language
Learning) Shared Task Evaluation <http://www.yr-bcn.es/conll2008/>. The
2008 Shared Task developed syntactic dependency annotations, including
information such as named-entity boundaries, and semantic dependencies
that model the roles of both verbal and nominal predicates. The
materials in the Shared Task data consist of excerpts from the following
corpora: Treebank-3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>
LDC99T42, BBN Pronoun Coreference and Entity Type Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>
LDC2005T33, Proposition Bank I
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>
LDC2004T14 (PropBank) and NomBank v 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23>
LDC2008T23.
The Conference on Computational Natural Language Learning (CoNLL)
<http://www.cnts.ua.ac.be/conll2008/> is accompanied every year by a
shared task intended to promote natural language processing applications
and evaluate them in a standard setting. The 2008 shared task employed
a unified dependency-based formalism and merged the task of syntactic
dependency parsing and the task of identifying semantic arguments and
labeling them with semantic roles.
The 2008 shared task was divided into three subtasks:
1. parsing syntactic dependencies
2. identification and disambiguation of semantic predicates
3. identification of arguments and assignment of semantic roles for
each predicate
Several objectives were addressed in this shared task:
* Semantic Role Labeling (SRL) was performed and evaluated using a
dependency-based representation for both syntactic and semantic
dependencies. While SRL on top of a dependency treebank has been
addressed before, the approach of the 2008 Shared Task was
characterized by the following novelties:
1. The constituent-to-dependency conversion strategy
transformed all annotated semantic arguments in PropBank and
NomBank v 1.0, not just a subset;
2. The annotations addressed propositions centered around both
verbal (PropBank) and nominal (NomBank) predicates.
* Based on the observation that a richer set of syntactic
dependencies improves semantic processing, the syntactic
dependencies modeled are more complex than the ones used in the
previous CoNLL shared tasks. For example, the corpus includes
apposition links, dependencies derived from named entity (NE)
structures, and better modeling of long-distance grammatical
relations.
* A practical framework is provided for the joint learning of
syntactic and semantic dependencies.
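As a rough illustration (not part of the release itself), shared task data of this kind is distributed in a columnar, tab-separated format with one token per line and blank lines between sentences. The six-column layout below (ID, FORM, LEMMA, POS, HEAD, DEPREL) is a simplified assumption for the sketch, not the official 2008 Shared Task column order:

```python
# Sketch: reading a CoNLL-style tab-separated dependency file.
# The column layout here is an illustrative simplification.

def read_conll(lines):
    """Group tab-separated token lines into sentences (blank-line separated)."""
    sentence, sentences = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "pos": cols[3],
            "head": int(cols[4]),   # 0 marks the syntactic root
            "deprel": cols[5],
        })
    if sentence:
        sentences.append(sentence)
    return sentences

sample = [
    "1\tJohn\tjohn\tNNP\t2\tSBJ",
    "2\tsleeps\tsleep\tVBZ\t0\tROOT",
    "",
]
for sent in read_conll(sample):
    for tok in sent:
        print(tok["form"], "->", tok["head"], tok["deprel"])
```

A joint syntactic/semantic system would add further columns for predicate senses and per-predicate argument labels on top of this token-per-line layout.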
------------------------------------------------------------------------
(2) English Gigaword Fourth Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13>.
English Gigaword, now being released in its fourth edition, is a
comprehensive archive of newswire text data that has been acquired over
several years by the LDC at the University of Pennsylvania. The fourth
edition includes all of the contents in English Gigaword Third Edition
(LDC2007T07
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07>)
plus new data covering the 24-month period of January 2007 through
December 2008.
The six distinct international sources of English newswire included in
this edition are the following:
* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)
New in the Fourth Edition:
* Articles with significant Spanish language content have now been
identified and documented.
* Markup has been simplified and made consistent throughout the corpus.
* Information structure has been simplified.
* Character entities have been simplified.
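For illustration only, Gigaword documents are stored in SGML-like markup. A minimal reader assuming just the <DOC>, <TEXT>, and <P> tags (a simplification of the full file structure, not a complete parser for the released corpus) could look like this:

```python
import re

# Sketch only: assumes a minimal <DOC>/<TEXT>/<P> structure in the
# style of Gigaword markup; real corpus files contain more detail.

DOC_RE = re.compile(r'<DOC id="([^"]+)"[^>]*>(.*?)</DOC>', re.DOTALL)
P_RE = re.compile(r"<P>\s*(.*?)\s*</P>", re.DOTALL)

def iter_docs(sgml_text):
    """Yield (doc_id, paragraphs) pairs from a Gigaword-style string."""
    for doc_id, body in DOC_RE.findall(sgml_text):
        # Collapse internal whitespace within each paragraph.
        paragraphs = [" ".join(p.split()) for p in P_RE.findall(body)]
        yield doc_id, paragraphs

sample = '''<DOC id="AFP_ENG_20070101.0001" type="story">
<TEXT>
<P>
First paragraph of the story.
</P>
</TEXT>
</DOC>'''

for doc_id, paras in iter_docs(sample):
    print(doc_id, paras)
```

The source prefixes listed above (afp_eng, apw_eng, etc.) appear in the document IDs, so a reader like this can also be used to split the archive by source.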
------------------------------------------------------------------------
(3) GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09>
contains a total of 145,000
words (263 files) of Arabic newsgroup text and its translation selected
from thirty-five sources. Newsgroups consist of posts to electronic
bulletin boards, Usenet newsgroups, discussion groups and similar
forums. This release was used as training data in Phase 1 (year 1) of
the DARPA-funded GALE program. This is the second of a two-part release.
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03>
was released in early 2009.
Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable newsgroup
text. Data scouts were assigned particular topics and genres along with
a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest to a
database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so that
the data could be more easily integrated into downstream annotation
processes. Typically, a new script was required for each new domain name
that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems.
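The nightly query-and-harvest step described above can be sketched as follows. The announcement does not describe the actual database schema or tooling, so the table and column names (scout_log, url, status) and the injectable fetch function are invented for illustration:

```python
import sqlite3

# Hypothetical sketch of the nightly harvest: query the annotation
# database for URLs flagged by data scouts, then fetch each one.
# Schema and names below are invented, not LDC's actual setup.

def harvest(conn, fetch):
    """conn: sqlite3 connection; fetch: callable url -> bytes (may raise OSError)."""
    urls = [row[0] for row in
            conn.execute("SELECT url FROM scout_log WHERE status = 'selected'")]
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError:
            pages[url] = None   # record the failure; retry on a later run
    return pages

# Demo with an in-memory database and a stubbed fetcher.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scout_log (url TEXT, status TEXT)")
conn.executemany("INSERT INTO scout_log VALUES (?, ?)",
                 [("http://example.com/thread1", "selected"),
                  ("http://example.com/skip", "rejected")])

def fake_fetch(url):
    return b"<html>post body</html>"

print(harvest(conn, fake_fetch))
```

Injecting the fetch function keeps the harvest logic testable without network access; a real run would pass an HTTP client that downloads the whole site, not just the flagged thread.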
The selected documents were then reviewed for content-suitability using
a semi-automatic process. A statistical approach was used to rank a
document's relevance to a set of already-selected documents labeled as
"good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular
annotation task or for annotation in general. These newly-judged
documents in turn provided additional input for the generation of new
ranked lists.
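The announcement does not name the statistical method used for relevance ranking. As one plausible sketch, candidates can be ranked by cosine similarity to a TF-IDF centroid of the already-selected "good" documents; everything below is illustrative, not LDC's actual procedure:

```python
import math
from collections import Counter

# Illustrative relevance ranking: TF-IDF centroid of "good" documents,
# cosine similarity for candidates. Not LDC's actual method.

def _tfidf(tokens, df, n_docs):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf}

def _cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(good_docs, candidates):
    """Return (candidate, score) pairs sorted by similarity to the good set."""
    all_tokens = [d.lower().split() for d in good_docs + candidates]
    df = Counter()
    for toks in all_tokens:
        df.update(set(toks))
    n = len(all_tokens)
    good_vecs = [_tfidf(t, df, n) for t in all_tokens[:len(good_docs)]]
    centroid = Counter()
    for vec in good_vecs:
        for t, w in vec.items():
            centroid[t] += w / len(good_vecs)
    cand_vecs = [_tfidf(t, df, n) for t in all_tokens[len(good_docs):]]
    return sorted(zip(candidates, (_cosine(v, centroid) for v in cand_vecs)),
                  key=lambda kv: kv[1], reverse=True)

good = ["political discussion forum post about elections",
        "debate thread on government policy"]
cands = ["new post about election policy debate",
         "recipe for lentil soup"]
for text, score in rank_candidates(good, cands):
    print(round(score, 3), text)
```

Each batch of newly judged documents would be folded back into the good set, regenerating the ranked list as the text describes.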
Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of sentence-ending SUs were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features and quality
control procedures applied to completed translations.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora