[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri May 29 14:45:54 UTC 2009


LDC2009T12
-  2008 CoNLL Shared Task Data
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12>

LDC2009T13
-  English Gigaword Fourth Edition
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13>

LDC2009T09
-  GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09>

The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.

------------------------------------------------------------------------
New Publications

(1)  2008 CoNLL Shared Task Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12> 
contains the trial corpus, training corpus, and development and test data 
for the 2008 CoNLL (Conference on Computational Natural Language 
Learning) Shared Task Evaluation <http://www.yr-bcn.es/conll2008/>. The 
2008 Shared Task developed syntactic dependency annotations, including 
information such as named-entity boundaries and semantic dependencies 
that model the roles of both verbal and nominal predicates. The 
materials in the Shared Task data consist of excerpts from the following 
corpora: Treebank-3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42> 
LDC99T42, BBN Pronoun Coreference and Entity Type Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33> 
LDC2005T33, Proposition Bank I 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14> 
LDC2004T14 (PropBank), and NomBank v 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23> 
LDC2008T23.

The Conference on Computational Natural Language Learning (CoNLL) 
<http://www.cnts.ua.ac.be/conll2008/> is accompanied every year by a 
shared task intended to promote natural language processing applications 
and evaluate them in a standard setting.  The 2008 shared task employed 
a unified dependency-based formalism and merged two tasks: syntactic 
dependency parsing, and the identification of semantic arguments and 
their labeling with semantic roles.

The 2008 shared task was divided into three subtasks:

   1. parsing syntactic dependencies
   2. identification and disambiguation of semantic predicates
   3. identification of arguments and assignment of semantic roles for
      each predicate

Several objectives were addressed in this shared task:

    * Semantic Role Labeling (SRL) was performed and evaluated using a
      dependency-based representation for both syntactic and semantic
      dependencies. While SRL on top of a dependency treebank has been
      addressed before, the approach of the 2008 Shared Task was
      characterized by the following novelties:
         1. The constituent-to-dependency conversion strategy
            transformed all annotated semantic arguments in PropBank and
            NomBank v 1.0, not just a subset;
         2. The annotations addressed propositions centered around both
            verbal (PropBank) and nominal (NomBank) predicates.
    * Based on the observation that a richer set of syntactic
      dependencies improves semantic processing, the syntactic
      dependencies modeled are more complex than the ones used in the
      previous CoNLL shared tasks. For example, the corpus includes
      apposition links, dependencies derived from named entity (NE)
      structures, and better modeling of long-distance grammatical
      relations.
    * A practical framework is provided for the joint learning of
      syntactic and semantic dependencies.
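
To make the unified format concrete, here is a minimal reading sketch 
in Python. The column layout shown (ID, FORM, LEMMA, POS, HEAD, DEPREL, 
PRED, followed by one argument column per predicate) is a simplified 
assumption for illustration; the corpus documentation defines the 
authoritative column inventory.

# Sketch: read sentences from a CoNLL-2008-style column file.
# The column layout here is a simplified assumption, not the official one.
from dataclasses import dataclass, field

@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # surface word form
    lemma: str
    pos: str
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # syntactic dependency label
    pred: str     # predicate sense, or "_" if not a predicate
    apreds: list = field(default_factory=list)  # one arg label per predicate

def read_sentences(path):
    """Yield sentences as lists of Tokens; blank lines separate sentences."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sentence:
                    yield sentence
                sentence = []
                continue
            cols = line.split("\t")
            sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                                  int(cols[4]), cols[5], cols[6],
                                  cols[7:]))
    if sentence:
        yield sentence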


***

(2)  English Gigaword Fourth Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13>.  
English Gigaword, now being released in its fourth edition, is a 
comprehensive archive of newswire text data acquired over several years 
by the LDC at the University of Pennsylvania. The fourth edition 
includes all of the contents of English Gigaword Third Edition 
(LDC2007T07 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07>) 
plus new data covering the 24-month period of January 2007 through 
December 2008.

The six distinct international sources of English newswire included in 
this edition are the following:

    * Agence France-Presse, English Service (afp_eng)
    * Associated Press Worldstream, English Service (apw_eng)
    * Central News Agency of Taiwan, English Service (cna_eng)
    * Los Angeles Times/Washington Post Newswire Service (ltw_eng)
    * New York Times Newswire Service (nyt_eng)
    * Xinhua News Agency, English Service (xin_eng)

New in the Fourth Edition:

    * Articles with significant Spanish language content have now been
      identified and documented.
    * Markup has been simplified and made consistent throughout the
      corpus (see the parsing sketch after this list).
    * Information structure has been simplified.
    * Character entities have been simplified.
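
As a rough illustration of consuming the corpus, the sketch below pulls 
documents out of one data file. It assumes the <DOC>/<HEADLINE>/<TEXT>/<P> 
markup pattern and gzip compression familiar from earlier Gigaword 
editions; check the fourth edition's own documentation before relying 
on these details.

# Hedged sketch: iterate over documents in a Gigaword-style SGML file.
# Tag names and compression are assumptions based on earlier editions.
import gzip
import re

DOC_RE = re.compile(r'<DOC id="([^"]+)"[^>]*>(.*?)</DOC>', re.S)
P_RE = re.compile(r'<P>\s*(.*?)\s*</P>', re.S)

def iter_docs(path):
    """Yield (doc_id, paragraphs) pairs from one compressed data file."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        data = f.read()
    for doc_id, body in DOC_RE.findall(data):
        paragraphs = [re.sub(r'\s+', ' ', p) for p in P_RE.findall(body)]
        yield doc_id, paragraphs

# Hypothetical usage (the file name is invented):
# for doc_id, paras in iter_docs("xin_eng_200801.gz"):
#     print(doc_id, len(paras))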

***

(3)  GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09> 
contains a total of 145,000 words (263 files) of Arabic newsgroup text 
and its English translation selected from thirty-five sources. 
Newsgroups consist of posts to electronic bulletin boards, Usenet 
newsgroups, discussion groups and similar forums. This release was used 
as training data in Phase 1 (year 1) of the DARPA-funded GALE program. 
This is the second of a two-part release. GALE Phase 1 Arabic Newsgroup 
Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03> 
was released in early 2009.

Preparing the source data involved four stages of work: data scouting, 
data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable newsgroup 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest in 
a database. A nightly process queried the annotation database and 
harvested all designated URLs. Whenever possible, the entire site was 
downloaded, not just the individual thread or post located by the data 
scout. Once the text was downloaded, its format was standardized so that 
the data could be more easily integrated into downstream annotation 
processes. Typically, a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.
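
The nightly harvest can be pictured with a small script like the one 
below. Everything here (the SQLite database, the scouted_urls table, 
the status values) is invented for illustration; LDC's internal tooling 
is not described beyond the paragraph above.

# Hypothetical nightly harvest: fetch every URL that data scouts have
# designated in the annotation database. All names are invented.
import os
import sqlite3
import urllib.request

def nightly_harvest(db_path="annotation.db", out_dir="raw"):
    os.makedirs(out_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, url FROM scouted_urls WHERE status = 'designated'"
    ).fetchall()
    for row_id, url in rows:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                payload = resp.read()
            with open(os.path.join(out_dir, f"{row_id}.html"), "wb") as out:
                out.write(payload)
            new_status = "harvested"
        except OSError:
            new_status = "failed"
        conn.execute("UPDATE scouted_urls SET status = ? WHERE id = ?",
                     (new_status, row_id))
    conn.commit()
    conn.close()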

The selected documents were then reviewed for content-suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general. These newly-judged 
documents in turn provided additional input for the generation of new 
ranked lists.
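
The announcement does not name the statistical model, so the following 
is only one plausible instantiation: rank candidate documents by cosine 
similarity of bag-of-words vectors against the centroid of the 
documents already judged "good."

# One plausible relevance ranker; the actual method used is not specified.
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_candidates(good_docs, candidates):
    """Return candidates sorted by similarity to the 'good' centroid."""
    centroid = Counter()
    for doc in good_docs:
        centroid.update(bow(doc))
    return sorted(candidates, key=lambda d: cosine(bow(d), centroid),
                  reverse=True)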

Manual sentence unit/segment (SU) annotation was also performed as part 
of the transcription task. Three types of sentence-ending SU were 
identified: statement SU, question SU, and incomplete SU. After 
transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features, and quality 
control procedures applied to completed translations.
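
The SU labels themselves were assigned manually, but the three 
sentence-ending categories can be illustrated with a toy punctuation 
heuristic:

# Toy heuristic only; the corpus's SU labels were assigned by annotators.
def su_type(segment: str) -> str:
    s = segment.rstrip()
    if s.endswith("?"):
        return "question"    # question SU
    if s.endswith((".", "!")):
        return "statement"   # statement SU
    return "incomplete"      # incomplete SU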

------------------------------------------------------------------------

 

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

