<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="text-align: center;" align="center">LDC2009T12<br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12"><b>2008

CoNLL Shared Task Data</b></a>

-<br>

<br>

LDC2009T13<br>

-  <b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13">English

Gigaword Fourth Edition</a>  -</b><br>

<br>

LDC2009T09<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09">GALE

Phase 1 Arabic Newsgroup Parallel Text - Part 2</a>  -<o:p> <br>

<br>

</o:p></b></p>

<p class="MsoNormal" style="text-align: center;" align="center"><o:p>The

Linguistic Data Consortium (LDC) would like to announce the

availability of three new publications.</o:p><b><o:p><br>

</o:p></b></p>

<div class="MsoNormal" style="text-align: center;" align="center">

<hr align="center" size="2" width="100%"></div>

<div align="center"><b>New Publications</b><o:p></o:p>

</div>

<p>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T12">2008

CoNLL Shared Task Data</a> contains the trial corpus, training corpus,

development and test data for the <a

 href="http://www.yr-bcn.es/conll2008/">2008

CoNLL (Conference on Computational Natural Language Learning) Shared

Task

Evaluation</a>. The 2008 Shared Task developed syntactic dependency
annotations that include information such as named-entity boundaries,
together with semantic dependencies that model the roles of both verbal
and nominal predicates. The

materials in the Shared Task data consist of excerpts from the

following

corpora: <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Treebank-3</a>

LDC99T42, <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33">BBN

Pronoun Coreference and Entity Type Corpus</a> LDC2005T33, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">Proposition

Bank I</a> LDC2004T14 (PropBank) and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23">NomBank

v 1.0</a> LDC2008T23. <o:p></o:p></p>

<p>The <a href="http://www.cnts.ua.ac.be/conll2008/">Conference on

Computational Natural Language Learning (CoNLL)</a> is accompanied

every year

by a shared task intended to promote natural language processing

applications

and evaluate them in a standard setting.  The 2008 shared task employed

a

unified dependency-based formalism and merged the task of syntactic

dependency

parsing and the task of identifying semantic arguments and labeling

them with

semantic roles. <o:p></o:p></p>

<p>The 2008 shared task was divided into three subtasks: <o:p></o:p></p>

<ol start="1" type="1">

  <li class="MsoNormal" style="">parsing syntactic dependencies <o:p></o:p></li>

  <li class="MsoNormal" style="">identification and disambiguation of

semantic predicates <o:p></o:p></li>

  <li class="MsoNormal" style="">identification of arguments and

assignment of semantic roles for each predicate <o:p></o:p></li>

</ol>

<p>Several objectives were addressed in this shared task:<o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">Semantic Role Labeling (SRL) was

performed and evaluated using a dependency-based representation for

both syntactic and semantic dependencies. While SRL on top of a

dependency treebank has been addressed before, the approach of the 2008

Shared Task was characterized by the following novelties: <o:p></o:p></li>

  <ol start="1" type="1">

    <li class="MsoNormal" style="">The constituent-to-dependency

conversion strategy transformed all annotated semantic arguments in

PropBank and NomBank v 1.0, not just a subset; <o:p></o:p></li>

    <li class="MsoNormal" style="">The annotations addressed

propositions centered around both verbal (PropBank) and nominal

(NomBank) predicates. <o:p></o:p></li>

  </ol>

  <li class="MsoNormal" style="">Based on the observation that a richer

set of syntactic dependencies improves semantic processing, the

syntactic dependencies modeled are more complex than the ones used in

the previous CoNLL shared tasks. For example, the corpus includes

apposition links, dependencies derived from named entity (NE)

structures, and better modeling of long-distance grammatical relations.

    <o:p></o:p></li>

  <li class="MsoNormal" style="">A practical framework is provided for

the joint learning of syntactic and semantic dependencies. <o:p></o:p></li>

</ul>
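<p>As an illustration only, the joint representation described above can
be pictured as one sentence carrying two layers of labeled arcs: a
syntactic dependency tree and a set of predicate-argument links. The
sketch below is a hypothetical in-memory structure, not the file format
of the corpus. The class names, the toy sentence, and the labels (SBJ,
OBJ, NMOD, A0, A1 and the predicate sense string) are assumptions chosen
for illustration.</p>

<pre>
```python
from dataclasses import dataclass, field


@dataclass
class Token:
    index: int    # 1-based position in the sentence
    form: str     # surface word
    head: int     # index of the syntactic head, 0 for the root
    deprel: str   # syntactic dependency label (illustrative)


@dataclass
class Predicate:
    index: int    # token index of the predicate
    sense: str    # disambiguated predicate sense (illustrative)
    args: dict = field(default_factory=dict)  # argument token index to role label


@dataclass
class Sentence:
    tokens: list
    predicates: list

    def syntactic_arcs(self):
        # Subtask 1: (head, dependent, label) triples
        return [(t.head, t.index, t.deprel) for t in self.tokens]

    def semantic_arcs(self):
        # Subtasks 2 and 3: (predicate, argument, role) triples
        return [
            (p.index, arg, role)
            for p in self.predicates
            for arg, role in sorted(p.args.items())
        ]


def example_sentence():
    # "Investors bought the shares": a hand-made toy analysis
    tokens = [
        Token(1, "Investors", 2, "SBJ"),
        Token(2, "bought", 0, "ROOT"),
        Token(3, "the", 4, "NMOD"),
        Token(4, "shares", 2, "OBJ"),
    ]
    predicates = [Predicate(2, "buy.01", {1: "A0", 4: "A1"})]
    return Sentence(tokens, predicates)
```
</pre>

<p>Keeping both layers on the same token indices is what would let a
single model score syntactic and semantic arcs together, which is the
sense of joint learning in the list above.</p>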

<br>

<div align="center"><b>*</b><br>

</div>


<p>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T13">English

Gigaword Fourth Edition</a>.<span style="">  </span>English

Gigaword, now being released in its fourth edition, is a comprehensive

archive

of newswire text data that has been acquired over several years by the

LDC at

the University of Pennsylvania. The fourth edition includes all of the

contents

in English Gigaword Third Edition (<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07">LDC2007T07</a>)

plus new data covering the 24-month period of January 2007 through

December

2008. <o:p></o:p></p>

<p>The six distinct international sources of English newswire included

in this

edition are the following:<o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">Agence France-Presse, English Service

(afp_eng) <o:p></o:p></li>

  <li class="MsoNormal" style="">Associated Press Worldstream, English

Service (apw_eng) <o:p></o:p></li>

  <li class="MsoNormal" style="">Central News Agency of Taiwan, English

Service (cna_eng) <o:p></o:p></li>

  <li class="MsoNormal" style="">Los Angeles Times/Washington Post

Newswire Service (ltw_eng) <o:p></o:p></li>

  <li class="MsoNormal" style="">New York Times Newswire Service

(nyt_eng) <o:p></o:p></li>

  <li class="MsoNormal" style="">Xinhua News Agency, English Service

(xin_eng) <o:p></o:p></li>

</ul>
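<p>For scripts that process the corpus, the source identifiers listed
above can be kept in a small lookup table. The following is a minimal
sketch: the names GIGAWORD_SOURCES and source_name are hypothetical, and
only the identifiers and source names are taken from the list above.</p>

<pre>
```python
# Source identifiers and names as listed in this announcement.
GIGAWORD_SOURCES = {
    "afp_eng": "Agence France-Presse, English Service",
    "apw_eng": "Associated Press Worldstream, English Service",
    "cna_eng": "Central News Agency of Taiwan, English Service",
    "ltw_eng": "Los Angeles Times/Washington Post Newswire Service",
    "nyt_eng": "New York Times Newswire Service",
    "xin_eng": "Xinhua News Agency, English Service",
}


def source_name(identifier):
    """Return the full source name for an identifier, ignoring case."""
    key = identifier.strip().lower()
    if key not in GIGAWORD_SOURCES:
        raise KeyError("unknown Gigaword source identifier: " + identifier)
    return GIGAWORD_SOURCES[key]
```
</pre>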

<p><small><big> New in the Fourth Edition:</big><o:p></o:p></small></p>

<ul type="disc">

  <li class="MsoNormal" style="">Articles with significant Spanish

language content have now been identified and documented. <o:p></o:p></li>

  <li class="MsoNormal" style="">Markup has been simplified and made

consistent throughout the corpus. <o:p></o:p></li>

  <li class="MsoNormal" style="">Information structure has been

simplified. <o:p></o:p></li>

  <li class="MsoNormal" style="">Character entities have been

simplified. <o:p></o:p></li>

</ul>

<div align="center"><b>*</b><br>

</div>


<p>(3)  <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T09">GALE
Phase 1 Arabic Newsgroup Parallel Text - Part 2</a>

contains a total of 145,000 words (263 files) of Arabic newsgroup text

and its

translation selected from thirty-five sources. Newsgroups consist of

posts to

electronic bulletin boards, Usenet newsgroups, discussion groups and

similar

forums. This release was used as training data in Phase 1 (year 1) of

the

DARPA-funded GALE program. This is the second of a two-part release. <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03">GALE

Phase 1 Arabic Newsgroup Parallel Text - Part 1</a> was released in

early 2009.<o:p></o:p></p>

<p>Preparing the source data involved four stages of work: data

scouting, data

harvesting, formatting and data selection.<o:p></o:p></p>

<p class="MsoNormal" style="">Data

scouting involved manually searching the web for suitable newsgroup

text. Data

scouts were assigned particular topics and genres along with a

production

target in order to focus their web search. Formal annotation guidelines

and a

customized annotation toolkit helped data scouts to manage the search

process

and to track progress. <o:p></o:p></p>

<p>Data scouts logged their decisions about potential text of interest

to a

database. A nightly process queried the annotation database and

harvested all

designated URLs. Whenever possible, the entire site was downloaded, not

just

the individual thread or post located by the data scout. Once the text

was

downloaded, its format was standardized so that the data could be more

easily

integrated into downstream annotation processes. Typically, a new

script was

required for each new domain name that was identified. After scripts

were run,

an optional manual process corrected any remaining formatting problems.<br>

<br>

The selected documents were then reviewed for content suitability using

a

semi-automatic process. A statistical approach was used to rank a

document's

relevance to a set of already-selected documents labeled as "good."

An annotator then reviewed the list of relevance-ranked documents and

selected

those which were suitable for a particular annotation task or for

annotation in

general. These newly judged documents in turn provided additional input

for the

generation of new ranked lists. <o:p></o:p></p>
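<p>The announcement does not say which statistical approach was used for
relevance ranking. As a rough sketch under that caveat, the selection
loop described above can be imitated with bag-of-words cosine similarity
against the pooled word counts of the documents already labeled "good".
The function names and the similarity measure are assumptions made for
illustration, not a description of LDC's actual tooling.</p>

<pre>
```python
import math
from collections import Counter


def bag_of_words(text):
    # Lowercased whitespace tokens; real Arabic text would need proper tokenization
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(good_docs, candidates):
    # Pool the word counts of all documents already labeled "good"
    profile = Counter()
    for doc in good_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(bag_of_words(c), profile), c) for c in candidates]
    # Highest similarity first; an annotator would then review this list
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```
</pre>

<p>In such a setup, documents accepted by the annotator would be added
to good_docs and the ranking recomputed, which mirrors the feedback step
in which newly judged documents feed the next ranked list.</p>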

<p class="MsoNormal" style="">Manual
sentence unit/segment (SU) annotation was also performed as part of
the
transcription task. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. After transcription and

SU

annotation, files were reformatted into a human-readable translation

format and

assigned to professional translators for careful translation.

Translators

followed LDC's GALE Translation guidelines, which describe the makeup of
the
translation team, the source data format, the translation data format,
best
practices for translating certain linguistic features, and the quality
control
procedures applied to completed translations.<br>

</p>

<hr size="2" width="100%">


<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

<br>


</body>

</html>