<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><b><span
 style="font-size: 12pt; font-family: 'Times New Roman';">- </span><a
 href="#onto">Free Copies of OntoNotes Available</a></b><b><span
 style="font-size: 12pt; font-family: 'Times New Roman';"> -</span></b><br>
<br>

<i>New Publications:<br>

<br>

</i>LDC2010T16<br>

<b>- <a href="#bengali">Indian

Language Part-of-Speech Tagset: Bengali</a></b><b> -</b><br>

<br>

LDC2010T15<br>

<b>- <a href="#muc">Message

Understanding Conference 7 Timed (MUC7_T)</a></b><b> -</b></div>


<hr size="2" width="100%">

<p class="MsoNormal" align="center"><a name="onto"></a><b><span
 style="font-size: 12pt; font-family: 'Times New Roman';">Free Copies
of OntoNotes Available<br>
</span></b></p>

<p class="MsoNormal">LDC
is pleased to announce that the OntoNotes data sets are
now available at no cost.  The OntoNotes project is a collaborative
effort between BBN Technologies, the University of Colorado, the University of
Pennsylvania, and the University of Southern California's Information
Sciences Institute. The goal of the project is to annotate a large corpus
comprising various genres of text (news, conversational telephone speech, weblogs,
Usenet, broadcast, talk shows) in three languages (English, Chinese, and
Arabic) with structural information (syntax and predicate-argument structure)
and shallow semantics (word sense linked to an ontology and coreference).<br>

<br>

OntoNotes builds on and extends two time-tested resources, the <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Penn

Treebank</a> for syntax and the <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">Penn

PropBank</a> for predicate-argument structure. Its semantic

representation will

include word sense disambiguation for verbs and some nouns, with many

of the word

senses connected to an ontology, and coreference. The current goals

call for

annotation of over a million words each of English and Chinese, and

half a

million words of Arabic over five years.<br>

<br>

LDC currently offers three versions of OntoNotes:<br>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T21">LDC2007T21</a>

OntoNotes Release 1.0:  contains 400k words of Chinese newswire data

and

300k words of English newswire data <br>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T04">LDC2008T04</a>

OntoNotes Release 2.0:  adds the following to Release 1.0:  

274k words of Chinese broadcast news data and 200k words of English

broadcast

news data <br>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T24">LDC2009T24</a>

OntoNotes Release 3.0:  adds English and Chinese broadcast conversation
data to Release 2.0.  This release includes 250k words of English
newswire data, 200k words of English broadcast news data, 200k words of
English broadcast conversation material, 250k words of Chinese newswire data,
250k words of Chinese broadcast news material, 150k words of Chinese
broadcast conversation data and 200k words of Arabic newswire material.<br>

<br>

All OntoNotes releases are distributed on one DVD and are subject to

shipping and handling fees.  In addition to OntoNotes, LDC distributes

a wide range of free databases. 

These include version 1.0 of the Buckwalter Arabic Morphological

Analyzer,

TimeBank, FactBank, and data sponsored by the TalkBank project.  For

further information, please visit our <a

 href="http://www.ldc.upenn.edu/About/whatsnew.shtml#1">What's New!

What's Free!

Archive</a>.</p>

<p class="MsoNormal">[<a href="#top">

top </a>]<br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center"><b>New
Publications</b></p>

<p><a name="bengali"></a>(1) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T16">Indian
Language Part-of-Speech Tagset: Bengali</a> is a corpus developed by
Microsoft Research (MSR) India
to support part-of-speech (POS) tagging and other data-driven
linguistic research on Indian languages in general. It was created as
part of the <a
 href="http://research.microsoft.com/en-us/groups/mls/default.aspx">Indian
Language Part-of-Speech Tagset (IL-POST)</a> project, a collaborative
effort among linguists and computer scientists from MSR India, Anna
University, Chennai (AU-KBC), Delhi University, IIT
Bombay, Jawaharlal Nehru University (Delhi) and Tamil University
(Tamilnadu).</p>

<p>The goal of the IL-POST project is to provide a common tagset
framework for Indian languages that offers flexibility, cross-linguistic
compatibility and reusability across those languages. It supports a
three-level hierarchy of Categories, Types and Attributes. The corpus
therefore consists mainly of two levels of information for each lexical
token: (a) the lexical Category and Type, and (b) a set of morphological
attributes and their associated values in context.</p>

<p>Bengali (also referred to as Bangla) is a member of the Eastern
Indo-Aryan language group. It is native to the region of Bengal, which
consists of Bangladesh, the Indian state of West Bengal, and parts of
the Indian states of Tripura and Assam. It is spoken by more than 210
million people as a first or second language, with around 100 million
speakers in Bangladesh, about 85 million speakers in India, and others
in immigrant communities in the United Kingdom, the USA and the Middle
East.</p>

<p class="MsoNormal">This corpus contains 7,168 sentences (102,933 words)
of manually annotated text from modern standard Bengali sources including
blogs, <a href="http://en.wikipedia.org">Wikipedia</a>, <a
 href="http://www.multikulti.org.uk">Multikulti</a> and a portion of
the <a href="http://www.elda.org/catalogue/en/text/W0037.html">EMILLE/CIIL</a>
corpus. The annotated data is organized into two folders, Bangla1 (3,684
sentences, 51,091 words) and Bangla2 (3,484 sentences, 51,842 words), which
correspond to the two stages in which the data was annotated. All annotated
data is provided in both XML and text files. Each data file contains between
3,000 and 5,000 words. The XML files also contain metadata about the
material, such as language, encoding and data size.</p>

<p>The Annotation Guidelines for Bangla contain a detailed description
of the annotation methodology. The Annotation Tool Guideline 1.0
describes the annotation interface developed for the
IL-POST framework; the tool itself is not included in this release.</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">Non-members may
license this data by submitting a completed copy of the <a
 href="http://www.ldc.upenn.edu/Catalog/mem_agree/Indian_Language_POS_Tagset_Bengali_License_Agreement.html">Microsoft
Research India License Agreement</a>.  The agreement can be faxed to +1
215 573 2175 or scanned and emailed to LDC.  This data is available
at no charge.<br>
</p>

<p class="MsoNormal" style="margin-bottom: 12pt;">[<a href="#top">top</a>]</p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
 align="center">*</p>

<p><a name="muc"></a>(2) <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T15">Message
Understanding Conference 7 Timed (MUC7_T)</a> was developed by
researchers at the Jena University Language &amp; Information Engineering
(JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a
re-annotation of a portion of the <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T02">MUC7</a>
corpus (Linguistic Data Consortium, LDC2001T02), which consists of New
York Times news stories annotated for use in the Message Understanding
Conference 7 (MUC7) evaluation.  The series of MUC evaluations in the
1990s focused on emerging information extraction technologies. Further
information about the MUC7 evaluation can be found <a
 href="http://www.itl.nist.gov/iaui/894.02/related_projects/muc">here</a>.</p>

<p>MUC7_T consists of 100 articles from the MUC7 corpus training set
re-annotated for named entities (persons, locations and organizations)
with a time stamp indicating the time measured for the linguistic
decision-making process. The corpus was developed for two principal
purposes: for use in evaluations of selective sampling strategies, such
as Active Learning; and to create predictive models for annotation
costs. The annotation was performed by two advanced students of
linguistics with good English language skills who followed the original
guidelines of the MUC7 named entity task (which can be found
in the <a href="http://www.ldc.upenn.edu/Catalog/docs/LDC2001T02/">online
documentation</a> for the MUC7 corpus).</p>

<p>The data is stored in XML format. Each annotation example is
represented by an anno_example element, which includes the original MUC7
document as text context. The MUC7 document was tokenized using the
Stanford Tokenizer, with whitespace marking token boundaries. The
tokenizer is part of the Stanford Parser package, which can be obtained
from <a
 href="http://nlp.stanford.edu/software/lex-parser.shtml"
 title="The Stanford Natural Language Processing Group">The Stanford
Natural Language Processing Group</a>.</p>

<br>

<br>

[<a href="#top">

top </a>]<br>

<hr size="2" width="100%">

<div align="center">

<pre class="moz-signature" cols="72">Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>


</body>

</html>