[Corpora-List] News from LDC

Fri Sep 24 16:30:04 UTC 2010

- Free Copies of OntoNotes Available

New Publications:

LDC2010T16
- Indian Language Part-of-Speech Tagset: Bengali

LDC2010T15
- Message Understanding Conference 7 Timed (MUC7_T)

------------------------------------------------------------------------

*Free Copies of OntoNotes Available*

LDC is pleased to announce that the OntoNotes data sets are now 
available at no cost.  The OntoNotes project is a collaborative effort 
between BBN Technologies, the University of Colorado, the University of 
Pennsylvania, and the University of Southern California's Information 
Sciences Institute. The goal of the project is to annotate a large 
corpus comprising various genres of text (news, conversational telephone 
speech, weblogs, Usenet, broadcast, talk shows) in three languages 
(English, Chinese, and Arabic) with structural information (syntax and 
predicate argument structure) and shallow semantics (word sense linked 
to an ontology and coreference).

OntoNotes builds on and extends two time-tested resources, the Penn 
Treebank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42> 
for syntax and the Penn PropBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14> 
for predicate-argument structure. Its semantic representation will 
include word sense disambiguation for verbs and some nouns, with many of 
the word senses connected to an ontology, and coreference. The current 
goals call for annotation of over a million words each of English and 
Chinese, and half a million words of Arabic over five years.

LDC currently offers three versions of OntoNotes:

LDC2007T21 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T21> 
OntoNotes Release 1.0:  contains 400k words of Chinese newswire data and 
300k words of English newswire data

LDC2008T04 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T04> 
OntoNotes Release 2.0:  adds the following to Release 1.0:   274k words 
of Chinese broadcast news data and 200k words of English broadcast news 
data

LDC2009T24 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T24> 
OntoNotes Release 3.0:  adds English and Chinese broadcast conversation 
data to Release 2.0.   This release includes 250k words of English 
newswire data, 200k of English broadcast news data, 200k words of 
English broadcast conversation material, 250k words of Chinese newswire 
data, 250k words of Chinese broadcast news material, 150k words of 
Chinese broadcast conversation data and 200k words of Arabic newswire 
material.

All OntoNotes releases are distributed on one DVD and are subject to 
shipping and handling fees.  In addition to OntoNotes, LDC distributes a 
wide range of free databases.  These include version 1.0 of the 
Buckwalter Arabic Morphological Analyzer, TimeBank, FactBank, and data 
sponsored by the TalkBank project.  For further information, please 
visit our What's New! What's Free! Archive 
<http://www.ldc.upenn.edu/About/whatsnew.shtml#1>.


*New Publications*

(1) Indian Language Part-of-Speech Tagset: Bengali 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T16> 
is a corpus developed by Microsoft Research (MSR) India to support the 
task of Part-of-Speech Tagging (POS) and other data-driven linguistic 
research on Indian Languages in general. It was created as part of the 
Indian Language Part-of-Speech Tagset (IL-POST) 
<http://research.microsoft.com/en-us/groups/mls/default.aspx> project, a 
collaborative effort among linguists and computer scientists from MSR 
India, Anna University, Chennai (AU-KBC), Delhi University, IIT 
Bombay, Jawaharlal Nehru University (Delhi) and Tamil University 
(Tamilnadu).

The goal of the IL-POST project is to provide a common tagset framework 
for Indian Languages that offers flexibility, cross-linguistic 
compatibility and reusability across those languages. It supports a 
three-level hierarchy of Categories, Types and Attributes. The corpus 
therefore consists mainly of two levels of information for each 
lexical token: (a) the lexical Category and Type, and (b) a set of 
morphological attributes and their associated values in context.
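
As a rough illustration of the two levels of information described 
above, a token's annotation might be modeled as follows. This is only a 
sketch: the class name, the field names, the label format, and the 
example values ("N", "NC", "number", "case") are assumptions made for 
illustration and are not taken from the IL-POST tagset or the corpus 
files.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenAnnotation:
    """Hypothetical container for one annotated token."""
    token: str
    category: str                  # level (a): lexical Category
    type_: str                     # level (a): Type within the Category
    attributes: Dict[str, str] = field(default_factory=dict)  # level (b)

    def label(self) -> str:
        # Join attributes in sorted order so the label is deterministic.
        attrs = ".".join(f"{k}={v}" for k, v in sorted(self.attributes.items()))
        base = f"{self.category}/{self.type_}"
        return f"{base}[{attrs}]" if attrs else base


# Placeholder values only; real tags come from the annotation guidelines.
example = TokenAnnotation("example", "N", "NC", {"number": "sg", "case": "dir"})
print(example.label())
```

Keeping the attributes in a separate mapping mirrors the split between 
level (a) and level (b), so the Category and Type can be compared 
without looking at the morphological detail.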

Bengali (also referred to as Bangla) is a member of the Eastern 
Indo-Aryan language group. It is native to the region of Bengal which 
consists of Bangladesh, the Indian state of West Bengal, and parts of 
the Indian states of Tripura and Assam. It is spoken by more than 210 
million people as a first or second language, with around 100 million 
speakers in Bangladesh, about 85 million speakers in India, and others 
in immigrant communities in the United Kingdom, USA and the Middle East.

This corpus contains 7168 sentences (102933 words) of manually annotated 
text from modern standard Bengali sources including blogs, Wikipedia 
<http://en.wikipedia.org>, Multikulti <http://www.multikulti.org.uk> and 
a portion of the EMILLE/CIIL 
<http://www.elda.org/catalogue/en/text/W0037.html> corpus. The annotated 
data is structured into two folders, Bangla1 (3684 sentences, 51091 
words) and Bangla2 (3484 sentences, 51842 words), which represent the 
two stages in which the data was annotated. All annotated data is 
provided in both XML and text files. Each data file contains between 
3,000 and 5,000 words. The XML files contain metadata about the 
material, such as language, encoding and data size.
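
The per-folder figures quoted above can be checked against the corpus 
totals with a few lines of arithmetic. The numbers come from this 
announcement; the variable names are only illustrative.

```python
# (sentences, words) per folder, as stated in the announcement.
folders = {
    "Bangla1": (3684, 51091),
    "Bangla2": (3484, 51842),
}

total_sentences = sum(sentences for sentences, _ in folders.values())
total_words = sum(words for _, words in folders.values())

print(total_sentences, total_words)  # 7168 102933

# Average sentence length across both folders, in words.
average_length = total_words / total_sentences
print(round(average_length, 2))
```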

The Annotation Guidelines for Bangla contain a detailed description of 
the annotation methodology. The Annotation Tool Guideline 1.0 describes 
the annotation interface developed for the IL-POST framework; the tool 
is not included in this release.

Non-members may license this data by submitting a completed copy of the 
Microsoft Research India License Agreement 
<http://www.ldc.upenn.edu/Catalog/mem_agree/Indian_Language_POS_Tagset_Bengali_License_Agreement.html>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  This data is available at no charge.


(2) Message Understanding Conference 7 Timed (MUC7_T) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T15> 
was developed by researchers at Jena University Language & Information 
Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. 
It is a re-annotation of a portion of the MUC7 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T02> 
corpus (Linguistic Data Consortium, LDC2001T02), which consists of New 
York Times news stories annotated for use in the Message Understanding 
Conference 7 (MUC7) evaluation.  The series of MUC evaluations in the 
1990s focused on emerging information extraction technologies. Further 
information about the MUC7 evaluation can be found here 
<http://www.itl.nist.gov/iaui/894.02/related_projects/muc>.

MUC7_T consists of 100 articles from the MUC7 corpus training set 
reannotated for named entities (persons, locations and organizations) 
with a time stamp indicating the time measured for the linguistic 
decision making process. The corpus was developed for two principal 
purposes: for use in evaluations of selective sampling strategies, such 
as Active Learning; and to create predictive models for annotation 
costs. The annotation was performed by two advanced students of 
linguistics with good English language skills who followed the 
original guidelines of the MUC7 named entity task (which can be found in 
the online documentation 
<http://www.ldc.upenn.edu/Catalog/docs/LDC2001T02/> for the MUC7 corpus).

The data is stored in XML format. Each annotation example has an 
anno_example element that includes the original MUC7 document as text 
context. The MUC7 document was tokenized using the Stanford Tokenizer, 
with white spaces marking token boundaries. The tokenizer is part of the 
Stanford Parser package which can be obtained from The Stanford Natural 
Language Processing Group 
<http://nlp.stanford.edu/software/lex-parser.shtml>.
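
Because tokens are separated by white space, the text context of each 
anno_example element can be split into tokens with standard tools. The 
sketch below uses Python's xml.etree.ElementTree on a made-up sample. 
Only the element name anno_example comes from the description above; 
the root element "corpus", the child element "context", and the 
sample sentences are assumptions, and the real files may be laid out 
differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; not copied from the MUC7_T distribution.
SAMPLE = """<corpus>
  <anno_example>
    <context>The company said on Monday .</context>
  </anno_example>
  <anno_example>
    <context>Shares rose in New York .</context>
  </anno_example>
</corpus>"""


def iter_contexts(xml_text):
    """Yield the whitespace-separated tokens of each anno_example."""
    root = ET.fromstring(xml_text)
    for example in root.iter("anno_example"):
        context = example.find("context")  # assumed child element name
        text = context.text if context is not None and context.text else ""
        yield text.split()


token_lists = list(iter_contexts(SAMPLE))
print(len(token_lists), token_lists[0])
```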

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora