[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Sep 24 16:30:04 UTC 2010
- Free Copies of OntoNotes Available

New Publications:
- LDC2010T16: Indian Language Part-of-Speech Tagset: Bengali
- LDC2010T15: Message Understanding Conference 7 Timed (MUC7_T)
------------------------------------------------------------------------
*Free Copies of OntoNotes Available*

LDC is pleased to announce that the OntoNotes data sets are now
available at no cost. The OntoNotes project is a collaborative effort
between BBN Technologies, the University of Colorado, the University of
Pennsylvania, and the University of Southern California's Information
Sciences Institute. The goal of the project is to annotate a large
corpus comprising various genres of text (news, conversational telephone
speech, weblogs, Usenet, broadcast, talk shows) in three languages
(English, Chinese, and Arabic) with structural information (syntax and
predicate argument structure) and shallow semantics (word sense linked
to an ontology and coreference).
OntoNotes builds on and extends two time-tested resources, the Penn
Treebank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>
for syntax and the Penn PropBank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>
for predicate-argument structure. Its semantic representation will
include word sense disambiguation for verbs and some nouns, with many of
the word senses connected to an ontology, and coreference. The current
goals call for annotation of over a million words each of English and
Chinese, and half a million words of Arabic over five years.
LDC currently offers three versions of OntoNotes:

- LDC2007T21
  <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T21>
  OntoNotes Release 1.0: contains 400k words of Chinese newswire data and
  300k words of English newswire data
- LDC2008T04
  <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T04>
  OntoNotes Release 2.0: adds the following to Release 1.0: 274k words
  of Chinese broadcast news data and 200k words of English broadcast news
  data
- LDC2009T24
  <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T24>
  OntoNotes Release 3.0: adds English and Chinese broadcast conversation
  data to Release 2.0. This release includes 250k words of English
  newswire data, 200k of English broadcast news data, 200k words of
  English broadcast conversation material, 250k words of Chinese newswire
  data, 250k words of Chinese broadcast news material, 150k words of
  Chinese broadcast conversation data and 200k words of Arabic newswire
  material.

All OntoNotes releases are distributed on one DVD and are subject to
shipping and handling fees. In addition to OntoNotes, LDC distributes a
wide range of free databases. These include version 1.0 of the
Buckwalter Arabic Morphological Analyzer, TimeBank, FactBank, and data
sponsored by the TalkBank project. For further information, please
visit our What's New! What's Free! Archive
<http://www.ldc.upenn.edu/About/whatsnew.shtml#1>.
*New Publications*
(1) Indian Language Part-of-Speech Tagset: Bengali
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T16>
is a corpus developed by Microsoft Research (MSR) India to support
Part-of-Speech (POS) tagging and other data-driven linguistic
research on Indian languages in general. It was created as part of the
Indian Language Part-of-Speech Tagset (IL-POST)
<http://research.microsoft.com/en-us/groups/mls/default.aspx> project, a
collaborative effort among linguists and computer scientists from MSR
India, Anna University, Chennai (AU-KBC), Delhi University, IIT
Bombay, Jawaharlal Nehru University (Delhi) and Tamil University
(Tamilnadu).
The goal of the IL-POST project is to provide a common tagset framework
for Indian languages that offers flexibility, cross-linguistic
compatibility and reusability across those languages. It supports a
three-level hierarchy of Categories, Types and Attributes. The corpus
therefore consists mainly of two levels of information for
each lexical token: (a) lexical Category and Type, and (b) a set of
morphological attributes and their associated values in context.
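As a rough illustration of the hierarchy described above, the following Python sketch models one annotated token with its Category, Type and attribute-value pairs. The class name, the flattened label format and the sample tag values are all hypothetical and chosen for illustration only; they are not taken from the IL-POST tagset or from the corpus files.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenAnnotation:
    """One lexical token with the two levels of information described above."""
    token: str
    category: str                      # level (a): lexical Category
    type_: str                         # level (a): Type within the Category
    attributes: Dict[str, str] = field(default_factory=dict)  # level (b)

    def label(self) -> str:
        """Flatten the annotation into one string (this format is an assumption)."""
        attrs = ",".join(f"{k}={v}" for k, v in sorted(self.attributes.items()))
        base = f"{self.category}/{self.type_}"
        return f"{base}[{attrs}]" if attrs else base


# Hypothetical values for illustration only; not actual IL-POST tags.
example = TokenAnnotation(
    token="token",
    category="Noun",
    type_="Common",
    attributes={"Number": "Singular", "Case": "Direct"},
)
print(example.label())
```

Keeping the attributes in a separate mapping reflects the split between level (a) and level (b): a tool that needs only coarse tags can ignore the mapping entirely.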
Bengali (also referred to as Bangla) is a member of the Eastern
Indo-Aryan language group. It is native to the region of Bengal which
consists of Bangladesh, the Indian state of West Bengal, and parts of
the Indian states of Tripura and Assam. It is spoken by more than 210
million people as a first or a second language with around 100 million
speakers in Bangladesh, about 85 million speakers in India, and others
in immigrant communities in the United Kingdom, USA and the Middle East.
This corpus contains 7168 sentences (102933 words) of manually annotated
text from modern standard Bengali sources including blogs, Wikipedia
<http://en.wikipedia.org>, Multikulti <http://www.multikulti.org.uk> and
a portion of the EMILLE/CIIL
<http://www.elda.org/catalogue/en/text/W0037.html> corpus. The annotated
data is structured into two folders, Bangla1 (3684 sentences, 51091
words) and Bangla2 (3484 sentences, 51842 words), which represent the
two stages in which the data was annotated. All annotated data is
provided in both XML and text files. Each data file contains between
3,000 and 5,000 words. The XML file contains metadata about the material,
such as language, encoding and data size.
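The folder figures quoted above can be checked against the stated totals with a few lines of Python. This is only a sanity check on the numbers in this announcement; the variable names are illustrative, and the average sentence length is derived here rather than stated by the corpus authors.

```python
# (sentences, words) per folder, as quoted in the announcement.
folders = {"Bangla1": (3684, 51091), "Bangla2": (3484, 51842)}

total_sentences = sum(sentences for sentences, _ in folders.values())
total_words = sum(words for _, words in folders.values())

# Both totals match the figures given for the whole corpus.
print(total_sentences, total_words)  # 7168 102933

# Derived figure: roughly 14.4 words per sentence on average.
print(round(total_words / total_sentences, 1))
```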
The Annotation Guidelines for Bangla contain a detailed description of
the annotation methodology. The Annotation Tool Guideline 1.0 describes
the annotation interface developed for the IL-POST framework; the tool
is not included in this release.
Non-members may license this data by submitting a completed copy of the
Microsoft Research India License Agreement
<http://www.ldc.upenn.edu/Catalog/mem_agree/Indian_Language_POS_Tagset_Bengali_License_Agreement.html>.
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
ldc at ldc.upenn.edu. This data is available at no charge.
(2) Message Understanding Conference 7 Timed (MUC7_T)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T15>
was developed by researchers at Jena University Language & Information
Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany.
It is a re-annotation of a portion of the MUC7
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T02>
corpus (Linguistic Data Consortium, LDC2001T02), which consists of New
York Times news stories annotated for use in the Message Understanding
Conference 7 (MUC7) evaluation. The series of MUC evaluations in the
1990s focused on emerging information extraction technologies. Further
information about the MUC7 evaluation can be found here
<http://www.itl.nist.gov/iaui/894.02/related_projects/muc>.
MUC7_T consists of 100 articles from the MUC7 corpus training set
reannotated for named entities (persons, locations and organizations)
with a time stamp indicating the time measured for the linguistic
decision making process. The corpus was developed for two principal
purposes: for use in evaluations of selective sampling strategies, such
as Active Learning; and to create predictive models for annotation
costs. The annotation was performed by two advanced students of
linguistics with good English language skills who followed the
original guidelines of the MUC7 named entity task (which can be found in
the online documentation
<http://www.ldc.upenn.edu/Catalog/docs/LDC2001T02/> for the MUC7 corpus).
The data is stored in XML format. There is an element anno_example for
each annotation example that has the original MUC7 document as text
context. The MUC7 document was tokenized using the Stanford Tokenizer
with white spaces marking token boundaries. The tokenizer is part of the
Stanford Parser package which can be obtained from The Stanford Natural
Language Processing Group
<http://nlp.stanford.edu/software/lex-parser.shtml>.
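For readers who want to get started with the data, the following Python sketch shows one way to read whitespace-tokenized text out of anno_example elements using the standard library. Only the element name anno_example comes from the description above; the root element, the flat text layout and the sample sentences are assumptions made for illustration, and the released files may be structured differently.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List

# Hypothetical sample: only the element name `anno_example` comes from the
# announcement; the root element and the flat text layout are assumptions.
SAMPLE = """<corpus>
  <anno_example>The chairman of Acme Corp. visited New York .</anno_example>
  <anno_example>He met Jane Doe on Monday .</anno_example>
</corpus>"""


def iter_tokens(xml_text: str) -> Iterator[List[str]]:
    """Yield the whitespace-separated tokens of each anno_example element."""
    root = ET.fromstring(xml_text)
    for example in root.iter("anno_example"):
        # itertext() also gathers text from nested children, in document order.
        text = "".join(example.itertext())
        yield text.split()  # white space marks token boundaries


for tokens in iter_tokens(SAMPLE):
    print(len(tokens), tokens[:3])
```

Because the text is already tokenized, splitting on white space should be enough to recover the tokens; no further tokenizer needs to be run.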
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
University of Pennsylvania       Fax: 1 (215) 573-2175
3600 Market St., Suite 810       ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA       http://www.ldc.upenn.edu