[Corpora-List] News from LDC

Tue Feb 22 21:24:24 UTC 2011

/In this newsletter:/

- Publications Pipeline for 2011

/New free publications:/

- Indian Language Part-of-Speech Tagset: Sanskrit

- OntoNotes 4.0

------------------------------------------------------------------------

*Publications Pipeline for 2011*

LDC is pleased to provide the following information on our planned
releases for Membership Year 2011 (MY2011). We would also like to remind
our data users that there is still time to save on MY2011 membership
fees, but time is quickly running out! Any organization that joins or
renews membership for 2011 through Tuesday, March 1, 2011, is entitled
to a 5% discount on membership fees. Organizations that held membership
for MY2010 can receive a 10% discount on fees provided they renew prior
to March 1, 2011.

Many publications for MY2011 are still in development, but we plan to 
release updates to some of our popular Gigaword corpora as well as new 
speech corpora.  Please note that the list is tentative and subject to 
modifications.  Our planned publications for this year include:

    /2005 NIST Speaker Recognition Evaluation/ ~ the 2005 data from the
    ongoing series of yearly evaluations conducted by NIST (National
    Institute of Standards and Technology). These evaluations provide an
    important contribution to the direction of research efforts and the
    calibration of technical capabilities. They are intended to be of
    interest to all researchers working on the general problem of
    text-independent speaker recognition.

    /Arabic Gigaword Fifth Edition/ ~ LDC's Arabic newswire collection
    from 2009 and 2010 as well as the contents of Arabic Gigaword Fourth
    Edition (LDC2009T30).  The news sources represented include Agence
    France Presse, An Nahar, Al Hayat, Al-Quds Al-Arabi, Asharq
    Al-Awsat, Assabah, Al-Ahram, Ummah Press and Xinhua News Agency.

    /Chinese Gigaword Fifth Edition/ ~ LDC's Chinese newswire collection
    from 2009 and 2010 as well as the contents of Chinese Gigaword
    Fourth Edition (LDC2009T27).  The news sources represented include
    Agence France Presse, Central News Agency (Taiwan), Xinhua News
    Agency, Zaobao, People's Liberation Army Daily, People's Daily,
    Guangming Daily and China News Service.

    /Digital Archive of Southern Speech/ ~ a geographical sampling of
    colloquial speech in the Southern United States. Samples of speech
    were collected through interviews of single subjects speaking on a
    variety of common topics like family, the weather, household
    articles and activities, agriculture, and social connections.
    Speakers range in age from 15 to 90, with an average age of 61.

    /English Gigaword Fifth Edition/ ~ LDC's English newswire collection
    from 2009 and 2010 as well as the contents of English Gigaword
    Fourth Edition (LDC2009T13).  The news sources represented include
    Agence France Presse, Associated Press, Central News Agency
    (Taiwan), NY Times, Washington Post, Los Angeles Times and Xinhua
    News Agency.

    /MALACH English/ ~ over 300 hours of English audio recordings of
    interviews conducted under the auspices of the USC Shoah Foundation
    Institute for Visual History and Education and associated
    transcripts produced as part of the Multilingual Access to Large
    Spoken ArCHives (MALACH) project.  The data was collected using
    table microphones.  Recordings are 2-channel, 128 kbps, 44.1 kHz mp2
    files, with a different speaker generally predominant in each channel.

2011 Subscription Members are automatically sent all MY2011 data as it
is released.  2011 Standard Members are entitled to request 16 corpora
for free from MY2011.  Non-members may license most data for research use.

*New Free Publications*

(1) Indian Language Part-of-Speech Tagset: Sanskrit
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T04>
was developed by Microsoft Research (MSR) India to support
Part-of-Speech (POS) tagging and other data-driven linguistic research
on Indian languages in general. It was created as part of the Indian
Language Part-of-Speech Tagset (IL-POST)
<http://research.microsoft.com/en-us/groups/mls/default.aspx> project, a
collaborative effort among linguists and computer scientists from MSR
India, AU-KBC (Anna University, Chennai), Delhi University, IIT Bombay,
Jawaharlal Nehru University (Delhi) and Tamil University (Tamilnadu).

The goal of the IL-POST project is to provide a common tagset framework
for Indian languages that offers flexibility, cross-linguistic
compatibility and reusability across those languages. It supports a
three-level hierarchy of Categories, Types and Attributes. The corpus
therefore provides two levels of information for each lexical token:
(a) the lexical Category and Type, and (b) a set of morphological
attributes and their associated values in context.
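
As an illustration of the two levels of information described above, a
token's annotation could be modeled roughly as follows. This is a minimal
sketch only: the class name, the field names and the example labels are
hypothetical placeholders, not the actual IL-POST tagset or the file
format used in the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenAnnotation:
    """Hypothetical container for one annotated lexical token."""

    token: str
    category: str  # level (a): lexical Category
    type_: str     # level (a): Type within that Category
    # level (b): morphological attribute -> value in context
    attributes: Dict[str, str] = field(default_factory=dict)

    def label(self) -> str:
        """Flatten the hierarchy into a single readable label."""
        base = f"{self.category}/{self.type_}"
        if not self.attributes:
            return base
        # Sort attribute names so the label is stable across runs.
        attrs = ".".join(f"{k}={v}" for k, v in sorted(self.attributes.items()))
        return f"{base}[{attrs}]"


# Placeholder values, invented for illustration only.
example = TokenAnnotation(
    token="TOKEN",
    category="CATEGORY",
    type_="TYPE",
    attributes={"attribute_b": "value_2", "attribute_a": "value_1"},
)
print(example.label())
```

Keeping the attributes in a separate mapping reflects the split between
level (a) and level (b): a token with no attributes still gets a
well-formed Category/Type label.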

This corpus contains 3,703 sentences (57,218 words) of manually
annotated Sanskrit text selected from the Panchatantra
<http://en.wikipedia.org/wiki/Panchatantra> stories, a collection of
animal fables in verse and prose dating from the third century BCE. All
annotated data is provided in both XML and text files. The XML files are
contained in the "XML_files" folder and the text files in the
"text_files" folder. Each data file contains between 12,000 and 45,000
words. The XML files contain metadata about the material, such as
language, encoding and data size.

Non-members may license this data by submitting a completed copy of the 
Microsoft Research India License Agreement 
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Sanskrit_License_Agreement.htm>. 
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  This data is available at no charge.

(2) OntoNotes Release 4.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T03>
was developed as part of the OntoNotes project, a collaborative effort
between BBN Technologies, the University of Colorado, the University of
Pennsylvania and the University of Southern California's Information
Sciences Institute. The goal of the project is to annotate a large corpus
comprising various genres of text (news, conversational telephone
speech, weblogs, usenet newsgroups, broadcast, talk shows) in three
languages (English, Chinese, and Arabic) with structural information
(syntax and predicate argument structure) and shallow semantics (word
sense linked to an ontology and coreference).

OntoNotes Release 4.0 contains the content of earlier releases --
OntoNotes Release 1.0 LDC2007T21
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>,
OntoNotes Release 2.0 LDC2008T04
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>
and OntoNotes Release 3.0 LDC2009T24
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T24>
-- and adds newswire, broadcast news, broadcast conversation and web
data in English and Chinese and newswire data in Arabic. This cumulative
publication consists of 2.4 million words as follows: 300k words of
Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese
broadcast news, 150k words of Chinese broadcast conversation and 150k
words of Chinese web text; and 600k words of English newswire, 200k words
of English broadcast news, 200k words of English broadcast conversation
and 300k words of English web text.
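
As a quick check on the breakdown above, the per-genre figures can be
tallied to confirm that they add up to the stated 2.4 million words. The
sketch below simply copies the numbers from this announcement; the
variable names are illustrative and are not part of the OntoNotes
release or its Python API.

```python
# Word counts in thousands, copied from the breakdown above.
WORDS_K = {
    ("Arabic", "newswire"): 300,
    ("Chinese", "newswire"): 250,
    ("Chinese", "broadcast news"): 250,
    ("Chinese", "broadcast conversation"): 150,
    ("Chinese", "web text"): 150,
    ("English", "newswire"): 600,
    ("English", "broadcast news"): 200,
    ("English", "broadcast conversation"): 200,
    ("English", "web text"): 300,
}

# Grand total across all languages and genres.
total_k = sum(WORDS_K.values())

# Subtotal per language.
by_language = {}
for (language, _genre), count in WORDS_K.items():
    by_language[language] = by_language.get(language, 0) + count

print(f"Total: {total_k}k words")
for language, count in sorted(by_language.items()):
    print(f"{language}: {count}k words")
```

The total comes to 2,400k words: 300k Arabic, 800k Chinese and 1,300k
English.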

The OntoNotes project builds on two time-tested resources, following the 
Penn Treebank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42> 
for syntax and the Penn PropBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14> 
for predicate-argument structure. Its semantic representation will 
include word sense disambiguation for nouns and verbs, with each word 
sense connected to an ontology, and coreference.

Documents describing the annotation guidelines and the routines for 
deriving various views of the data from the database are included in the 
documentation directory of this release. The annotation is provided both 
in separate text files for each annotation layer (Treebank, PropBank, 
word sense, etc.) and in the form of an integrated relational database 
with a Python API to provide convenient cross-layer access.

Non-members may request this data by completing a copy of the LDC User 
Agreement for Non-Members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this address.  This data is available at no charge, but is subject to 
non-member shipping and handling fees.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
