[Corpora-List] New from the LDC

Wed May 28 19:44:59 UTC 2008

LDC2008T07
Chinese Proposition Bank 2.0 (CPB2.0)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>

LDC2008L02
Hindi WordNet
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L02>

LDC2008S04
West Point Brazilian Portuguese Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S04>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

------------------------------------------------------------------------

New Publications

(1) Chinese Proposition Bank 2.0 (CPB2.0) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07> 
is a continuation of the Chinese Proposition Bank project 
<http://verbs.colorado.edu/chinese/cpb>, which aims to create a corpus 
of Chinese text annotated with information about basic semantic 
propositions. Chinese Proposition Bank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23> 
consists of predicate-argument annotation on 250,000 words from Chinese 
Treebank 5.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01>. 
Chinese Proposition Bank 2.0 adds predicate-argument annotation on 
500,000 words from Chinese Treebank 6.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36>. 
The data sources include newswire from Xinhua News Agency, articles from 
Sinorama Magazine, news from the website of the Hong Kong Special 
Administrative Region and transcripts from various Chinese broadcast 
news programs.

This release contains the predicate-argument annotation of 81,009 verb 
instances (11,171 unique verbs) and 14,525 noun instances (1,421 unique 
nouns). The annotation of nouns is limited to nominalizations that have 
a corresponding verb. The general annotation guidelines and the lexical 
guidelines (called frame files) for each verbal and nominal predicate 
are included in this release. 


(2) Hindi WordNet 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L02> 
was developed by researchers at the Center for Indian Language 
Technology, Computer Science and Engineering Department, IIT Bombay.  
Wordnets are systems for analyzing the different lexical and semantic 
relations between words. Specifically, a wordnet is a word sense network 
in which words are grouped into semantically equivalent units called 
synsets. Each synset represents a lexical concept, and synsets are 
linked to each other by semantic relations (between synsets) and lexical 
relations (between words). Similar in design to the Princeton WordNet 
<http://wordnet.princeton.edu/> for English, Hindi WordNet incorporates 
additional features to capture the complexities of Hindi. This release 
of Hindi WordNet consists of 56,928 unique words and 26,208 synsets.

Additional information about the development of Hindi WordNet is 
available at the Hindi WordNet web site 
<http://www.cfilt.iitb.ac.in/wordnet/webhwn/>.

Hindi WordNet contains nouns, verbs, adjectives and adverbs. Each entry 
consists of the following elements:

1.      Synset: a set of synonymous words. The words in the synset are 
arranged according to the frequency of usage.

2.      Gloss: a description of the concept. It consists of two parts:

/Text definition/: explains the concept denoted by the synset. 

/Example sentence/: illustrates the usage of the words in a sentence.

3.      Position in Ontology: An ontology is a hierarchical organization 
of concepts, or more specifically, a categorization of entities and 
actions. A separate ontological hierarchy exists for each syntactic 
category (noun, verb, adjective, adverb). Each synset is mapped to a 
position in the ontology.
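
The three elements above can be pictured as a small record type. The following is only an illustrative Python sketch, not the Java API shipped with the release: the class names, field names, transliterated sample words, definition text and ontology path are all hypothetical and were chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gloss:
    """The two parts of a gloss described above."""
    definition: str  # text definition of the concept
    example: str     # example sentence showing usage


@dataclass(frozen=True)
class Entry:
    """Hypothetical model of one wordnet entry (not the release's API)."""
    category: str                   # noun, verb, adjective or adverb
    synset: Tuple[str, ...]         # synonyms, most frequently used first
    gloss: Gloss
    ontology_path: Tuple[str, ...]  # position in the category's hierarchy

    def head_word(self) -> str:
        # The synset is ordered by frequency of usage, so the first
        # member is the most common word for the concept.
        return self.synset[0]


def find_entries(entries: List[Entry], word: str) -> List[Entry]:
    """Return every entry whose synset contains the given word."""
    return [entry for entry in entries if word in entry.synset]


# Illustrative data only; not taken from the Hindi WordNet release.
sample = Entry(
    category="noun",
    synset=("paanii", "jal", "niir"),
    gloss=Gloss(
        definition="placeholder definition of the concept 'water'",
        example="placeholder example sentence",
    ),
    ontology_path=("Noun", "Inanimate", "Natural Object"),
)

print(sample.head_word())
print(len(find_entries([sample], "jal")))
```

Storing the synset as an ordered tuple keeps the frequency ranking that the entry description calls for, while the ontology path records where the synset sits in its category's hierarchy.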

This release of Hindi WordNet is made available as a complete Java 
application along with an API to facilitate further development. 


(3) West Point Brazilian Portuguese Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S04> 
is a database of digital recordings of spoken Brazilian Portuguese 
designed and collected by staff and faculty of the Department of Foreign 
Languages (DFL) and Center for Technology Enhanced Language Learning 
(CTELL) to develop acoustic models for speech recognition systems. The 
U.S. government uses such systems to provide speech-recognition-enhanced 
language learning courseware to government linguists and students 
enrolled in various government language programs.

The data in this corpus was collected in March 1999 in Brasilia, Brazil, 
using informants from a Brazilian military academy. The corpus consists 
of read speech from 60 female and 68 male native and non-native 
speakers.  The speech was elicited from a prompt script containing 296 
sentences and phrases typically used in language learning situations.

The speech was collected using four laptop computers running MS Windows. 
Three of the computers recorded 16-bit samples at a sampling rate of 
22050 Hz; the fourth recorded 8-bit samples at a sampling rate of 
11025 Hz. The recording script presented a visual display of the 
sentence to be recorded. The informant pressed a key and spoke the 
sentence. The recording was then played back for review, and the 
utterance could be re-recorded if necessary.
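
To give a rough sense of what the two recording configurations mean for file sizes, the sketch below computes the raw data rate for each. It assumes uncompressed, single-channel (mono) audio, which the announcement does not state; the function name is made up for this example.

```python
def bytes_per_second(sample_rate_hz: int, bits_per_sample: int,
                     channels: int = 1) -> int:
    """Raw data rate of uncompressed audio.

    channels defaults to 1 (mono); this is an assumption, since the
    announcement does not say how many channels were recorded.
    """
    return sample_rate_hz * (bits_per_sample // 8) * channels


# The two configurations described above.
rate_16bit = bytes_per_second(22050, 16)  # three of the laptops
rate_8bit = bytes_per_second(11025, 8)    # the fourth laptop

print(rate_16bit)               # 44100 bytes per second
print(rate_8bit)                # 11025 bytes per second
print(rate_16bit // rate_8bit)  # 4
```

Under these assumptions, recordings from the 16-bit laptops take four times as much storage per second of speech as those from the 8-bit laptop.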

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora