[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed May 28 19:44:59 UTC 2008
LDC2008T07
Chinese Proposition Bank 2.0 (CPB2.0)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>

LDC2008L02
Hindi WordNet
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L02>

LDC2008S04
West Point Brazilian Portuguese Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S04>
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
------------------------------------------------------------------------
New Publications
(1) Chinese Proposition Bank 2.0 (CPB2.0)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>
is a continuation of the Chinese Proposition Bank project
<http://verbs.colorado.edu/chinese/cpb>, which aims to create a corpus
of Chinese text annotated with information about basic semantic
propositions. Chinese Proposition Bank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>
consists of predicate-argument annotation on 250,000 words from Chinese
Treebank 5.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01>.
Chinese Proposition Bank 2.0 adds predicate-argument annotation on
500,000 words from Chinese Treebank 6.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36>.
The data sources include newswire from Xinhua News Agency, articles from
Sinorama Magazine, news from the website of the Hong Kong Special
Administrative Region and transcripts from various Chinese broadcast
news programs.
This release contains the predicate-argument annotation of 81,009 verb
instances (11,171 unique verbs) and 14,525 noun instances (1,421 unique
nouns). The annotation of nouns is limited to nominalizations that have
a corresponding verb. The general annotation guidelines and the lexical
guidelines (called frame files) for each verbal and nominal predicate
are included in this release.
***
(2) Hindi WordNet
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L02>
was developed by researchers at the Center for Indian Language
Technology, Computer Science and Engineering Department, IIT Bombay.
Wordnets are systems for analyzing the different lexical and semantic
relations between words. Specifically, a wordnet is a word sense network
in which words are grouped into semantically equivalent units called
synsets. Each synset represents a lexical concept, and synsets are
linked to each other by semantic relations (between synsets) and lexical
relations (between words). Similar in design to the Princeton WordNet
<http://wordnet.princeton.edu/> for English, Hindi WordNet incorporates
additional features to capture the complexities of Hindi. This release
of Hindi WordNet consists of 56,928 unique words and 26,208 synsets.
Additional information about the development of Hindi WordNet is
available at the Hindi WordNet
<http://www.cfilt.iitb.ac.in/wordnet/webhwn/> web site.
Hindi WordNet contains nouns, verbs, adjectives and adverbs. Each entry
consists of the following elements:
1. Synset: a set of synonymous words. The words in the synset are
arranged according to their frequency of usage.
2. Gloss: a description of the concept. It consists of two parts:
/Text definition/: explains the concept denoted by the synset.
/Example sentence/: illustrates the usage of the words in a sentence.
3. Position in Ontology: An ontology is a hierarchical organization
of concepts, or more specifically, a categorization of entities and
actions. A separate ontological hierarchy exists for each syntactic
category (noun, verb, adjective, adverb). Each synset is mapped to a
place in the ontology.
This release of Hindi WordNet is made available as a complete Java
application along with an API to facilitate further development.
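
The entry structure described above (synset, gloss, position in ontology) can be pictured as a small data structure. The following is a minimal illustrative sketch in Python; it is not the Java API shipped with the release, and every class name, field name and value in it is a hypothetical placeholder rather than anything taken from the Hindi WordNet data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gloss:
    """The concept behind a synset: a definition plus an example sentence."""
    text_definition: str
    example_sentence: str


@dataclass
class SynsetEntry:
    """One wordnet entry with the three elements listed in the announcement."""
    words: List[str]   # synonymous words, most frequently used first
    gloss: Gloss
    category: str      # "noun", "verb", "adjective" or "adverb"
    # Path from the root of the category's ontology to this synset's node
    # (hypothetical representation; the real format may differ).
    ontology_path: List[str] = field(default_factory=list)

    def head_word(self) -> str:
        """Return the most frequently used word of the synset."""
        return self.words[0]


# Placeholder values only; they are not taken from the Hindi WordNet data.
entry = SynsetEntry(
    words=["word_a", "word_b"],
    gloss=Gloss("placeholder definition", "placeholder example sentence"),
    category="noun",
    ontology_path=["Noun", "Animate"],
)
print(entry.head_word())
```

Keeping the words in an ordered list mirrors the frequency ordering described for synsets, so the first element can stand in for the whole synset when a single label is needed.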
***
(3) West Point Brazilian Portuguese Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S04>
is a database of digital recordings of spoken Brazilian Portuguese
designed and collected by staff and faculty of the Department of Foreign
Languages (DFL) and Center for Technology Enhanced Language Learning
(CTELL) to develop acoustic models for speech recognition systems. The
U.S. government uses such systems to provide speech-recognition-enhanced
language learning courseware to government linguists and students
enrolled in various government language programs.
The data in this corpus was collected in March 1999 in Brasilia, Brazil,
using informants from a Brazilian military academy. The corpus consists
of read speech from 60 female and 68 male native and non-native
speakers. The speech was elicited from a prompt script containing 296
sentences and phrases typically used in language learning situations.
The speech was collected using four laptop computers running MS Windows.
Three of the computers recorded with a 16-bit data size at a sampling
rate of 22050 Hz; the fourth recorded with an 8-bit data size at a
sampling rate of 11025 Hz. The recording script presented a visual
display of the sentence to be recorded. The informant pressed a key and
spoke the sentence. The recording was played back for review, allowing
the utterance to be re-recorded.
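
To give a sense of what the two recording configurations mean in practice, the sketch below computes the raw data rate for each. It assumes single-channel, uncompressed PCM audio, which the announcement does not state; the function name is illustrative only.

```python
def bytes_per_second(bits_per_sample: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Uncompressed PCM data rate in bytes per second.

    Assumes whole-byte sample sizes and, by default, a single channel
    (an assumption; the channel count is not given in the announcement).
    """
    return (bits_per_sample // 8) * sample_rate_hz * channels


high = bytes_per_second(16, 22050)  # the three 16-bit laptops
low = bytes_per_second(8, 11025)    # the one 8-bit laptop
print(high, low, high // low)
```

Under these assumptions the 16-bit recordings carry four times as much data per second as the 8-bit ones, which matters when mixing both sets for acoustic model training.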
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu