[Corpora-List] LDC News
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Oct 5 20:21:30 UTC 2005
* Free Talkbank Corpora Still Available!*
LDC2005T33
*BBN Pronoun Coreference and Entity Type Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>*
LDC2005T23
*Chinese Proposition Bank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>*
LDC2005S25
*Santa Barbara Corpus of Spoken American English Part-IV
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>*
The Linguistic Data Consortium would like to announce the availability
of free Talkbank data and of three new corpora.
------------------------------------------------------------------------
TalkBank <http://www.talkbank.org/> is an interdisciplinary research
project funded by a five-year NSF grant to foster research and
development in communicative behavior by providing tools and standards
for analysis and distribution of language data. The LDC distributes the
following Talkbank corpora:
LDC2003V01 FORM2 Kinematic Gesture
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01>
- gesture annotation scheme designed to capture the kinematic
information in gesture from videos of speakers
LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01>
- spoken lexicon with 5000+ sound files
LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02>
- tone paradigms along with phonetic and tonological transcriptions
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16>
- tone paradigms along with phonetic and tonological transcriptions
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01>
- for morphological analysis and generation applications
LDC2004T03 Morphologically Annotated Korean Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03>
- annotated morphological analysis and part-of-speech tags
LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15>
- 8 interviews conducted by William Labov, plus transcripts, variable
survey and annotation tools
LDC2003S06 Santa Barbara Corpus of Spoken American English Part-II
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06>
- recordings of natural speech from all over the U.S.
LDC2004S10 Santa Barbara Corpus of Spoken American English III
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10>
- recordings of natural speech from all over the U.S.
LDC2005S25 Santa Barbara Corpus of Spoken American English Part-IV
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>
- over 5 hours of recordings of natural speech from all over the U.S.
LDC2004S12 Talkbank Ethology Data: Field Recordings of Vervet Monkey
Calls
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12>
- 60 recordings with corresponding annotations
Grant-sponsored copies for all of the above corpora are still
available. Shipping and handling charges apply. Please contact the LDC
to learn if your organization is eligible to receive a free copy.
*
BBN Pronoun Coreference and Entity Type Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>
supplements the 1 million word Penn Treebank corpus of Wall Street
Journal texts (LDC95T7). The corpus contains stand-off annotation of
pronoun coreference, indicated by sentence and token numbers, as well as
annotation of a variety of entity and numeric types. All annotation was
done by hand at BBN using proprietary annotation tools. This corpus was
developed by BBN to support the ACE and AQUAINT programs.
The corpus contains two components:
* Pronoun coreference. Stand-off annotation of pronoun coreference
  of the WSJ corpus is provided in a single file. Pronouns and
  antecedents are indexed by sentence and token numbers.
* Entity types. The corpus includes annotation of 12 named entity
  types (Person, Facility, Organization, GPE, Location, Nationality,
  Product, Event, Work of Art, Law, Language, and Contact-Info),
  nine nominal entity types (Person, Facility, Organization, GPE,
  Product, Plant, Animal, Substance, Disease and Game), and seven
  numeric types (Date, Time, Percent, Money, Quantity, Ordinal and
  Cardinal). Several of these types are further divided into
  subtypes. Annotation for a total of 64 subtypes is provided.
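The stand-off indexing described for the pronoun coreference component can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record layout, the names TokenRef, CorefLink and resolve, the zero-based indices, the single-token antecedent, and the toy sentences. None of it reflects the corpus's actual file format.

```python
from typing import List, NamedTuple


class TokenRef(NamedTuple):
    """Hypothetical pointer into tokenized text (indices assumed zero-based)."""
    sentence: int  # index of the sentence in the document
    token: int     # index of the token within that sentence


class CorefLink(NamedTuple):
    """Hypothetical stand-off record linking a pronoun to its antecedent."""
    pronoun: TokenRef
    antecedent: TokenRef


def resolve(sentences: List[List[str]], ref: TokenRef) -> str:
    """Look up the token text that a stand-off reference points at."""
    return sentences[ref.sentence][ref.token]


# Toy tokenized text, invented for this example (not taken from the corpus).
sentences = [
    ["The", "company", "reported", "earnings", "."],
    ["It", "beat", "forecasts", "."],
]

# The annotation lives apart from the text and refers to it only by position.
link = CorefLink(pronoun=TokenRef(1, 0), antecedent=TokenRef(0, 1))

pronoun_text = resolve(sentences, link.pronoun)
antecedent_text = resolve(sentences, link.antecedent)
print(f"{pronoun_text} -> {antecedent_text}")
```

Because the links carry only positions, the underlying text stays unmodified, and the annotation is usable only alongside the same sentence and token segmentation it was indexed against.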
*
Chinese Proposition Bank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>
is the first public release of the Penn Chinese Proposition Bank
project, which aims to create a corpus of text annotated with
information about basic semantic propositions. Specifically,
predicate-argument relations have been added to the syntactic trees of
Chinese Treebank 5.1 as an additional layer of annotation.
Chinese Proposition Bank 1.0 includes annotations of the first 250K
words of the Chinese TreeBank 5.1. There are a total of 37,183
propositions. Auxiliary verbs are not annotated. Some verbs have both
light-verb and non-light-verb uses; in these cases only the
non-light-verb uses are annotated. All the annotations in this release
are the result of double-blind annotation followed by adjudication of
differences.
*
Santa Barbara Corpus of Spoken American English Part-IV
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>
is based on hundreds of recordings of natural speech from all over the
United States, representing a wide variety of people of different
regional origins, ages, occupations, and ethnic and social backgrounds.
It reflects many ways that people use language in their lives:
conversation, gossip, arguments, on-the-job talk, card games, city
council meetings, sales pitches, classroom lectures, political speeches,
bedtime stories, sermons, weddings, and more. The corpus was collected
by the University of California, Santa Barbara Center for the Study of
Discourse.
The audio data consists of 14 wave format speech files, recorded in
two-channel PCM at 22,050 Hz. The speech files total 5.75 hours of audio
(1.5 GB), representing over 58,000 words and over 6,000 unique words in
the transcribed text.
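For readers who want to confirm the channel count and sample rate of a wave file before processing it, a minimal sketch using Python's standard wave module follows. The helper name describe_wav and the file demo.wav are hypothetical, and the file is a synthetic one-second stand-in written by the script itself, not corpus data. The 16-bit sample width is an assumption; the announcement does not state it.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read basic format fields from a WAV header using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


# Write a synthetic stand-in file: two channels at 22050 Hz, as described
# above. The 2-byte (16-bit) sample width is an assumption.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(22050)
    # One second of silence: 22050 frames x 2 channels x 2 bytes per sample.
    out.writeframes(b"\x00" * (22050 * 2 * 2))

info = describe_wav(demo_path)
print(info)
```

Pointing describe_wav at a real speech file instead of the stand-in would report that file's actual header values.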
The cost of the first 100 copies of this publication (not counting the
copies distributed to LDC members) is covered by NSF Grant Number
BCS-998009, so these copies are free of charge to qualified researchers; a $30
shipping and handling fee applies. After these first 100 copies are
distributed, additional copies will be available for the production cost
of $200 per DVD-ROM.
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
1275.
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax: (215) 573-2175
Suite 810                       ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu