[Corpora-List] LDC News

Wed Oct 5 20:21:30 UTC 2005

*Free TalkBank Corpora Still Available!*

LDC2005T33
*BBN Pronoun Coreference and Entity Type Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>*

LDC2005T23
*Chinese Proposition Bank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>*

LDC2005S25
*Santa Barbara Corpus of Spoken American English Part-IV 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>*

The Linguistic Data Consortium would like to announce the availability 
of free TalkBank data and three new corpora.

------------------------------------------------------------------------

TalkBank <http://www.talkbank.org/> is an interdisciplinary research 
project funded by a five-year NSF grant to foster research and 
development in communicative behavior by providing tools and standards 
for analysis and distribution of language data.  The LDC distributes the 
following TalkBank corpora:

  LDC2003V01  FORM2 Kinematic Gesture 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01>  
-  gesture annotation scheme designed to capture the kinematic 
information in gesture from videos of speakers

  LDC2003L01  Grassfields Bantu Fieldwork: Dschang Lexicon 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01>  
- spoken lexicon with 5000+ sound files

  LDC2003S02  Grassfields Bantu Fieldwork: Dschang Tone Paradigms 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02>  
- tone paradigms along with phonetic and tonological transcriptions

  LDC2001S16  Grassfields Bantu Fieldwork: Ngomba Tone Paradigms 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16>  
- tone paradigms along with phonetic and tonological transcriptions

  LDC2004L01  Klex: Finite-State Lexical Transducer for Korean 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01>  
- for morphological analysis and generation applications

  LDC2004T03  Morphologically Annotated Korean Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03>  
- annotated morphological analysis and part-of-speech tags

  LDC2003T15  SLX Corpus of Classic Sociolinguistic Interviews 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15> 
- 8 interviews conducted by William Labov, plus transcripts, variable 
survey and annotation tools

  LDC2003S06  Santa Barbara Corpus of Spoken American English Part-II 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06>  
- recordings of natural speech from all over the U.S.

  LDC2004S10  Santa Barbara Corpus of Spoken American English III 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10> 
- recordings of natural speech from all over the U.S.

  LDC2005S25  Santa Barbara Corpus of Spoken American English Part-IV 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25> 
- over 5 hours of recordings of natural speech from all over the U.S.

  LDC2004S12  Talkbank Ethology Data: Field Recordings of Vervet Monkey 
Calls 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12> 
- 60 recordings with corresponding annotations

Grant-sponsored copies of all of the above corpora are still 
available.  Shipping and handling charges apply.  Please contact the LDC 
to learn if your organization is eligible to receive a free copy.


BBN Pronoun Coreference and Entity Type Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33> 
supplements the 1 million word Penn Treebank corpus of Wall Street 
Journal texts (LDC95T7). The corpus contains stand-off annotation of 
pronoun coreference, indicated by sentence and token numbers, as well as 
annotation of a variety of entity and numeric types. All annotation was 
done by hand at BBN using proprietary annotation tools. This corpus was 
developed by BBN to support the ACE and AQUAINT programs.

The corpus contains two components:

    * Pronoun coreference. Stand-off annotation of pronoun coreference
      of the WSJ corpus is provided in a single file. Pronouns and
      antecedents are indexed by sentence and token numbers.

    * Entity types. The corpus includes annotation of 12 named entity
      types (Person, Facility, Organization, GPE, Location, Nationality,
      Product, Event, Work of Art, Law, Language, and Contact-Info),
      nine nominal entity types (Person, Facility, Organization, GPE,
      Product, Plant, Animal, Substance, Disease and Game), and seven
      numeric types (Date, Time, Percent, Money, Quantity, Ordinal and
      Cardinal). Several of these types are further divided into
      subtypes. Annotation for a total of 64 subtypes is provided.
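
To illustrate what stand-off annotation indexed by sentence and token
numbers means in practice, here is a minimal Python sketch that resolves
pronoun and antecedent positions against separately stored tokenized
text. The record layout, the names Span and CorefLink, the 0-based
indexing, and the sample sentences are all assumptions made for
illustration; they do not reproduce the corpus's actual file format.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical token span: sentence index plus inclusive token range."""
    sentence: int
    start: int
    end: int


class CorefLink(NamedTuple):
    """Hypothetical stand-off record linking a pronoun to its antecedent."""
    pronoun: Span
    antecedent: Span


def span_text(sentences: List[List[str]], span: Span) -> str:
    """Look up the tokens a span points at and join them into a string."""
    tokens = sentences[span.sentence]
    return " ".join(tokens[span.start:span.end + 1])


def resolve(sentences: List[List[str]],
            links: List[CorefLink]) -> List[Tuple[str, str]]:
    """Turn index-only links into (pronoun text, antecedent text) pairs."""
    return [(span_text(sentences, link.pronoun),
             span_text(sentences, link.antecedent))
            for link in links]


# Invented sample text; the annotation below never copies the words,
# it only refers to them by position (assumed 0-based here).
SENTENCES = [
    ["The", "company", "reported", "higher", "profits", "."],
    ["It", "also", "raised", "its", "dividend", "."],
]

LINKS = [
    CorefLink(pronoun=Span(1, 0, 0), antecedent=Span(0, 0, 1)),
    CorefLink(pronoun=Span(1, 3, 3), antecedent=Span(0, 0, 1)),
]

if __name__ == "__main__":
    for pronoun, antecedent in resolve(SENTENCES, LINKS):
        print(f"{pronoun} -> {antecedent}")
```

Because the links carry only positions, the underlying text stays
untouched and further annotation layers can point at the same tokens.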


Chinese Proposition Bank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23> 
is the first public release of the Penn Chinese Proposition Bank 
project, which aims to create a corpus of text annotated with 
information about basic semantic propositions. Specifically, 
predicate-argument relations have been added to the syntactic trees of 
Chinese Treebank 5.1 as an additional layer of annotation.

Chinese Proposition Bank 1.0 includes annotations of the first 250K 
words of the Chinese Treebank 5.1.  There are a total of 37,183 
propositions. Auxiliary verbs are not annotated. Some verbs have both 
light-verb and non-light-verb uses; in these cases only the 
non-light-verb uses are annotated. All the annotations in this release 
are the result of double-blind annotation followed by adjudication of 
differences. 


Santa Barbara Corpus of Spoken American English Part-IV 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25> 
is based on hundreds of recordings of natural speech from all over the 
United States, representing a wide variety of people of different 
regional origins, ages, occupations, and ethnic and social backgrounds. 
It reflects many ways that people use language in their lives: 
conversation, gossip, arguments, on-the-job talk, card games, city 
council meetings, sales pitches, classroom lectures, political speeches, 
bedtime stories, sermons, weddings, and more.  The corpus was collected 
by the University of California, Santa Barbara Center for the Study of 
Discourse.

The audio data consists of 14 WAV-format speech files, recorded in 
two-channel PCM at 22050 Hz. The speech files total 5.75 hours of audio 
(1.5 GB), representing over 58,000 words and over 6,000 unique words in 
the transcribed text. 

The cost of the first 100 copies of this publication (not counting the 
copies distributed to LDC members) is covered by NSF Grant Number 
BCS-998009, so these copies are free of charge to qualified researchers; 
a $30 shipping and handling fee applies. After these first 100 copies are 
distributed, additional copies will be available for the production cost 
of $200 per DVD-ROM.

------------------------------------------------------------------------

If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573 
1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
