[Corpora-List] New from the LDC

Tue Nov 25 22:51:49 UTC 2008

*LDC Spoken Language Sampler Available for Free Download* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>

LDC2008S09
*-  CHAracterizing INdividual Speakers (CHAINS) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09>  -*

LDC2008T20
*-  PennBioIE CYP 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20>  -*

LDC2008T21
*-  PennBioIE Oncology 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21>  -*

The Linguistic Data Consortium (LDC) would like to announce the
availability of a free spoken language sampler as well as the release of
three new publications.

------------------------------------------------------------------------

*LDC Spoken Language Sampler Available for Free Download*

The LDC Spoken Language Sampler 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08> 
provides a range of speech, transcript and lexicon samples and is
designed to illustrate the variety and breadth of the resources
available from LDC's Catalog.  Created for distribution at NWAV 37 and
geared towards sociolinguists, the sampler is a good introduction to
data available from the LDC. The sampler includes excerpts from
telephone conversations in Arabic (Gulf, Iraqi, and Levantine dialects),
Farsi, Japanese, Korean, Spanish, and Tamil; dictionary resources for
Mawukakan and Tamil; transcribed meeting speech; utterances in Russian 
from native and non-native speakers; and speech samples which represent 
regional accents and dialects of the United States.  Audio samples range 
from 30 seconds to 90 seconds and are accompanied by transcripts.

The sampler can be downloaded for free from the catalog page for the LDC 
Spoken Language Sampler 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>.  
Please scroll down to 'How to Obtain' for a download link.

*New Publications*

(1) CHAracterizing INdividual Speakers (CHAINS) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09> 
contains recordings of thirty-six English speakers reading fables and 
selected sentences in different speaking styles. The data was obtained 
in two different sessions with a time separation of about two months. 
The goal of the corpus is to provide a range of speaking styles and
voice modifications for speakers sharing the same accent. Other existing
corpora, in particular CSLU Speaker Recognition Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>, 
TIMIT 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> 
and the IViE corpus <http://www.phon.ox.ac.uk/IViE/> (English Intonation 
in the British Isles), served as referents in the selection of material. 
This design decision was made to ensure that methods designed and 
evaluated on the CHAINS corpus might be directly testable on these other 
corpora, which were recorded using quite different dialects and channel 
characteristics.

The data was collected in two recording sessions in a total of six 
different speaking styles:

    * solo reading
    * synchronous reading
    * spontaneous speech ("retell")
    * repetitive synchronous imitation ("rsi")
    * whispered fast reading
    * fast speech reading

In two of the speaking conditions adopted, speakers modified their
speech in a constrained fashion towards a known target; in the
synchronous condition, the speech of the co-speaker served as a target,
while in rsi, there was an explicit known static target. The presence of
a known target which speakers aim to copy raises the bar in the
discovery and design of procedures for automatic speaker identification,
as the target speech provides a potentially highly confusing foil. The
whisper and fast speech conditions are also well defined speaking
styles which require substantial voice modification by the speaker.

***

(2) The PennBioIE CYP 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20> 
corpus consists of 1100 PubMed 
<http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on the 
inhibition of cytochrome P450 enzymes.  The abstracts comprise
approximately 313,000 total words of text. Each file has been tokenized 
and its biomedical portions (274,000 total words) exhaustively annotated 
for paragraph, sentence, and part of speech, and non-exhaustively 
annotated for 5 types of biomedical named entity in three categories of 
interest. 324 of the abstracts have also been syntactically annotated.

Annotation at all layers except entity is based on the Penn Treebank II 
guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a 
number of modifications that have been found necessary, many of which 
were subsequently adopted by the Penn Treebank. Entity definitions came 
originally from domain experts and were developed and refined in 
dialogue with the annotators. All annotation is standoff: the source 
text is never modified, annotations being made in a separate file.  
Paragraph, sentence, tokenization, POS, and syntactic annotation 
(treebanking) are applied by automatic taggers and manually corrected; 
entity annotation is manual.
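
As a rough illustration of what standoff annotation means in practice,
the sketch below keeps the source text untouched and stores each
annotation as a pair of character offsets plus a label. This is not the
PennBioIE file format, which is not described here; the Span class, the
layer names, the entity label, and the example sentence are all
assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: offsets into the source text plus a label."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # hypothetical layer name, e.g. "pos" or "entity"
    label: str   # hypothetical label; not taken from the corpus guidelines


def resolve(source: str, span: Span) -> str:
    """Return the stretch of source text that an annotation points at."""
    return source[span.start:span.end]


# Illustrative sentence, not drawn from the corpus.
source = "Ketoconazole inhibits CYP3A4."

# The annotations live apart from the text and never alter it.
annotations = [
    Span(0, 12, "pos", "NN"),
    Span(13, 21, "pos", "VBZ"),
    Span(22, 28, "entity", "hypothetical-enzyme-label"),
]

for span in annotations:
    print(span.layer, span.label, repr(resolve(source, span)))
```

Because every layer only refers to offsets, several layers (part of
speech, entities, syntax) can be added or corrected independently
without touching the source file.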

***

(3) The PennBioIE Oncology
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21> 
corpus consists of 1414 PubMed 
<http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on cancer, 
concentrating on molecular genetics.  The abstracts comprise 
approximately 381,000 total words of text. Each file has been tokenized 
and its biomedical portions (327,000 total words) exhaustively annotated 
for paragraph, sentence, and part of speech, and non-exhaustively 
annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity. 
318 of the abstracts have also been syntactically annotated.

Annotation at all layers except entity is based on the Penn Treebank II 
guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a 
number of modifications that have been found necessary, many of which 
were subsequently adopted by the Penn Treebank. Entity definitions came 
originally from domain experts and were developed and refined in 
dialogue with the annotators. All annotation is standoff: the source 
text is never modified, annotations being made in a separate file.  
Paragraph, sentence, tokenization, POS, and syntactic annotation 
(treebanking) are applied by automatic taggers and manually corrected; 
entity annotation is manual.

The oncology data comprises two subcorpora:

    * The Sanger subcorpus /(san)/ consists of abstracts of 577 articles
      previously annotated by the Sanger Institute for global mention of
      oncological named entities. These annotations were metadata
      reflecting the presence or absence of such mentions anywhere in
      the text. The articles concentrate on variations in a small set of
      human genes associated with many different types of cancer. We did
      not refer to the Sanger annotations after selection of the abstracts.
    * The neuroblastoma subcorpus /(nb)/ consists of 837 abstracts of
      articles dealing with this particular type of cancer selected by
      colleagues at Children's Hospital of Philadelphia. They do not all
      concentrate on genetics, but they mention a much larger number of
      genes than the Sanger files do.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora