<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center"><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08"><b>LDC

Spoken Language Sampler Available for Free Download</b></a><br>

<br>

LDC2008S09<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09">CHAracterizing

INdividual Speakers (CHAINS)</a>  -</b><br>

<br>

LDC2008T20<br>

<b>-  </b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20"><b>PennBioIE

CYP 1.0</b></a><b>  -</b><br>

<br>

LDC2008T21<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21">PennBioIE

Oncology 1.0</a></b><b>  -<br>

<br>

</b><b>The Linguistic Data Consortium (LDC) would like to announce the

availability of a free spoken language sampler as well as the release

of three new publications.</b><br>

</div>

<br>

<b><br>

</b>

<hr size="2" width="100%"><br>

<b><br>

</b>

<div align="center"><b>LDC Spoken Language Sampler Available for Free

Download</b><br>

</div>

<br>

The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08">LDC

Spoken Language Sampler</a> provides a variety of speech, transcript

and lexicon samples and is designed to illustrate the range and

breadth of the resources available from LDC’s Catalog.  Created for

distribution at NWAV 37 and geared towards sociolinguists, the sampler

is a good introduction to data available from the LDC. The sampler

includes excerpts from telephone conversations in Arabic

(Gulf, Iraqi, and Levantine dialects), Farsi, Japanese, Korean, Spanish,

and Tamil;

dictionary resources for Mawukakan and Tamil; transcribed meeting

speech; utterances in Russian from native and non-native speakers; and

speech samples which represent regional accents and dialects of the

United States.  Audio samples range from 30 to 90 seconds and

are accompanied by transcripts.<br>

<br>

The sampler can be downloaded for free from the catalog page for the <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08">LDC

Spoken Language Sampler</a>.  Please scroll down to 'How to Obtain' for

a download link.<br>

<br>

<b><br>

</b>

<div align="center"><b>New Publications</b><br>

</div>

<b><br>

</b>

<p>(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09">CHAracterizing

INdividual Speakers (CHAINS)</a> contains recordings

of thirty-six English speakers reading fables and selected sentences in

different speaking styles. The data was obtained in two different

sessions with a time separation of about two months. The goal of the

corpus is to provide a range of speaking styles and voice

modifications for speakers sharing the same accent. Other existing

corpora, in particular <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26">CSLU

Speaker Recognition Version 1.1</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT</a>

and <a href="http://www.phon.ox.ac.uk/IViE/">the IViE corpus</a>

(English Intonation in the British Isles), served as referents in the

selection of material. This design decision was made to ensure that

methods designed and evaluated on the CHAINS corpus might be directly

testable on these other corpora, which were recorded using quite

different dialects and channel characteristics. </p>

<p>The data was collected in two recording sessions in a total of six

different speaking styles: </p>

<ul>

  <li>solo reading </li>

  <li>synchronous reading </li>

  <li>spontaneous speech ("retell") </li>

  <li>repetitive synchronous imitation ("rsi") </li>

  <li>whispered reading </li>

  <li>fast speech reading </li>

</ul>

<p>In two of the speaking conditions adopted, speakers modified their

speech in a constrained fashion towards a known target; in the

synchronous condition, the speech of the co-speaker served as a target,

while in rsi, there was an explicit known static target. The presence

of a known target which speakers aim to copy raises the bar in the

discovery and design of procedures for automatic speaker identification,

as the target speech provides a potentially highly confusing foil. The

whisper and fast speech conditions are also well-defined speaking

styles which require substantial voice modification by the speaker.</p>

<div align="center"><b>*</b><br>

</div>

<p>(2) The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20">PennBioIE

CYP</a> corpus consists of 1100 <a

 href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi">PubMed</a>

abstracts on the inhibition of cytochrome P450 enzymes.  The  abstracts

comprise approximately 313,000 total words of text. Each file has been

tokenized and its biomedical portions (274,000 total words)

exhaustively annotated for paragraph, sentence, and part of speech, and

non-exhaustively annotated for 5 types of biomedical named entity in

three categories of interest. 324 of the abstracts have also been

syntactically annotated. </p>

<p>Annotation at all layers except entity is based on the <a

 href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/">Penn Treebank

II guidelines</a>, with a number of modifications that have been found

necessary, many of which were subsequently adopted by the Penn

Treebank. Entity definitions came originally from domain experts and

were developed and refined in dialogue with the annotators. All

annotation is standoff: the source text is never modified, annotations

being made in a separate file.  Paragraph, sentence, tokenization, POS,

and syntactic annotation (treebanking) are applied by automatic taggers

and manually corrected; entity annotation is manual.<br>

<br>

</p>
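<p>As a rough illustration of the standoff idea described above, the
following Python sketch keeps the source text untouched and stores
annotations separately as character offsets. It is not the PennBioIE
file format: the class <code>Span</code>, the functions
<code>covered_text</code> and <code>render</code>, the layer and label
names, and the example sentence are all hypothetical and chosen only
for illustration.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: offsets into the source plus a label.

    All field names here are hypothetical, not the corpus format.
    """
    start: int   # character offset of the first character (inclusive)
    end: int     # character offset one past the last character (exclusive)
    layer: str   # illustrative layer name, e.g. "pos" or "entity"
    label: str   # illustrative label, e.g. a POS tag or an entity type


def covered_text(source, span):
    """Return the stretch of source text that a span points at."""
    return source[span.start:span.end]


def render(source, annotations):
    """Pair each annotation with its covered text; the source is only read."""
    return [(covered_text(source, s), s.layer, s.label) for s in annotations]


# Made-up example sentence (not taken from the corpus).
SOURCE = "Ketoconazole inhibits CYP3A4."

# Annotations live apart from the text, as they would in a separate file.
ANNOTATIONS = [
    Span(0, 12, "entity", "substance"),
    Span(13, 21, "pos", "VBZ"),
    Span(22, 28, "entity", "cyp450"),
]

if __name__ == "__main__":
    for text, layer, label in render(SOURCE, ANNOTATIONS):
        print(layer, label, repr(text))
```
</pre>
<p>Because the annotations only refer to offsets, several layers (for
example part of speech and entity) can cover the same text without
interfering with one another, and the source string stays exactly as it
was.</p>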


<div align="center">*<br>

</div>

<p>(3)  The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21">PennBioIE

Oncology</a> corpus consists of 1414 <a

 href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi">PubMed</a>

abstracts on cancer, concentrating on molecular genetics.  The

abstracts comprise approximately 381,000 total words of text. Each file

has been

tokenized and its biomedical portions (327,000 total words)

exhaustively annotated for paragraph, sentence, and part of speech, and

non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types

of named entity. 318 of the abstracts have also been syntactically

annotated. </p>

<p>Annotation at all layers except entity is based on the <a

 href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/">Penn Treebank

II guidelines</a>, with a number of modifications that have been found

necessary, many of which were subsequently adopted by the Penn

Treebank. Entity definitions came originally from domain experts and

were developed and refined in dialogue with the annotators. All

annotation is standoff: the source text is never modified, annotations

being made in a separate file.  Paragraph, sentence, tokenization, POS,

and syntactic annotation (treebanking) are applied by automatic taggers

and manually corrected; entity annotation is manual.</p>


<p>The oncology data comprises two subcorpora: </p>

<ul>

  <li>The Sanger subcorpus <i>(san)</i> consists of abstracts of 577

articles previously annotated by the Sanger Institute for global

mention of oncological named entities. These annotations were metadata

reflecting the presence or absence of such mentions anywhere in the

text. The articles concentrate

on variations in a small set of human genes associated with many

different types of cancer. We did not refer to

the Sanger annotations after selection of the abstracts. </li>

  <li>The neuroblastoma subcorpus <i>(nb)</i> consists of 837

abstracts of articles dealing with this particular type of cancer

selected by colleagues at Children's Hospital of Philadelphia. They do

not all concentrate on genetics, but they mention a much larger number

of genes than the Sanger files do.</li>

</ul>

<hr size="2" width="100%"><br>

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

<pre class="moz-signature" cols="72">

</pre>

</body>

</html>