<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center"><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08"><b>LDC
Spoken Language Sampler Available for Free Download</b></a><br>
<br>
LDC2008S09<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09">CHAracterizing
INdividual Speakers (CHAINS)</a> -</b><br>
<br>
LDC2008T20<br>
<b>- </b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20"><b>PennBioIE
CYP 1.0</b></a><b> -</b><br>
<br>
LDC2008T21<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21">PennBioIE
Oncology 1.0</a></b><b> -<br>
<br>
</b><b>The Linguistic Data Consortium (LDC) would like to announce the
availability of a free spoken language sampler as well as the release
of three new publications.</b><br>
</div>
<br>
<b><br>
</b>
<hr size="2" width="100%"><br>
<b><br>
</b>
<div align="center"><b>LDC Spoken Language Sampler Available for Free
Download</b><br>
</div>
<br>
The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08">LDC
Spoken Language Sampler</a> provides speech, transcript
and lexicon samples and is designed to illustrate the variety and
breadth of the resources available from LDC’s Catalog. Created for
distribution at NWAV 37 and geared towards sociolinguists, the sampler
is a good introduction to data available from the LDC. The sampler
includes excerpts from telephone conversations in Arabic
(Gulf, Iraqi, and Levantine dialects), Farsi, Japanese, Korean, Spanish,
and Tamil;
dictionary resources for Mawukakan and Tamil; transcribed meeting
speech; utterances in Russian from native and non-native speakers; and
speech samples which represent regional accents and dialects of the
United States. Audio samples range from 30 seconds to 90 seconds and
are accompanied by transcripts.<br>
<br>
The sampler can be downloaded for free from the catalog page for the <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08">LDC
Spoken Language Sampler</a>. Please scroll down to 'How to Obtain' for
a download link.<br>
<br>
<b><br>
</b>
<div align="center"><b>New Publications</b><br>
</div>
<b><br>
</b>
<p>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09">CHAracterizing
INdividual Speakers (CHAINS)</a> contains recordings
of thirty-six English speakers reading fables and selected sentences in
different speaking styles. The data was recorded in two sessions
held about two months apart. The goal of the
corpus is to provide a range of speaking styles and voice
modifications for speakers sharing the same accent. Other existing
corpora, in particular <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26">CSLU
Speaker Recognition Version 1.1</a>, <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">TIMIT</a>
and <a href="http://www.phon.ox.ac.uk/IViE/">the IViE corpus</a>
(English Intonation in the British Isles), served as referents in the
selection of material. This design decision was made so that
methods designed and evaluated on the CHAINS corpus can be directly
tested on these other corpora, which were recorded with quite
different dialects and channel characteristics. </p>
<p>The data was collected in two recording sessions in a total of six
different speaking styles: </p>
<ul>
<li>solo reading </li>
<li>synchronous reading </li>
<li>spontaneous speech ("retell") </li>
<li>repetitive synchronous imitation ("rsi") </li>
<li>whispered fast reading </li>
<li>fast speech reading </li>
</ul>
<p>In two of the speaking conditions adopted, speakers modified their
speech in a constrained fashion towards a known target; in the
synchronous condition, the speech of the co-speaker served as a target,
while in rsi, there was an explicit known static target. The presence
of a known target which speakers aim to copy raises the bar in the
discovery and design of procedures for automatic speaker identification,
as the target speech provides a potentially highly confusing foil. The
whisper and fast speech conditions are also well-defined speaking
styles which require substantial voice modification by the speaker.</p>
<div align="center"><b>*</b><br>
</div>
<p>(2) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20">PennBioIE
CYP</a> corpus consists of 1100 <a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi">PubMed</a>
abstracts on the inhibition of cytochrome P450 enzymes. The abstracts
comprise approximately 313,000 total words of text. Each file has been
tokenized and its biomedical portions (274,000 total words)
exhaustively annotated for paragraph, sentence, and part of speech, and
non-exhaustively annotated for 5 types of biomedical named entity in
three categories of interest. 324 of the abstracts have also been
syntactically annotated. </p>
<p>Annotation at all layers except entity is based on the <a
href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/">Penn Treebank
II guidelines</a>, with a number of modifications that have been found
necessary, many of which were subsequently adopted by the Penn
Treebank. Entity definitions came originally from domain experts and
were developed and refined in dialogue with the annotators. All
annotation is standoff: the source text is never modified, annotations
being made in a separate file. Paragraph, sentence, tokenization, POS,
and syntactic annotation (treebanking) are applied by automatic taggers
and manually corrected; entity annotation is manual.<br>
<br>
</p>
<div align="center">*<br>
</div>
<p>(3) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21">PennBioIE
Oncology</a> corpus consists of 1414 <a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi">PubMed</a>
abstracts on cancer, concentrating on molecular genetics. The
abstracts comprise approximately 381,000 total words of text. Each file
has been
tokenized and its biomedical portions (327,000 total words)
exhaustively annotated for paragraph, sentence, and part of speech, and
non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types
of named entity. 318 of the abstracts have also been syntactically
annotated. </p>
<p>Annotation at all layers except entity is based on the <a
href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/">Penn Treebank
II guidelines</a>, with a number of modifications that have been found
necessary, many of which were subsequently adopted by the Penn
Treebank. Entity definitions came originally from domain experts and
were developed and refined in dialogue with the annotators. All
annotation is standoff: the source text is never modified, annotations
being made in a separate file. Paragraph, sentence, tokenization, POS,
and syntactic annotation (treebanking) are applied by automatic taggers
and manually corrected; entity annotation is manual.</p>
<p>The oncology data comprises two subcorpora: </p>
<ul>
<li>The Sanger subcorpus <i>(san)</i> consists of abstracts of 577
articles previously annotated by the Sanger Institute for global
mention of oncological named entities. These annotations were metadata
reflecting the presence or absence of such mentions anywhere in the
text. The articles concentrate
on variations in a small set of human genes associated with many
different types of cancer. We did not refer to
the Sanger annotations after selection of the abstracts. </li>
<li>The neuroblastoma subcorpus <i>(nb)</i> consists of 837
abstracts of articles dealing with this particular type of cancer
selected by colleagues at Children's Hospital of Philadelphia. They do
not all concentrate on genetics, but they mention a much larger number
of genes than the Sanger files do.</li>
</ul>
<hr size="2" width="100%"><br>
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>