[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Nov 25 22:51:49 UTC 2008
*LDC Spoken Language Sampler Available for Free Download*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>
LDC2008S09 - CHAracterizing INdividual Speakers (CHAINS)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09>
LDC2008T20 - PennBioIE CYP 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20>
LDC2008T21 - PennBioIE Oncology 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21>
The Linguistic Data Consortium (LDC) would like to announce the
availability of a free spoken language sampler as well as the release of
three new publications.
------------------------------------------------------------------------
*LDC Spoken Language Sampler Available for Free Download*
The LDC Spoken Language Sampler
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>
provides speech, transcript, and lexicon samples and is
designed to illustrate the variety and breadth of the resources
available from LDC's Catalog. Created for distribution at NWAV 37 and
geared towards sociolinguists, the sampler is a good introduction to
data available from the LDC. The sampler includes excerpts from
telephone conversations in Arabic (Gulf, Iraqi, and Levantine dialects),
Farsi, Japanese, Korean, Spanish, and Tamil; dictionary resources for
Mawukakan and Tamil; transcribed meeting speech; utterances in Russian
from native and non-native speakers; and speech samples which represent
regional accents and dialects of the United States. Audio samples range
from 30 seconds to 90 seconds and are accompanied by transcripts.
The sampler can be downloaded for free from the catalog page for the LDC
Spoken Language Sampler
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>.
Please scroll down to 'How to Obtain' for a download link.

*New Publications*

(1) CHAracterizing INdividual Speakers (CHAINS)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09>
contains recordings of thirty-six English speakers reading fables and
selected sentences in different speaking styles. The data was obtained
in two different sessions with a time separation of about two months.
The goal of the corpus is to provide a range of speaking styles and
voice modifications for speakers sharing the same accent. Other existing
corpora, in particular CSLU Speaker Recognition Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>,
TIMIT
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>
and the IViE corpus <http://www.phon.ox.ac.uk/IViE/> (English Intonation
in the British Isles), served as referents in the selection of material.
This design decision was made to ensure that methods designed and
evaluated on the CHAINS corpus might be directly testable on these other
corpora, which were recorded using quite different dialects and channel
characteristics.
The data was collected in two recording sessions in a total of six
different speaking styles:
* solo reading
* synchronous reading
* spontaneous speech ("retell")
* repetitive synchronous imitation ("rsi")
* whispered fast reading
* fast speech reading
In two of the speaking conditions adopted, speakers modified their
speech in a constrained fashion towards a known target; in the
synchronous condition, the speech of the co-speaker served as a target,
while in rsi, there was an explicit known static target. The presence of
a known target which speakers aim to copy raises the bar in the
discovery and design of procedures for automatic speaker identification,
as the target speech provides a potentially highly confusing foil. The
whisper and fast speech conditions are also well defined speaking
styles which require substantial voice modification by the speaker.
(2) The PennBioIE CYP
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20>
corpus consists of 1100 PubMed
<http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on the
inhibition of cytochrome P450 enzymes. The abstracts comprise
approximately 313,000 total words of text. Each file has been tokenized
and its biomedical portions (274,000 total words) exhaustively annotated
for paragraph, sentence, and part of speech, and non-exhaustively
annotated for 5 types of biomedical named entity in three categories of
interest. 324 of the abstracts have also been syntactically annotated.
Annotation at all layers except entity is based on the Penn Treebank II
guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a
number of modifications that have been found necessary, many of which
were subsequently adopted by the Penn Treebank. Entity definitions came
originally from domain experts and were developed and refined in
dialogue with the annotators. All annotation is standoff: the source
text is never modified, annotations being made in a separate file.
Paragraph, sentence, tokenization, POS, and syntactic annotation
(treebanking) are applied by automatic taggers and manually corrected;
entity annotation is manual.
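To make the standoff idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the actual PennBioIE file format is not described in this announcement, and the `Span` class, the `covered_text` helper, the example sentence, and the layer and label names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: it points into the source text by offset."""
    start: int   # character offset into the source text (inclusive)
    end: int     # character offset into the source text (exclusive)
    layer: str   # hypothetical layer name, e.g. "pos" or "entity"
    label: str   # hypothetical label within that layer


def covered_text(source: str, span: Span) -> str:
    """Return the stretch of source text that an annotation refers to."""
    return source[span.start:span.end]


# The source text is stored once and never edited.
source = "Ketoconazole inhibits CYP3A4."

# Annotations live separately (in a real corpus, in a separate file).
annotations = [
    Span(0, 12, "entity", "substance"),   # made-up entity label
    Span(13, 21, "pos", "VBZ"),           # Penn Treebank tag for a 3sg verb
    Span(22, 28, "entity", "cyp450"),     # made-up entity label
]

for ann in annotations:
    print(ann.layer, ann.label, repr(covered_text(source, ann)))
```

Because each layer only refers to offsets, layers can be added, corrected, or dropped independently without touching the abstract text itself.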
(3) The PennBioIE Oncology
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21>
corpus consists of 1414 PubMed
<http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on cancer,
concentrating on molecular genetics. The abstracts comprise
approximately 381,000 total words of text. Each file has been tokenized
and its biomedical portions (327,000 total words) exhaustively annotated
for paragraph, sentence, and part of speech, and non-exhaustively
annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity.
318 of the abstracts have also been syntactically annotated.
Annotation at all layers except entity is based on the Penn Treebank II
guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a
number of modifications that have been found necessary, many of which
were subsequently adopted by the Penn Treebank. Entity definitions came
originally from domain experts and were developed and refined in
dialogue with the annotators. All annotation is standoff: the source
text is never modified, annotations being made in a separate file.
Paragraph, sentence, tokenization, POS, and syntactic annotation
(treebanking) are applied by automatic taggers and manually corrected;
entity annotation is manual.
The oncology data comprises two subcorpora:
* The Sanger subcorpus (san) consists of abstracts of 577 articles
previously annotated by the Sanger Institute for global mention of
oncological named entities. These annotations were metadata
reflecting the presence or absence of such mentions anywhere in
the text. The articles concentrate on variations in a small set of
human genes associated with many different types of cancer. We did
not refer to the Sanger annotations after selection of the abstracts.
* The neuroblastoma subcorpus (nb) consists of 837 abstracts of
articles dealing with this particular type of cancer selected by
colleagues at Children's Hospital of Philadelphia. They do not all
concentrate on genetics, but they mention a much larger number of
genes than the Sanger files do.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
University of Pennsylvania      Fax: (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu