[Corpora-List] New Data from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed May 24 20:50:20 UTC 2006
LDC2006S26
CSLU: Speaker Recognition Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>

LDC2006T10
English-Arabic Treebank V1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10>

LDC2006S33
Middle East Technical University Turkish Microphone Speech V 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33>

In this month's newsletter, the Linguistic Data Consortium (LDC) would
like to announce the availability of three new publications.
------------------------------------------------------------------------
*New Publications*
(1) CSLU: Speaker Recognition Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>
consists of telephone speech from 91 participants. Each participant
recorded speech in twelve sessions over a two-year period, answering
questions like "what is your eye color" or responding to prompts like
"describe a typical day in your life." Most of the utterances in the
corpus have corresponding non-time-aligned word-level transcriptions.
The goal of the Speaker Recognition data collection was to collect speech
from each participant over a two-year period. Each participant called
the data collection system twelve times over that period and
said the same utterances each time.
(2) English-Arabic Parallel Treebank V1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10>
consists of 52,238 words in 224 files of individual Agence France Presse
(AFP) news stories (corresponding to approximately the first 50K words
of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02).
The English translation was provided by LDC, and was part-of-speech
tagged and treebanked for this project.
The guidelines followed for both part-of-speech and treebank annotation
are essentially Penn Treebank II style, with two notable differences:
1. POS: tokenization of hyphenated items ("New York-based" has been
replaced by "New York - based" for example), and the addition of
HYPH and AFX tags necessitated by this change in tokenization
2. TreeBank: the addition of the node label NML for sub-NP nominal
constituents (replacing NX and most NP-internal NAC)
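The tokenization change in item 1 can be illustrated with a minimal
sketch. It is not the LDC's annotation tooling: the function names
split_hyphens and tag_hyphens are hypothetical, and the rule that a
hyphen is split only when it sits between two word characters is an
assumption made for illustration. Only the hyphen token is tagged here;
assigning AFX and the remaining POS tags is out of scope.

```python
import re

# Assumption: a hyphen is split off only when it sits directly between
# two word characters; the actual guidelines may treat more cases.
_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphens(text):
    """Return whitespace tokens with word-internal hyphens as separate tokens."""
    return _HYPHEN.sub(" - ", text).split()


def tag_hyphens(tokens):
    """Pair each token with "HYPH" if it is a bare hyphen, else None (untagged)."""
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]


tokens = split_hyphens("New York-based")
tagged = tag_hyphens(tokens)
```

Here tokens holds the four tokens New, York, -, based, matching the
"New York - based" example above, and the hyphen token carries the HYPH
tag.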
(3) Middle East Technical University Turkish Microphone Speech V 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33>
was collected at the Middle East Technical University (METU)
as part of a collaboration between the Department of Electrical and
Electronics Engineering of the Middle East Technical University in
Turkey and the Center for Spoken Language Research (CSLR) of the
University of Colorado at Boulder, USA. The corpus was used to port
SONIC, the speech recognition system of CSLR, to Turkish.
The corpus contains text, speech, and alignment files. 120 speakers (60
male and 60 female) spoke 40 sentences each for a total of approximately
500 minutes of speech. The 40 sentences were selected randomly for each
speaker from a triphone-balanced set of 2462 Turkish sentences. All
participants were native speakers of Turkish.
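The per-speaker sentence selection described above could be sketched as
follows. This is an illustration only, not the procedure METU used: the
function name assign_sentences, the placeholder sentence strings, and
the fixed seed are hypothetical, and drawing without repetition within
one speaker's set is an assumption.

```python
import random


def assign_sentences(pool, n_speakers=120, per_speaker=40, seed=0):
    """Draw per_speaker distinct sentences from pool for each speaker."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [rng.sample(pool, per_speaker) for _ in range(n_speakers)]


# Placeholder stand-ins for the 2462 triphone-balanced sentences.
pool = [f"sentence_{i}" for i in range(2462)]
assignments = assign_sentences(pool)

# 120 speakers x 40 sentences each.
total_utterances = sum(len(chosen) for chosen in assignments)
```

With these figures the corpus holds 4800 utterances, so roughly 500
minutes of speech corresponds to an average of a little over six seconds
per utterance.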
------------------------------------------------------------------------
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax: (215) 573-2175
Suite 810                       ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu