[Corpora-List] New Data from the LDC

Wed May 24 20:50:20 UTC 2006

LDC2006S26
CSLU: Speaker Recognition Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>

LDC2006T10
English-Arabic Treebank V1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10>

LDC2006S33
Middle East Technical University Turkish Microphone Speech V 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33>

In this month's newsletter, the Linguistic Data Consortium (LDC) would
like to announce the availability of three new publications.

------------------------------------------------------------------------

*New Publications*

(1)  CSLU: Speaker Recognition Version 1.1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26> 
consists of telephone speech from 91 participants. Each participant
recorded speech in twelve sessions over a two-year period, answering
questions like "what is your eye color" or responding to prompts like
"describe a typical day in your life." Most of the utterances in the
corpus have corresponding non-time-aligned word-level transcriptions.

The goal of the Speaker Recognition data collection was to collect speech
from each participant over a two-year period. Each participant called
the data collection system twelve times over that period and said the
same utterances each time.


(2)  English-Arabic Parallel Treebank V1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10> 
consists of 52,238 words in 224 files of individual Agence France Presse 
(AFP) news stories (corresponding to approximately the first 50K words 
of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02). 
The English translation was provided by LDC, and was part-of-speech 
tagged and treebanked for this project.

The guidelines followed for both part-of-speech and treebank annotation 
are essentially Penn Treebank II style, with two notable differences:

   1. POS: tokenization of hyphenated items ("New York-based" has been
      replaced by "New York - based" for example), and the addition of
      HYPH and AFX tags necessitated by this change in tokenization
   2. TreeBank: the addition of the node label NML for sub-NP nominal
      constituents (replacing NX and most NP-internal NAC)
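The tokenization change in item 1 can be illustrated with a minimal
sketch. It assumes that every hyphen inside a whitespace-delimited token
is split off as its own token and tagged HYPH; the function names are
hypothetical, and the actual guidelines (including the use of AFX and
any exceptions) are not modeled here.

```python
import re


def split_hyphens(text):
    """Split whitespace-delimited tokens at hyphens, keeping each hyphen
    as a separate token (illustrative only, not the LDC tokenizer)."""
    tokens = []
    for word in text.split():
        # The capturing group makes re.split return the hyphens too;
        # empty strings from leading/trailing hyphens are dropped.
        tokens.extend(part for part in re.split(r"(-)", word) if part)
    return tokens


def tag_hyphens(tokens):
    """Pair each token with HYPH if it is a bare hyphen, else None.
    Real POS tags for the other tokens are outside this sketch."""
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]


tokens = split_hyphens("New York-based")
print(" ".join(tokens))  # New York - based
print(tag_hyphens(tokens))
```

Tokens without a hyphen pass through unchanged, so the function can be
applied to whole sentences.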


(3)  Middle East Technical University Turkish Microphone Speech V 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33> 
was collected at the Middle East Technical University (METU) as part of
a collaboration between the Department of Electrical and Electronics
Engineering at METU in Turkey and the Center for Spoken Language
Research (CSLR) of the University of Colorado at Boulder, USA.  The
corpus was used to port CSLR's speech recognition system, SONIC, to
Turkish.

The corpus contains text, speech, and alignment files.  120 speakers (60 
male and 60 female) spoke 40 sentences each for a total of approximately 
500 minutes of speech. The 40 sentences were selected randomly for each 
speaker from a triphone-balanced set of 2462 Turkish sentences. All 
participants were native speakers of Turkish.
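
The figures above can be checked with a short sketch. It assumes each
speaker's 40 sentences were drawn without repetition from the
2462-sentence set, which the description does not state explicitly; the
function name and the seed are hypothetical, and sentence IDs are
stand-in integers rather than the real prompts.

```python
import random

NUM_SPEAKERS = 120
SENTENCES_PER_SPEAKER = 40
POOL_SIZE = 2462
TOTAL_MINUTES = 500  # approximate figure from the corpus description


def assign_prompts(num_speakers, per_speaker, pool_size, seed=0):
    """Draw a random prompt list for each speaker from the shared pool.
    Sampling without replacement per speaker is an assumption."""
    rng = random.Random(seed)
    pool = range(pool_size)  # stand-in sentence IDs 0..pool_size-1
    return [rng.sample(pool, per_speaker) for _ in range(num_speakers)]


prompts = assign_prompts(NUM_SPEAKERS, SENTENCES_PER_SPEAKER, POOL_SIZE)

total_utterances = NUM_SPEAKERS * SENTENCES_PER_SPEAKER
avg_seconds = TOTAL_MINUTES * 60 / total_utterances

print(total_utterances)  # 4800
print(avg_seconds)       # 6.25
```

Under these assumptions the corpus holds 4800 utterances averaging
roughly 6 seconds each.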

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                      ldc at ldc.upenn.edu
Philadelphia, PA 19104                         http://www.ldc.upenn.edu
