<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

  <title></title>

</head>

<body text="#000000" bgcolor="#ffffff">

<div align="center">LDC2004S04<br>

<b>*  2002 NIST Speaker Recognition Evaluation (SRE)  *</b><br>

<br>

LDC2004T11<br>

<b>*  Arabic Treebank: Part 3 v 1.0  *</b><br>

<br>

LDC2004S05<br>

<b>*  ISL Meeting Corpus Speech Part 1  *</b><br>

<br>

LDC2004T10<br>

<b>*  ISL Meeting Corpus Transcripts Part 1  *</b><br>

</div>

<br>

<br>

<br>

<div align="center">The

Linguistic Data Consortium (LDC) is pleased to announce

the

availability of four new corpora.<br>

<br>

<br>

*<br>

</div>


<div align="center">

<p align="left"><br>

(1)  The 2002 NIST Speaker Recognition Evaluation is part of

an ongoing series of yearly evaluations conducted by NIST. These

evaluations provide an important contribution to the direction of

research efforts and the calibration of technical capabilities. They

are intended to be of interest to all researchers working on the

general problem of text-independent speaker recognition.  </p>

<p align="left">The main data for the 2002 NIST Speaker Recognition

Evaluation were extracted from Switchboard Cellular Part 2. The extended

data task used Switchboard II Phases 2 and 3. This evaluation

also included the first multi-modal task, using data from the FBI voice

database. There are a total of 9,153 speech files in SPHERE format,

totaling approximately 156 hours.  The 2002 NIST Speaker Recognition

Evaluation is distributed on 2 DVDs.<br>

</p>

<p align="left">For further information, including a link to the 2002

NIST Speaker Recognition Evaluation website, please visit:<br>

</p>

<p align="left"><a class="moz-txt-link-freetext"

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04</a><br>

</p>

<p align="left">Institutions that have membership in the LDC for the

2004 Membership

Year will be able to receive this corpus free of charge.  Nonmembers

may license this data for US$1000.<br>

<br>

</p>

*<br>

</div>

<br>

<p>(2)  Arabic Treebank: Part 3 v 1.0 is the third part of a

1,000,000-word Arabic Treebank corpus, designed to support language

research and development of language technology for Modern Standard

Arabic. 

This corpus includes 600 stories from the An Nahar News Agency. There

are a total of 340,281 words (counting non-Arabic tokens such as

numbers and punctuation) in the 600 files - one story per file. New

features of annotation include complete vocalization (including case

endings), lemma IDs, and more specific POS tags for verbs and

particles. </p>

<p>The corpus contains 293,035 Arabic-only word tokens (prior to the

separation of clitics), of which 290,842 (99.25%) were provided with an

acceptable morphological analysis and POS tag by the morphological

parser, and 2,193 (0.75%) were items that the morphological parser

failed to analyze correctly.  Arabic Treebank: Part 3 v 1.0 is

distributed on 1 CD. </p>

For further information, including online documentation, please visit:<br>

<br>

<a class="moz-txt-link-freetext"

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11</a><br>

<br>

Institutions that have membership in the LDC for the 2004 Membership

Year will be able to receive this corpus free of charge.  Nonmembers

may license this data for US$3000.<br>

<br>

<br>

<div align="center">*<br>

</div>

<br>

<p>(3)  ISL Meeting Speech Part 1 is the first subset of the ISL

Meeting

Corpus (112 meetings). It contains 18 meetings collected at the

Interactive Systems Laboratories at Carnegie Mellon University.  The

recorded meetings were

either natural meetings where participants needed to meet in the real

world, or artificial meetings, which were designed explicitly for the

purposes of data collection but still had real topics and tasks. The

duration of the meetings in this corpus ranges from 8 to 64 minutes and

averages 34 minutes. Word-level orthographic transcriptions are

available as <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10">ISL

Meeting Transcripts Part 1</a>. </p>

ISL Meeting Speech Part 1 includes 105 speech files, for a total of

approximately 10 hours of meeting speech.  There are a total of 31

unique speakers in the corpus. Meetings involved anywhere from 3 to 9

participants, averaging 5. The corpus contains a significant

proportion of non-native English speakers, varying in fluency.  ISL

Meeting Speech Part 1 is distributed on 2 DVDs.<br>

<p>For further information, including a link to the ISL Meeting Room

project page, please visit: <br>

</p>

<a class="moz-txt-link-freetext"

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05</a><br>

<br>

Institutions that have membership in the LDC for the 2004 Membership

Year will be able to receive this corpus free of charge.  Nonmembers

may license this data for US$1500.<br>

<br>

<br>

<div align="center">*<br>

</div>

<p><br>

(4)  The ISL Meeting Transcripts Part 1 is the corresponding

transcription for <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05">ISL

Meeting Speech Part 1</a>.  This corpus consists of 19 word-level

transcripts of 18 meetings, time-synchronized to digitized audio

recordings. There are approximately 116,200 word tokens and 5,850 unique

word types in the transcripts. </p>

<p>Transcriptions were prepared by means of the TransEdit transcription

application. This application was developed for the transcription of

multi-channel recordings and displays a synchronized multi-track view

of all channels of a meeting, with listening and segmentation functions

for each channel separately.  ISL Meeting Transcripts Part 1 is

distributed by FTP transfer. </p>

For further information, including a sample transcript, please visit:<br>

<br>

<a class="moz-txt-link-freetext"

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10</a><br>

<br>

Institutions that have membership in the LDC for the 2004 Membership

Year will be able to receive this corpus free of charge.  Nonmembers

may license this data for US$500.<br>

<br>

<br>

<div align="center">*<br>

</div>

<br>

<br>

<div align="center">If you need additional information or would

like to inquire about membership in the LDC, please send email to <a

 class="moz-txt-link-rfc2396E" href="mailto:ldc@ldc.upenn.edu">&lt;ldc@ldc.upenn.edu&gt;</a>

or call 1 (215) 573-1275.<br>

</div>

<br>

<br>

<br>

<div align="center">----------------------------------------------------------------------------------------------------<br>

Linguistic Data Consortium<br>

University of Pennsylvania<br>

3600 Market St., Suite 810<br>

Philadelphia, PA 19104-2653<br>

Phone: 1 (215) 573-1275<br>

Fax: 1 (215) 573-2175<br>

email: <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a><br>

www: <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></div>


<br>

<br>

</body>

</html>