<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
<div align="center">LDC2004S04<br>
<b>* 2002 NIST Speaker Recognition Evaluation (SRE) *</b><br>
<br>
LDC2004T11<br>
<b>* Arabic Treebank: Part 3 v.1.0 * </b><br>
<br>
LDC2004S05<br>
<b>* ISL Meeting Corpus Speech Part 1 *</b><br>
<br>
LDC2004T10<br>
<b>* ISL Meeting Corpus Transcripts Part 1 *</b><br>
</div>
<br>
<br>
<br>
<div align="center">The
Linguistic Data Consortium (LDC) is pleased to announce
the
availability of four new corpora.<br>
<br>
<br>
*<br>
</div>
<div align="center">
<p align="left"><br>
(1) The 2002 NIST Speaker Recognition Evaluation is part of
an ongoing series of yearly evaluations conducted by NIST. These
evaluations provide an important contribution to the direction of
research efforts and the calibration of technical capabilities. They
are intended to be of interest to all researchers working on the
general problem of text-independent speaker recognition. </p>
<p align="left">The main data for the 2002 NIST Speaker Recognition
Evaluation were extracted from Switchboard Cellular Part 2. The extended
data task used phases 2 and 3 of Switchboard II. This evaluation
also included the first multi-modal task, using data from the FBI voice
database. There are a total of 9153 speech files in SPHERE format,
totaling approximately 156 hours. The 2002 NIST Speaker Recognition
Evaluation is distributed on 2 DVDs.<br>
</p>
<p align="left">For further information, including a link to the 2002
NIST Speaker Recognition Evaluation website, please visit:<br>
</p>
<p align="left"><a class="moz-txt-link-freetext"
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04</a><br>
</p>
<p align="left">Institutions that have membership in the LDC for the
2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers
may license this data for US$1000.<br>
<br>
</p>
*<br>
</div>
<br>
<p>(2) Arabic Treebank: Part 3 v 1.0 is the third part of a
1,000,000-word Arabic Treebank corpus, designed to support language
research and the development of language technology for Modern Standard
Arabic.
This corpus includes 600 stories from the An Nahar News Agency. There
are a total of 340,281 words (counting non-Arabic tokens such as
numbers and punctuation) in the 600 files, with one story per file. New
annotation features include complete vocalization (including case
endings), lemma IDs, and more specific POS tags for verbs and
particles. </p>
<p>The corpus contains 293,035 Arabic-only word tokens (prior to the
separation of clitics), of which 290,842 (99.25%) were provided with an
acceptable morphological analysis and POS tag by the morphological
parser, and 2,193 (0.75%) were items that the morphological parser
failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is
distributed on 1 CD. </p>
For further information, including online documentation, please visit:<br>
<br>
<a class="moz-txt-link-freetext"
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11</a><br>
<br>
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers
may license this data for US$3000.<br>
<br>
<br>
<div align="center">*<br>
</div>
<br>
<p>(3) ISL Meeting Speech Part 1 is the first subset of the ISL
Meeting
Corpus (112 meetings). It contains 18 meetings collected at the
Interactive Systems Laboratories at Carnegie Mellon University. The
recorded meetings were
either natural meetings where participants needed to meet in the real
world, or artificial meetings, which were designed explicitly for the
purposes of data collection but still had real topics and tasks. The
duration of the meetings in this corpus ranges from 8 to 64 minutes and
averages 34 minutes. Word-level orthographic transcriptions are
available as <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10">ISL
Meeting Transcripts Part 1</a>. </p>
ISL Meeting Speech Part 1 includes 105 speech files, for a total of
approximately 10 hours of meeting speech. There are a total of 31
unique speakers in the corpus. Meetings involved anywhere from 3 to 9
participants, averaging 5. The corpus contains a significant
proportion of non-native English speakers, varying in fluency. ISL
Meeting Speech Part 1 is distributed on 2 DVDs.<br>
<p>For further information, including a link to the ISL Meeting Room
project page, please visit: <br>
</p>
<a class="moz-txt-link-freetext"
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05</a><br>
<br>
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers
may license this data for US$1500.<br>
<br>
<br>
<div align="center">*<br>
</div>
<p><br>
(4) The ISL Meeting Transcripts Part 1 is the corresponding
transcription for <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05">ISL
Meeting Speech Part 1</a>. This corpus consists of 19 word-level
transcripts of 18 meetings, time synchronized to digitized audio
recordings. There are approximately 116,200 word tokens and 5,850 unique
word types in the transcripts. </p>
<p>Transcriptions were prepared by means of the TransEdit transcription
application. This application was developed for the transcription of
multi-channel recordings and displays a synchronized multi-track view
for all channels of a meeting, with listening and segmentation functions
for each individual channel. ISL Meeting Transcripts Part 1 is
distributed by FTP transfer. </p>
For further information, including a sample transcript, please visit:<br>
<br>
<a class="moz-txt-link-freetext"
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10</a><br>
<br>
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers
may license this data for US$500.<br>
<br>
<br>
<div align="center">*<br>
</div>
<br>
<br>
<div align="center">If you need additional information or would
like to inquire about membership in the LDC, please send email to <a
class="moz-txt-link-rfc2396E" href="mailto:ldc@ldc.upenn.edu">&lt;ldc@ldc.upenn.edu&gt;</a>
or call 1 (215) 573-1275.<br>
</div>
<br>
<br>
<br>
<div align="center">----------------------------------------------------------------------------------------------------<br>
Linguistic Data
Consortium
Phone: 1 (215) 573-1275<br>
University of Pennsylvania
Fax: 1
(215) 573-2175<br>
3600 Market St., Suite
810
email: <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a><br>
Philadelphia, PA
19104-2653 www: <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></div>
<br>
<br>
</body>
</html>