[Corpora-List] New Corpora from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed May 26 16:40:04 UTC 2004
LDC2004S04  2002 NIST Speaker Recognition Evaluation (SRE)
LDC2004T11  Arabic Treebank: Part 3 v.1.0
LDC2004S05  ISL Meeting Corpus Speech Part 1
LDC2004T10  ISL Meeting Corpus Transcripts Part 1
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of four new corpora.
(1) The 2002 NIST Speaker Recognition Evaluation is part of an ongoing
series of yearly evaluations conducted by NIST. These evaluations
provide an important contribution to the direction of research efforts
and the calibration of technical capabilities. They are intended to be
of interest to all researchers working on the general problem of
text-independent speaker recognition.
The main data for the 2002 NIST Speaker Recognition Evaluation were
extracted from Switchboard Cellular Part 2. The extended data task used
two phases of Switchboard II, Phases 2 and 3. This evaluation also
included the first multi-modal task, using data from the FBI voice
database. There are 9153 speech files in SPHERE format, totaling
~156 hours. The 2002 NIST Speaker Recognition Evaluation is distributed
on 2 DVDs.
For further information, including a link to the 2002 NIST Speaker
Recognition Evaluation website, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S04
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers may
license this data for US$1000.
(2) Arabic Treebank: Part 3 v 1.0 is the third part of a corpus of
1,000,000 words of Arabic Treebank, designed to support language
research and development of language technology for Modern Standard
Arabic. This corpus includes 600 stories from the An Nahar News Agency.
There are a total of 340,281 words (counting non-Arabic tokens such as
numbers and punctuation) in the 600 files - one story per file. New
features of annotation include complete vocalization (including case
endings), lemma IDs, and more specific POS tags for verbs and particles.
The corpus contains 293,035 Arabic-only word tokens (prior to the
separation of clitics), of which 290,842 (99.25%) were provided with an
acceptable morphological analysis and POS tag by the morphological
parser, and 2,193 (0.75%) were items that the morphological parser
failed to analyze correctly. Arabic Treebank: Part 3 v 1.0 is
distributed on 1 CD.
For further information, including online documentation, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers may
license this data for US$3000.
(3) ISL Meeting Speech Part 1 is the first subset of the ISL Meeting
Corpus (112 meetings). It contains 18 meetings collected at the
Interactive Systems Laboratories at Carnegie Mellon University. The
recorded meetings were either natural meetings where participants needed
to meet in the real world, or artificial meetings, which were designed
explicitly for the purposes of data collection but still had real topics
and tasks. The duration of the meetings in this corpus ranges from 8 to
64 minutes and averages 34 minutes. Word-level orthographic
transcriptions are available as ISL Meeting Transcripts Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10>.
ISL Meeting Speech Part 1 includes 105 speech files, for a total of
approximately 10 hours of meeting speech. There are a total of 31
unique speakers in the corpus. Meetings involved anywhere from 3 to 9
participants, averaging 5. The corpus contains a significant
proportion of non-native English speakers, varying in fluency. ISL
Meeting Speech Part 1 is distributed on 2 DVDs.
For further information, including a link to the ISL Meeting Room
project page, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers may
license this data for US$1500.
(4) The ISL Meeting Transcripts Part 1 is the corresponding
transcription for ISL Meeting Speech Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S05>.
This corpus consists of 19 word-level transcripts of 18 meetings,
time-synchronized to digitized audio recordings. There are approximately
116,200 word tokens and 5,850 unique word types in the transcripts.
Transcriptions were prepared using the TransEdit transcription
application, which was developed for the transcription of
multi-channel recordings. It displays a synchronized multi-track view
of all channels of a meeting, with listening and segmentation functions
for each channel separately. ISL Meeting Transcripts Part 1 is
distributed by FTP transfer.
For further information, including a sample transcript, please visit:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T10
Institutions that have membership in the LDC for the 2004 Membership
Year will be able to receive this corpus free of charge. Nonmembers may
license this data for US$500.
If you need additional information or would like to inquire about
membership in the LDC, please send email to <ldc at ldc.upenn.edu> or call
1 (215) 573-1275.
----------------------------------------------------------------------------------------------------
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104-2653
Phone: 1 (215) 573-1275
Fax: 1 (215) 573-2175
email: ldc at ldc.upenn.edu
www: http://www.ldc.upenn.edu