Corpora: New Corpora from the LDC
LDC Office
ldc at ldc.upenn.edu
Thu Jan 3 15:43:19 UTC 2002
** Chinese Treebank Version 2.0 **
** Switchboard Cellular Part 1 Audio **
** Switchboard Cellular Part 1 Transcription **
** Switchboard Cellular Part 1 Transcribed Audio **
The Linguistic Data Consortium (LDC) is pleased to announce the
availability of four new corpora.
**
1. Chinese Treebank Version 2.0 is the continuation of a project
started in Summer 1998; the project's goal is the creation of a 100,000
word corpus of Chinese with syntactic bracketing. The corpus contains
approximately 100,000 words drawn from 325 Xinhua newswire articles
dating from 1994 to 1998. Version 2.0 is GB encoded and formatted
similarly to the UPenn English Treebank, except that some original file
information, such as "SRCID" and "DATE", was retained in the data files.
Please note that Chinese Treebank 2.0 supersedes and replaces the
Chinese Penn Treebank Final Release (LDC2000T48).
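Because the data is GB encoded, the files need to be decoded before they
can be processed as Unicode text. The following is only a minimal sketch
in Python, assuming the files use the GB2312 character set (this
announcement says only "GB encoded"). The function names decode_gb and
read_gb_file are hypothetical and are not part of the LDC release.

```python
def decode_gb(raw, encoding="gb2312"):
    """Decode GB-encoded bytes into a Unicode string.

    The default encoding is an assumption; "gbk" or "gb18030" are
    supersets of GB2312 and may be tried if decoding fails.
    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode(encoding, errors="replace")


def read_gb_file(path, encoding="gb2312"):
    """Read a whole GB-encoded file and return its text as Unicode."""
    with open(path, "rb") as handle:
        return decode_gb(handle.read(), encoding)
```

Calling read_gb_file on a path to one of the data files would return
its contents as a single string, which can then be passed to whatever
tool is used to read Penn Treebank style bracketing.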
For more information, including samples and a link to the Chinese
Treebank Project website, please visit:
http://www.ldc.upenn.edu/Catalog/LDC2001T11.html
Institutions that have membership in the LDC during the 2001
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $200.
2. The Switchboard Cellular Part 1 project focused primarily on GSM
cellular phone technology. The project's goal was to recruit 190
subjects, balanced by gender, to participate in at least ten 5-6 minute
conversations each on GSM cellular phones under varied environmental
conditions.
The data was collected for research, development, and evaluation of
automatic systems for speech-to-text conversion, talker identification,
language identification, and speech signal detection.
Part 1 consists of three corpora: Audio, Transcriptions, and Transcribed
Audio. All three corpora contain documentation describing speaker
information, call information, and audit information.
The Switchboard Cellular Part 1 Audio release is a 13 CD-ROM publication
which contains approximately 65 hours of audio speech data. The Audio
corpus totals 1309 calls, or 2618 sides (1957 GSM), from 254
participants (129 Male, 125 Female). The data files are not compressed.
For further information, please visit:
http://www.ldc.upenn.edu/Catalog/LDC2001S13.html
Institutions that have membership in the LDC during the 2001
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $2600.
3. Switchboard Cellular Part 1 Transcription is an FTP file that
contains the 250 transcriptions of the speech data files in the
Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15).
Calls were transcribed using conventions similar to HUB-5 English.
For more information, including an example transcript, please visit:
http://www.ldc.upenn.edu/Catalog/LDC2001T14.html
Institutions that have membership in the LDC during the 2001
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $1000.
4. Switchboard Cellular Part 1 Transcribed Audio, a 3 CD-ROM
publication, contains the 250 speech data files that correspond with the
Switchboard Cellular Part 1 Transcription (LDC2001T14). The data files
are not compressed. There are approximately 12 hours of audio data.
For more information, please see:
http://www.ldc.upenn.edu/Catalog/LDC2001S15.html
Institutions that have membership in the LDC during the 2001
Membership Year will be able to receive this corpus free of charge.
Nonmembers may purchase this publication for $600.
**
If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.
---------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
3615 Market Street                Fax:   (215) 573-2175
Suite 200                         email: ldc at unagi.cis.upenn.edu
Philadelphia, PA 19104-2608       www:   http://www.ldc.upenn.edu