Corpora: Membership Renewal

LDC Office ldc at unagi.cis.upenn.edu
Thu Jul 27 19:23:01 UTC 2000


Dear Member,

This message is to remind you that the LDC membership year now
matches the calendar year. Membership years 1999 and 2000 are
currently open. If you have not already done so, you may join for
either 1999 or 2000 or both.  Remember that joining the LDC for a
membership year is almost always preferable to buying corpora
outright. There are two reasons for this.  First, the cost of a
membership is typically less than the cost of buying several LDC
corpora. Second, due to restrictions imposed by some of our
information providers, some LDC corpora are not available to
non-members.  Membership Year 1999 will close at the end of this
calendar year.

We will be sending out the membership renewal notices for the
2001 Membership Year in early December. Of course, you may join
for a future membership year at any time.

If you have already joined for membership year 2000, thank you
for your patronage. If you have not yet joined, I would like to
remind you of the benefits of a 2000 membership to LDC.  So far
we have released 6 collections this year.  They are:

Chinese Treebank (preliminary release)
Hong Kong Laws Parallel Text
Hong Kong News Parallel Text
Korean Newswire Text
BLLIP 1987-89 WSJ Corpus Release 1
Santa Barbara Corpus of Spoken American English Part-I

You can find a link to a description page for each of these
corpora at:  http://morph.ldc.upenn.edu/Catalog/by_year.html#2000

Current year members also have access to LDC Online and the
ability to purchase corpora from previous membership years at the
media cost of $100 per CD.

We also plan to release the following corpora this year:

1998 HUB 4 Broadcast News Evaluation English Test Material
(LDC2000S86) - The evaluation test material used in the 1998
DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4
English Benchmark Test administered by the NIST Spoken Natural
Language Processing Group.  Approximately three hours of English
Broadcast News from PRI, ABC News, Cable News Network, and the
University of Southern California with UTF Transcripts.

NRL Speech in Noisy Environments (SPINE) Audio Training Data -
The training data set of audio files of multiple speakers using
various vocoder and microphone headsets to communicate in
coordinated tasks at remote locations. Approximately 140
conversations of five minutes each.

NRL Speech in Noisy Environments (SPINE) Training Data
Transcripts - The transcript files for the previous SPINE audio
publication.

TDT-2 Careful Transcriptions - 10 hours of BC audio from the TDT-2
corpus transcribed to the Hub-4 specification for use in ASR.

TDT-2 Audio Mandarin - Audio of the VOA Chinese Broadcasts from
Feb-Jun 1998. The transcripts are provided in the TDT-2 Mandarin
Text or TDT-2 Multilanguage Text.

Czech VOA Audio and Transcripts - Approximately 30 hours of VOA
broadcast news in Czech collected during the summer of 1999 with
the associated transcripts created at the University of West
Bohemia in the Czech Republic (used in the JHU 1999 Summer
Workshops).

1999 HUB 4 Broadcast News Evaluation English Test Material - The
evaluation test material used in the 1999 DARPA/NIST Continuous
Speech Recognition Broadcast News Hub-4 English Benchmark Test.
Approximately one and a half hours of broadcast news audio and
transcripts.

1999 HUB 4 Broadcast News Evaluation Non English (Mandarin) Test
Material - The evaluation test material prepared in accordance
with the DARPA/NIST Continuous Speech Recognition Broadcast News
Hub-4 Non English Benchmark Test; however, the test was not
conducted.  Approximately one and a half hours of broadcast news
audio and transcripts.

TREC Chinese - This is the set of documents used for the Chinese
task in TRECs 5-6.  It consists of approximately 170 megabytes of
articles drawn from the People's Daily newspaper and the Xinhua
newswire formatted to include TREC document ids.  The text is
Mandarin and is encoded using the Big 5 encoding scheme.  The
topics (questions) and relevance judgments (right answers) that
complete the test collections can be downloaded from the TREC web
site (http://trec.nist.gov) in the Data/Non-English section.

TREC Spanish - This is the set of documents used for the Spanish
task in TRECs 3-5.  It consists of approximately 250 megabytes of
the Mexican newspaper El Norte and 300 megabytes of Agence France
Presse 1994 newswire text formatted to include TREC document ids.
The El Norte documents were used for TRECs 3-4, and the Agence
France Presse documents for TREC 5.  The topics (questions) and
relevance judgments (right answers) that complete the test
collections can be downloaded from the TREC web site
(http://trec.nist.gov) in the Data/Non-English section.

Japanese Lexicon - This is a revised version of the CallHome
Japanese lexicon. Revisions include tagging of obsolete forms in
the original and additions of common place names, days of the
week, etc., that happen not to occur in the CallHome Japanese
transcripts.

Spanish Lexicon - A revised version of the CallHome Spanish
lexicon that contains additional lexical items from recent
transcription efforts.

Mandarin Lexicon - A pronunciation dictionary containing 44,404
words.  It covers both telephone and broadcast speech transcripts
and text data (newswire) of Hub-4 Mandarin and Hub-5 Mandarin. A
new version - covering ALL of the transcripts and text data - is
being compiled at present. The lexicon is text-based and
GB-encoded.

Thai Newswire - The Thai newswire Krungthep Turakij was collected
from May, 1997 until July, 1999. It is encoded in TIS-620, and
has been tagged using a simple, Tipster style tagging scheme.

TDT-3 Text/Audio - Audio and text from TDT-1999, including 8
English and 3 Mandarin sources (television, radio and newswire)
collected from Oct-Dec 1998, divided into stories and exhaustively
annotated for relevance to 60 topics selected from the corpus.

Classified Ads - Ads collected from the Internet sites of several
major newspapers. The ads have been annotated to show the full
reading of both standard and non-standard abbreviations used. The
corpus was collected for the JHU 1999 Summer Workshop and
annotated by Richard Sproat and his group at the workshop.

LDC Named Entity Tags - Named entity style annotations following
the TREC Named-Entity task definition.

If you would like to receive further information about these
corpora or request an invoice for the 2000 membership year,
please write to <ldc at ldc.upenn.edu> or call 215.573.1275.

Best,

Shannon Sears
Manager, Intellectual Property Rights and Membership
----------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3615 Market Street                  Fax:   (215) 573-2175
Suite 200                           email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2608         www: http://www.ldc.upenn.edu


