Corpora: New Corpora

Wed Oct 11 21:12:56 UTC 2000

The Linguistic Data Consortium is pleased to announce three new
corpora.

Voice of America (VOA) Czech Broadcast News Audio
http://morph.ldc.upenn.edu/Catalog/LDC2000S89.html
$900 for nonmembers

Between February 9 and May 28, 1999, the Linguistic Data
Consortium collected approximately 30 hours of broadcast audio
from the Voice of America news service in Czech. The 62 data
files presented in this corpus represent the audio of the daily
broadcasts of 30-minute news programs.

Voice of America (VOA) Czech Broadcast News Transcript Corpus
http://morph.ldc.upenn.edu/Catalog/LDC2000T53.html
$200 for nonmembers

The transcriptions were created by native Czech speakers,
working at the Department of Cybernetics, University of West
Bohemia (UWB) in Pilsen, under the direction of Josef Psutka and
Pavel Ircing. They used transcription software provided by the
LDC (the "transcriber" package, developed by Edouard Geoffrois
and Claude Barras at DGA, France, with assistance from Zhibiao
Wu at the LDC; the package is currently available from the LDC
web site: www.ldc.upenn.edu).  The transcript files are presented
here in a format that was defined by the speech group at NIST,
who refer to it as the "Universal Transcription Format" (UTF --
not to be confused with the "Unicode Transformation Formats").
The transcription text is rendered using the ISO 8859-2
character set.
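
Because the text is in ISO 8859-2 rather than Unicode, most current
tools need the files decoded explicitly. The following is a minimal
sketch in Python, not part of the corpus distribution: the helper
names and the sample string are hypothetical, and it assumes only
that the bytes are plain ISO 8859-2 as stated above.

```python
# Minimal sketch: convert ISO 8859-2 (Latin-2) transcript bytes to Unicode.
# The function names and the sample text are illustrative assumptions,
# not anything shipped with the corpus.

def decode_latin2(raw: bytes) -> str:
    """Decode ISO 8859-2 bytes into a Python (Unicode) string."""
    return raw.decode("iso8859_2")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-2 bytes as UTF-8 for tools that expect it."""
    return decode_latin2(raw).encode("utf-8")


if __name__ == "__main__":
    # Stand-in for bytes read from a transcript file opened in binary mode.
    sample = "zprávy v češtině".encode("iso8859_2")
    print(decode_latin2(sample))
```

When reading a whole file, passing encoding="iso8859_2" to Python's
built-in open() has the same effect as decoding the bytes by hand.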

TREC Spanish
http://morph.ldc.upenn.edu/Catalog/LDC2000T51.html
$200 for nonmembers

This is the set of documents used for the Spanish task in TRECs
3-5. It consists of approximately 250 megabytes of the Mexican
newspaper El Norte and 300 megabytes of Agence France Presse
1994 newswire text, formatted to include TREC document IDs. The
El Norte documents were used for TRECs 3-4, and the Agence
France Presse documents for TREC 5. The topics (questions) and
relevance judgments (right answers) that complete the test
collections can be downloaded from the TREC web site
(http://trec.nist.gov) in the Data/Non-English section.  Users
who wish to receive this corpus must sign the user license, which
can be obtained from
http://morph.ldc.upenn.edu/Catalog/mem_agree/trec-spanish.html.

If you would like to order a copy of these corpora, please email
your request to <ldc at unagi.cis.upenn.edu>.  If you need
additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or
call (215) 573-1275.

Further information about the LDC and its available corpora can
be accessed on the Linguistic Data Consortium WWW Home Page at
URL: http://www.ldc.upenn.edu/