[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jul 24 14:48:27 UTC 2009
LDC2009S02 - Czech Broadcast Conversation Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>
LDC2009T20 - Czech Broadcast Conversation MDE Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>
LDC2009T21 - Spanish Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21>
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
------------------------------------------------------------------------
*New Publications*
(1) Czech Broadcast Conversation Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>
was prepared by researchers at the University of West Bohemia, Pilsen,
Czech Republic, and consists of 40 hours of speech from Radioforum, a
talk show broadcast on Czech Radio 1. Transcripts corresponding to the
audio files in this corpus are provided in Czech Broadcast Conversation
MDE Transcripts (LDC2009T20)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>.
Czech Broadcast Conversation Speech consists of 72 single-channel
recordings of Radioforum, a live talk program broadcast by Czech Radio 1
(CRo1) <http://www.rozhlas.cz/radiozurnal/portal/> every weekday
evening. Its format consists of invited guests spontaneously answering
topical questions posed by one or two interviewers. The number of
interviewees in a single program varies from one to three, but
typically, one interviewer and two interviewees appear in the program.
The material includes passages of interactive dialogue, but longer
stretches of monologue-like speech comprise the majority of the
collected data. Radioforum also has an interactive segment where
listeners call the studio and ask their own questions. That telephony
speech was not transcribed in the current release.
Individual recordings range from 27 minutes to 36 minutes each. The
recordings were collected during the period from February 12, 2003
through June 26, 2003. The signal is mono, sampled at 22.05 kHz with
16-bit resolution, stored in Windows PCM waveform format. The names of
the audio files refer to the broadcast date (rfYYMMDD.wav).
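
As a minimal sketch of the naming and audio conventions described above,
the following Python snippet recovers the broadcast date from a file name
of the form rfYYMMDD.wav and checks a file against the announced signal
parameters. The function names are hypothetical, and reading the
two-digit year as 20YY is an assumption based on the 2003 collection
period.

```python
import re
import wave
from datetime import date

# Pattern for names such as "rf030212.wav": "rf" + YYMMDD + ".wav".
_NAME_RE = re.compile(r"^rf(\d{2})(\d{2})(\d{2})\.wav$")


def parse_broadcast_date(filename):
    """Return the broadcast date encoded in an rfYYMMDD.wav file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an rfYYMMDD.wav name: {filename!r}")
    yy, mm, dd = (int(group) for group in match.groups())
    # Assumption: two-digit years mean 20YY (the recordings date from 2003).
    return date(2000 + yy, mm, dd)


def matches_announced_format(path):
    """Check for mono, 22.05 kHz, 16-bit PCM, as stated in the announcement."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1        # mono
            and wav.getframerate() == 22050  # 22.05 kHz
            and wav.getsampwidth() == 2      # 16 bits = 2 bytes per sample
        )
```

For example, parse_broadcast_date("rf030212.wav") gives February 12,
2003, the first day of the collection period.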
(2) Czech Broadcast Conversation MDE Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>
was prepared by researchers at the University of West Bohemia, Pilsen,
Czech Republic, and consists of approximately 33 hours of transcribed
speech from Radioforum, a talk show broadcast on Czech Radio 1. The
audio files corresponding to the transcripts in this corpus are
contained in Czech Broadcast Conversation Speech (LDC2009S02)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>.
Czech Broadcast Conversation MDE Transcripts was created to extend
Metadata Extraction (MDE) research to conversational Czech. The goal of
MDE is to take raw speech recognition output and refine it into forms
that are of more use to humans and to downstream automatic processes. In
simple terms, this means the creation of automatic transcripts that are
maximally readable. This readability might be achieved in a number of
ways: removing non-content words like filled pauses and discourse
markers from the text; removing sections of disfluent speech; and
creating boundaries at natural breakpoints in the flow of speech so
that each sentence or other meaningful unit of speech might be presented
on a separate line within the resulting transcript. Natural
capitalization, punctuation and standardized spelling, plus sensible
conventions for representing speaker turns and identity are further
elements in the readable transcript.
The transcripts and annotations in this corpus are stored in three
different formats: TRS (Transcriber <http://trans.sourceforge.net>), QAn
(Quick Annotator <http://www.mde.zcu.cz/qan.html>), and RTTM. TRS
represents a standard speech transcript. QAn and RTTM also contain
information about structural metadata (MDE). Character encoding in all
files is ISO-8859-2.
All filenames have the form rfYYMMDD.format where "rf" stands for
Radioforum, the following six digits indicate the date of broadcast, and
the extension ".format" corresponds to the data format of the particular
file: ".trs", ".qan", or ".rttm".
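
The file naming and encoding conventions above can be sketched in Python
as follows. This is an illustration only: the function names are
hypothetical, and it assumes that an audio file and its transcripts share
the same rfYYMMDD stem, which the announcement implies but does not state
outright.

```python
from pathlib import Path

FORMATS = ("trs", "qan", "rttm")  # extensions named in the announcement
ENCODING = "iso-8859-2"           # character encoding stated for all files


def transcript_names(audio_name):
    """Map an audio file name (rfYYMMDD.wav) to its transcript file names."""
    stem = Path(audio_name).stem  # e.g. "rf030212"
    return [f"{stem}.{ext}" for ext in FORMATS]


def read_transcript(path):
    """Read a transcript file as text, decoding it from ISO-8859-2."""
    return Path(path).read_text(encoding=ENCODING)
```

Passing the encoding explicitly matters here because Czech letters such
as "Ř" or "ě" would be misread under a UTF-8 or Latin-1 default.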
(3) Spanish Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21>
is a comprehensive archive of newswire text data that has been acquired
over several years by LDC. This second edition updates Spanish Gigaword
First Edition (LDC2006T12)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>
and adds data collected from January 1, 2006 through December 31, 2008.
The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
* Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
* Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008
The seven-character codes in the parentheses above consist of a
three-character source name abbreviation and the three-character
language code ("spa") separated by an underscore ("_") character. The
three-letter language code conforms to LDC's internal convention based
on the ISO 639-3 standard. These codes are used in the directory names
where the data files are found and in the prefix that appears at the
beginning of every data file name. They are also used (in all UPPER
CASE) as the initial portion of the DOC "id" strings that uniquely
identify each news story.
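
A short Python sketch of how these codes might be handled is given
below. The function names are hypothetical, and the sample DOC id in the
usage note is an illustrative assumption: the announcement specifies only
that an id begins with the upper-case source code, not the form of the
remainder.

```python
# Source codes and names as listed in the announcement.
SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_code(code):
    """Split a code such as "afp_spa" into (source, language)."""
    source, language = code.split("_")
    return source, language


def source_of_doc_id(doc_id):
    """Return the source code whose upper-case form begins the DOC id."""
    for code in SOURCES:
        if doc_id.startswith(code.upper()):
            return code
    raise ValueError(f"unrecognized DOC id prefix: {doc_id!r}")
```

With a hypothetical id such as "AFP_SPA_19940512.0001",
source_of_doc_id returns "afp_spa", and split_code("afp_spa") returns
("afp", "spa").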
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu