[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Jul 24 14:48:27 UTC 2009


LDC2009S02
- Czech Broadcast Conversation Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02> -

LDC2009T20
- Czech Broadcast Conversation MDE Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20> -

LDC2009T21
- Spanish Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21> -

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

------------------------------------------------------------------------

*New Publications*

(1) Czech Broadcast Conversation Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02> 
was prepared by researchers at the University of West Bohemia, Pilsen, 
Czech Republic, and consists of 40 hours of speech from Radioforum, a 
talk show broadcast on Czech Radio 1. Transcripts corresponding to the 
audio files in this corpus are provided in Czech Broadcast Conversation 
MDE Transcripts (LDC2009T20) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20>. 

Czech Broadcast Conversation Speech consists of 72 single channel 
recordings of Radioforum, a live talk program broadcast by Czech Radio 1 
(CRo1) <http://www.rozhlas.cz/radiozurnal/portal/> every weekday 
evening. Its format consists of invited guests spontaneously answering 
topical questions posed by one or two interviewers. The number of 
interviewees in a single program varies from one to three, but 
typically, one interviewer and two interviewees appear in the program. 
The material includes passages of interactive dialogue, but longer 
stretches of monologue-like speech comprise the majority of the 
collected data. Radioforum also has an interactive segment where 
listeners call the studio and ask their own questions. That telephony 
speech was not transcribed in the current release.

Individual recordings range from 27 minutes to 36 minutes each. The 
recordings were collected during the period from February 12, 2003 
through June 26, 2003. The signal is mono, sampled at 22.05 kHz with 
16-bit resolution, stored in Windows PCM waveform format. The names of 
the audio files refer to the broadcast date (rfYYMMDD.wav).
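
As an illustration only, the stated audio format can be checked with 
Python's standard wave module. This is a minimal sketch: the function 
name check_wav_format and the shape of its return value are assumptions 
made for this example and are not part of the corpus documentation.

```python
import wave

# Values taken from the corpus description above.
EXPECTED_CHANNELS = 1       # mono
EXPECTED_RATE_HZ = 22050    # 22.05 kHz
EXPECTED_SAMPLE_WIDTH = 2   # 16-bit resolution = 2 bytes per sample


def check_wav_format(path):
    """Return (problems, duration_seconds) for one PCM WAV file.

    `problems` is a list of human-readable mismatches between the file
    header and the format stated in the corpus description.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"channels: expected {EXPECTED_CHANNELS}, found {channels}")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate: expected {EXPECTED_RATE_HZ}, found {rate}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample width: expected {EXPECTED_SAMPLE_WIDTH}, found {width}")
    return problems, duration
```

A recording that matches the description should yield an empty problem 
list and, given the stated range, a duration of roughly 27 to 36 minutes.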

(2) Czech Broadcast Conversation MDE Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T20> 
was prepared by researchers at the University of West Bohemia, Pilsen, 
Czech Republic, and consists of approximately 33 hours of transcribed 
speech from Radioforum, a talk show broadcast on Czech Radio 1. The 
audio files corresponding to the transcripts in this corpus are 
contained in Czech Broadcast Conversation Speech (LDC2009S02) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S02>. 

Czech Broadcast Conversation MDE Transcripts was created to extend 
Metadata Extraction (MDE) research to conversational Czech. The goal of 
MDE is to take raw speech recognition output and refine it into forms 
that are of more use to humans and to downstream automatic processes. In 
simple terms, this means the creation of automatic transcripts that are 
maximally readable. This readability might be achieved in a number of 
ways: removing non-content words like filled pauses and discourse 
markers from the text; removing sections of disfluent speech; and 
creating boundaries between natural breakpoints in the flow of speech so 
that each sentence or other meaningful unit of speech might be presented 
on a separate line within the resulting transcript. Natural 
capitalization, punctuation and standardized spelling, plus sensible 
conventions for representing speaker turns and identity are further 
elements in the readable transcript.

The transcripts and annotations in this corpus are stored in three 
different formats: TRS (Transcriber <http://trans.sourceforge.net>), QAn 
(Quick Annotator <http://www.mde.zcu.cz/qan.html>), and RTTM. TRS 
represents a standard speech transcript. QAn and RTTM also contain 
information about structural metadata (MDE). Character encoding in all 
files is ISO-8859-2.
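
Because the files are ISO-8859-2 rather than UTF-8, tools that assume 
UTF-8 will misread the Czech diacritics. The following is a minimal 
sketch of reading a file with the correct codec and writing a UTF-8 
copy. The function names are hypothetical, and the sketch assumes 
nothing about the internal structure of the TRS, QAn or RTTM formats; 
if a file carries its own encoding declaration, that declaration would 
need to be updated separately.

```python
def read_latin2(path):
    """Read a text file encoded as ISO-8859-2 and return it as a str."""
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="iso-8859-2", newline="") as handle:
        return handle.read()


def convert_to_utf8(src, dst):
    """Write a UTF-8 copy of an ISO-8859-2 text file."""
    text = read_latin2(src)
    with open(dst, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
```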

All filenames have the form rfYYMMDD.format where "rf" stands for 
Radioforum, the following six digits indicate the date of broadcast, and 
the extension ".format" corresponds to the data format of the particular 
file: ".trs", ".qan", or ".rttm".
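
The naming scheme can be decoded mechanically. The sketch below is 
illustrative only: the function name parse_radioforum_name is an 
assumption, and ".wav" is accepted in addition to the three transcript 
extensions because the companion audio corpus uses the same rfYYMMDD 
pattern.

```python
import re
from datetime import datetime

# "rf" + six digits (YYMMDD) + one of the known extensions.
FILENAME_RE = re.compile(r"^rf(\d{6})\.(trs|qan|rttm|wav)$")


def parse_radioforum_name(name):
    """Split a name like 'rf030212.trs' into (broadcast_date, extension)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a Radioforum file name: {name!r}")
    # %y maps two-digit years 00-68 to 2000-2068, so "03" becomes 2003.
    broadcast_date = datetime.strptime(match.group(1), "%y%m%d").date()
    return broadcast_date, match.group(2)
```

For example, parse_radioforum_name("rf030212.trs") yields the date 
2003-02-12 and the extension "trs".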

(3) Spanish Gigaword Second Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T21> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by LDC. This second edition updates Spanish Gigaword 
First Edition (LDC2006T12) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12> 
and adds data collected from January 1, 2006 through December 31, 2008.

The three distinct international sources of Spanish newswire in this 
edition, and the time spans of collection covered for each, are as follows:

    * Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008
    * Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008
    * Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008

The seven-character codes in the parentheses above consist of the 
three-character source name abbreviation and the three-character 
language code ("spa"), separated by an underscore ("_") character. The 
three-letter language code conforms to LDC's internal convention based 
on the ISO 639-3 standard. These codes are used in the directory names 
where the data files are found and in the prefix that appears at the 
beginning of every data file name. They are also used (in all UPPER 
CASE) as the initial portion of the DOC "id" strings that uniquely 
identify each news story.
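
The code convention described above can be sketched as follows. This is 
an illustration under stated assumptions: the helper names are 
hypothetical, and the sample DOC id in the usage note is made up to 
show the upper-case prefix only; the announcement does not specify the 
remainder of the id format.

```python
# Source codes and names as listed in the announcement.
SOURCES = {
    "afp_spa": "Agence France-Presse, Spanish Service",
    "apw_spa": "Associated Press Worldstream, Spanish",
    "xin_spa": "Xinhua News Agency, Spanish Service",
}


def split_code(code):
    """Split 'afp_spa' into ('afp', 'spa'), validating the 3+1+3 layout."""
    source, sep, language = code.partition("_")
    if sep != "_" or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return source, language


def doc_id_prefix(code):
    """Return the upper-case form used at the start of DOC id strings."""
    return code.upper()


def source_for_doc_id(doc_id):
    """Return the source code whose upper-case form begins doc_id, or None."""
    for code in SOURCES:
        if doc_id.startswith(doc_id_prefix(code)):
            return code
    return None
```

With a hypothetical id such as "AFP_SPA_19940512.0001", 
source_for_doc_id returns "afp_spa", which is also the directory name 
and file name prefix under which the story would be found.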

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                    http://www.ldc.upenn.edu

