8.1234, FYI: LDC Corpus, Spanish Corpora

Thu Aug 28 04:44:42 UTC 1997

LINGUIST List:  Vol-8-1234. Thu Aug 28 1997. ISSN: 1068-4875.

Subject: 8.1234, FYI: LDC Corpus, Spanish Corpora

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            T. Daniel Seely: Eastern Michigan U. <seely at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Associate Editors: Ljuba Veselinova <ljuba at linguistlist.org>
                   Ann Dizdar <ann at linguistlist.org>
Assistant Editor:  Martin Jacobsen <marty at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

Editor for this issue: Martin Jacobsen <marty at linguistlist.org>

=================================Directory=================================

1)
Date:  Wed, 27 Aug 1997 20:06:32 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpus from the Linguistic Data Consortium

2)
Date:  Wed, 27 Aug 1997 11:24:27 -0500 (CDT)
From:  lhartman at siu.edu (Lee Hartman)
Subject:  Spanish corpora

-------------------------------- Message 1 -------------------------------

Date:  Wed, 27 Aug 1997 20:06:32 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpus from the Linguistic Data Consortium

               Announcing a NEW RELEASE from the
                   LINGUISTIC DATA CONSORTIUM

	      Boston University Radio Speech Corpus

The Boston University Radio Speech Corpus was collected by Mari
Ostendorf of Boston University, primarily to support research in
text-to-speech synthesis, particularly generation of prosodic
patterns.  The corpus consists of professionally read radio news data,
including speech and accompanying annotations, suitable for speech and
language research.

The corpus includes speech from seven (4 male, 3 female) FM radio news
announcers associated with WBUR, a public radio station.  The main
radio news portion of the corpus consists of over seven hours of news
stories recorded in the WBUR radio studio during broadcasts over a
two-year period.  In addition, the announcers were recorded in a
laboratory at Boston University.  In this, the lab news portion, the
announcers read a total of 24 stories from the radio news portion.
The announcers were first asked to read the stories in their non-radio
style and then, 30 minutes later, to read the same stories in their
radio style.

Each story read by an announcer was digitized in paragraph-size units,
which typically include several sentences.  The files were digitized
at a 16 kHz sample rate using a 16-bit A/D.  The paragraphs were
annotated with the orthographic transcription, phonetic alignments,
part-of-speech tags and prosodic markers.  The orthographic
transcripts were generated by hand and indicate where the
speaker took a breath.  The phonetic alignments and part-of-speech
tags were generated automatically and hand-corrected.  The prosodic
labels were marked by hand and are available only for a subset of the
corpus.
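
As a rough guide to the storage involved, the sample rate and sample
width given above can be turned into a size estimate.  The sketch
below is only an illustration: it assumes single-channel, uncompressed
PCM with no file headers (the announcement does not state the channel
count or file format), and the function name pcm_bytes is ours.

```python
# Rough size estimate for uncompressed audio at the stated settings.
SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the announcement
BYTES_PER_SAMPLE = 2      # 16-bit samples, from the announcement
CHANNELS = 1              # ASSUMPTION: mono; not stated in the announcement


def pcm_bytes(duration_s, sample_rate_hz=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the number of bytes of raw PCM for duration_s seconds."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


one_minute = pcm_bytes(60)
seven_hours = pcm_bytes(7 * 3600)   # "over seven hours" is a lower bound

print(f"1 minute: {one_minute / 1e6:.2f} MB")
print(f"7 hours:  {seven_hours / 1e6:.1f} MB")
```

Under these assumptions, one minute of speech takes about 1.92 MB and
seven hours about 806 MB, before counting the lab news recordings or
the annotation files.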

Institutions that have membership in the LDC for either the 1996 or
1997 Membership Year will be able to receive the BU Radio Corpus at no
additional charge, in the same manner as all other speech corpora
published by the LDC.

Nonmembers can receive a copy of this corpus, for research purposes
only, for a fee of US$400.  If you would like to order a copy of this
corpus, please email your request to ldc at unagi.cis.upenn.edu. If you
need additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or call
(215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.
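
For readers who prefer to script the ftp step, the anonymous login
described above might look roughly like the following, using Python's
standard ftplib module.  This is a sketch, not an LDC-provided tool:
the function name list_ldc_directory and the ftp_factory parameter are
our own (the latter exists only so the function can be tried without a
network connection), and the contents of the directory are not
described in the announcement.

```python
import ftplib

HOST = "ftp.cis.upenn.edu"   # host named in the announcement
DIRECTORY = "pub/ldc"        # directory named in the announcement


def list_ldc_directory(email, ftp_factory=ftplib.FTP):
    """Log in anonymously and return the file names under pub/ldc.

    email       -- sent as the password, as the announcement requests
    ftp_factory -- hypothetical hook; defaults to the real ftplib.FTP
    """
    ftp = ftp_factory(HOST)              # connects to the host
    try:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(DIRECTORY)
        return ftp.nlst()                # plain list of names
    finally:
        ftp.close()


# Example call (requires network access and a reachable server):
#   names = list_ldc_directory("you@example.org")
```

The call is left commented out because it depends on the server being
reachable; substitute your own email address for the placeholder.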

-------------------------------- Message 2 -------------------------------

Date:  Wed, 27 Aug 1997 11:24:27 -0500 (CDT)
From:  lhartman at siu.edu (Lee Hartman)
Subject:  Spanish corpora

Nick Caffrey asked

>Does anyone have details of online Spanish corpora?

There is a corpus of written Argentine and Chilean Spanish, and
transcribed spoken Peninsular Spanish online at

        http://lola.lllf.uam.es

Because of recent technical difficulties, it may be temporarily
inaccessible, but keep trying.

-------------------------------------------------------------------
Lee Hartman
Dept. of Foreign Languages
Southern Illinois University
Carbondale, IL 62901-4521
U.S.A.

---------------------------------------------------------------------------
LINGUIST List: Vol-8-1234