15.1484, Qs: English Speech Corpora

Tue May 11 15:56:26 UTC 2004

LINGUIST List:  Vol-15-1484. Tue May 11 2004. ISSN: 1068-4875.

Subject: 15.1484, Qs: English Speech Corpora

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Steve Moran <steve at linguistlist.org>
 ==========================================================================
We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.

In addition to posting a summary, we'd like to remind people that it
is usually a good idea to personally thank those individuals who have
taken the trouble to respond to the query.

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

=================================Directory=================================

1)
Date:  Mon, 10 May 2004 12:09:19 -0700
From:  "Ingo Plag" <plag at anglistik.uni-siegen.de>
Subject:  speech corpora

-------------------------------- Message 1 -------------------------------

Date:  Mon, 10 May 2004 12:09:19 -0700
From:  "Ingo Plag" <plag at anglistik.uni-siegen.de>
Subject:  speech corpora

Dear Linguist Listers,

I have two queries concerning English speech corpora.

1. I am looking for a speech corpus (language: English) that is
part-of-speech tagged and has sound files, transcriptions, and
part-of-speech tags aligned. Furthermore, it needs to be of
considerable size (> 100,000 word tokens, if possible). Can anyone
point me towards pertinent corpora?

So far I have found only one corpus that meets all of the criteria
mentioned above: the Boston University Radio News Corpus.

2. In spite of hours of effort and the help of experienced
colleagues, I have not managed to open the example files of the BU
Radio News Corpus properly, whether I used PRAAT, Wavesurfer, or
Transcriber. All three programs can open the sound file (.sph)
without problems, but none of them can access the files with the
transcription or the part-of-speech tags and align this information
with the sound wave. Can anyone help? Which program(s) can do the
job?

Any help will be greatly appreciated.

Many thanks in advance!

Best regards,
Ingo Plag

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Prof. Dr. Ingo Plag
English Linguistics
Fachbereich 3
Universitaet-Gesamthochschule Siegen
Adolf-Reichwein-Str. 2
D-57068 Siegen

http://www.uni-siegen.de/~engspra/
tel. 0271-740-2560
tel. 0271-740-2349 (secretary)
fax 0271-740-3246
e-mail: plag at anglistik.uni-siegen.de
tel.: 06422-2817 (home)

office: room AR-K 103
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
