[Corpora-List] speech corpora

Ingo Plag plag at anglistik.uni-siegen.de
Thu May 13 19:08:30 UTC 2004


Dear Corpora Listers,

I have two queries concerning English speech corpora.

1. I am looking for an English speech corpus that is part-of-speech tagged
and in which the sound files, transcriptions and part-of-speech tags are
aligned. Furthermore, it needs to be of considerable size (> 100,000 word
tokens, if possible). Can anyone point me towards pertinent corpora?

So far I have found only one corpus that meets all the criteria mentioned
above, the Boston University Radio News Corpus.

2. Despite hours of effort and the help of experienced colleagues, I
have not managed to open the example files of the BU Radio News Corpus
properly, no matter whether I used PRAAT, Wavesurfer, or Transcriber. All
three programs can open the sound file (.sph) without problems, but none
of them can access the files with the transcription or the part-of-speech
tags and align this information with the sound wave. Can anyone
help? Which program(s) can do the job?

Any help will be greatly appreciated.

Many thanks in advance!

Best regards,
Ingo Plag

--
Ingo Plag
Linguistics Research Center
University of California at Santa Cruz
Santa Cruz CA 95060
USA

plag at anglistik.uni-siegen.de

phone (+1)-831-459-3823
fax (+1)-831-459-3334 (c/o Junko Ito)

phone at home: (+1)-831-429-1306


