[Corpora-List] Query about corpora of spoken English

Fri Dec 2 15:55:03 UTC 2005

R.M.Salkie at bton.ac.uk wrote:
> My colleague Nicolas Ballier (nicolas.ballier at lli.univ-paris13.fr) has
> asked me to post the following two queries.  Please reply directly to him.

It may be useful to others to have the replies in a public forum like this 
one - so here is a quick reply to the CORPORA list.

> 1.      Is there a web page which lists currently available corpora of
> spoken English (e.g. MARSEC, the MAchine REadable Spoken ENglish Corpus),
> stating whether the sound files are available?

You could try the following catalogue pages:

a) Linguistic Data Consortium (subset "speech"):
http://www.ldc.upenn.edu/Catalog/byType.jsp#speech

b) Evaluations and Language Resources Distribution Agency:
http://www.elda.org/rubrique6.html

c) International Computer Archive of Modern and Medieval English:
http://nora.hd.uib.no/whatis.html

d) The MARSEC corpus:
http://www.rdg.ac.uk/AcaDepts/ll/speechlab/marsec/

> 2.      Is there software available to align texts and sound files: for
> example, software that enables the user to listen to any part of the
> document by clicking on a word in the text?

First the sound file needs to be aligned with the linguistic annotation.  Some 
popular applications currently used for doing this manually are listed below 
(other applications exist for automatic segmentation of speech files). 
With any of these, you can click on an individual word and listen to it once 
a word-level segmentation has been carried out.

a)  Praat (has a very flexible scripting language):
http://www.fon.hum.uva.nl/praat/

b)  Emu (annotation at the segment level and at higher linguistic levels, 
with hierarchical structure; has some scripting capability for automatic 
building of trees):
http://emu.sourceforge.net/

c) Transcriber ("It provides a user-friendly graphical user interface for 
segmenting long duration speech recordings, transcribing them, and labeling 
speech turns, topic changes and acoustic conditions. It is more specifically 
designed for the annotation of broadcast news recordings, for creating 
corpora used in the development of automatic broadcast news transcription 
systems, but its features might be found useful in other areas of speech 
research.")
http://trans.sourceforge.net/en/presentation.php

d) MATE workbench ("a program designed to aid in the display, editing and 
querying of annotated speech corpora")
http://www.cogsci.ed.ac.uk/~dmck/MateCode/

These are by no means the only tools available (I have omitted xlabel, as it 
is no longer supported).
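
To make the "click on a word and listen" idea concrete, here is a minimal 
sketch in Python of what a word-level segmentation gives you: each word is 
paired with a start and end time, and those times select the matching stretch 
of the sound file. It is not taken from any of the tools above. The names 
WordInterval, make_silent_wav and slice_wav, and the times in the example, 
are hypothetical, and the sketch assumes an uncompressed WAV file read with 
Python's standard "wave" module. Playback itself is left out.

```python
import io
import wave
from typing import NamedTuple


class WordInterval(NamedTuple):
    """One entry of a (hypothetical) word-level segmentation."""
    word: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def make_silent_wav(seconds, rate=8000):
    """Build a mono 16-bit WAV of silence in memory, as stand-in test audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def slice_wav(source, start, end):
    """Return the audio between start and end (seconds) as WAV bytes.

    source may be a file name or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        params = wav.getparams()
        total = wav.getnframes()
        # Convert times to frame positions, clamped to the recording length.
        first = max(0, min(int(round(start * params.framerate)), total))
        last = max(first, min(int(round(end * params.framerate)), total))
        wav.setpos(first)
        frames = wav.readframes(last - first)

    out = io.BytesIO()
    with wave.open(out, "wb") as clip:
        clip.setnchannels(params.nchannels)
        clip.setsampwidth(params.sampwidth)
        clip.setframerate(params.framerate)
        clip.writeframes(frames)
    return out.getvalue()


# Example: a made-up two-word alignment over one second of audio.
recording = make_silent_wav(seconds=1.0)
alignment = [
    WordInterval("hello", 0.10, 0.45),
    WordInterval("world", 0.50, 0.90),
]
clicked = alignment[1]  # the word the user "clicked on"
clip_bytes = slice_wav(io.BytesIO(recording), clicked.start, clicked.end)
```

The resulting clip_bytes can be written to a file or handed to whatever 
audio player is available. The point is only that the alignment, once made, 
reduces "listen to this word" to a lookup of two time stamps.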

Best regards

Briony Williams