[Corpora-List] Summary: Speech corpora by register

L Carmichael lesley at u.washington.edu
Sun Jun 1 17:33:24 UTC 2003


Hi Corpora List,

Some time ago, I asked the list for leads on (American English) speech
corpora of different registers of speech (i.e., controlled or labeled for
context, such as 'teacher talk,' 'doctor talk,' speech directed at
non-native speakers, lectures, casual speech between friends, etc.). Many
people wrote to me with recommendations for corpora that would be suitable
for text analysis. While my goal is actually to find corpora of SOUND
files, this information is still tremendously helpful (thank you!). Here,
at last, is a summary:

1. Santa Barbara Corpus of Spoken American English (from LDC)
2. Switchboard (LDC)
3. CallHome (LDC)
4. MICASE (Michigan Corpus of Academic Spoken English) (freely available,
searchable online - http://www.lsa.umich.edu/eli/micase/micase.htm)
5. Saarbruecken Corpus of Spoken English (limited genres, mostly jokes)
6. T2K-SWAL (not publicly available)
7. Corpus of Spoken Professional English
(http://www.athel.com/corpdes.html)
8. Longman Grammar of Spoken and Written English (not publicly available;
overlaps with British National Corpus and Santa Barbara corpus)
9. British National Corpus (sound files may be available)
10. British Academic Spoken English
11. Dialogue Diversity Corpus (no speech files)
http://www-rcf.usc.edu/~billmann/diversity
12. Intonational Variation in English (IViE)
13. The London-Lund Corpus of Spoken English
14. The Lancaster/IBM SEC Corpus, The Machine-Readable Corpus of Spoken
English
15. The Wellington Corpus of Spoken New Zealand English (WSC)
16. The Bergen Corpus of London Teenage Language (COLT)
17. The International Corpus of English - East African component
18. The Polytechnic of Wales Corpus (children talking)
(Items 13-18 are from ICAME; corpora and manuals are available at
http://www.hit.uib.no/icame/cd/)
19. CIRCLE Corpus, http://www.pitt.edu/~circle/Archive.htm
20. TRAINS Dialogue Corpus
http://www.cs.rochester.edu/research/cisd/resources/trains.html
21. ICE Singapore English Corpus
http://www-rcf.usc.edu/~billmann/diversity/ICE-SIN_Manual.PDF
22. Corpus meta-site http://devoted.to/corpora

Also, I want to share with you some of the comments I received:

1. One researcher who is extracting dialogue patterns mentioned that the
variation in annotation/markup presents problems for such work.
2. One researcher is seeking corpora of internet chat, so please post to
the list if you know of any!
3. It's clear that there are more well-developed resources for British
English than for American English.
4. Actual sound files are hard to come by (*please* post if you know of
any resources for American English speech not listed here!)
5. The researchers who responded to me were also interested in hearing of
other spoken American English corpora (please post if you know of others
not mentioned herein)

Thank you to all who helped me (David Lee, Bill Mann, Eric Atwell, Eric
Breck)! Your detailed assistance is sincerely appreciated!

Lesley Carmichael
Department of Linguistics
University of Washington


