Corpora: Re: English dialogue corpora
Matthew Purver
matthew.purver at kcl.ac.uk
Wed Oct 18 12:20:02 UTC 2000
As promised, here's a summary of the information sent to me by helpful
people in response to my query about English dialogue corpora. Thanks to
all who helped.
Matt
--
Matthew Purver <matthew.purver at kcl.ac.uk>
Computational Linguistics and Natural Language Processing Group
Department of Computer Science
King's College London, Strand, London WC2R 2LS
---------- Forwarded message ----------
> Corpus of Spoken Professional American-English (CSPA):
> Size:
> 1M words academic committee meetings
> 1M words White House press conferences
> Cost:
> 79 dollars US (or 49 without PoS tags) (from Athelstan)
> Features:
> SGML but little detail (no prosody, overlaps). PoS tags.
>
> Santa Barbara Corpus of Spoken American English (CSAE):
> (forms the US part of the ICE)
> Size:
> 14 texts of 15-30 mins each / 3 CD-ROMs
> Cost:
> 75 dollars US (from LDC)
> Features:
> Overlaps, timing, prosody. No PoS tags.
>
> CALLHOME:
> Size:
> 230K words / 120 texts of 5 or 10 mins each (telephone conversations)
> Cost:
> 500 dollars US (from LDC)
> Features:
> Not SGML - transcripts only.
>
> SWITCHBOARD:
> Size:
> 3M words / 2400 texts (telephone conversations) / 1 CD-ROM
> Cost:
> 100 dollars US (from LDC)
> Features:
> Not SGML, but includes overlaps, pauses, non-speech events, timings.
>
> Verbmobil:
> Size:
> about 500 dialogues / 3 CD-ROMs
> Cost:
> 255 euro (150 pounds) from ELDA
> Features:
> Most of Verbmobil is German - these 3 CDs are the English part - some
> German words & "Denglish" though.
> Not sure of format - probably straight transliteration.
>
> British National Corpus (BNC):
> Size:
> natural dialogue (volunteer wearing microphone, others unaware):
> 4M words / 153 texts / 85 Mb
> context-governed (meetings etc.):
> 6M words / 762 texts / 100 Mb
> Cost:
> 220 pounds from OU, or 245 euro (= 150 pounds) from ELDA
> Features:
> SGML (DTD available), PoS tags (CLAWS), speakers, timing, overlaps,
> prosody.
>
> International Corpus of English, GB section (ICE-GB):
> Size:
> about 0.6M words in 180 dialogue texts, of which 100 private
> conversations, 80 context-governed
> Cost:
> 300 pounds from UCL
> Features:
> SGML, PoS tags, speakers, timing, overlaps, parse tree,
> no prosody.
>
> London-Lund Corpus (LLC):
> Size:
> 0.5M words / 100 texts, about 75% spontaneous dialogue (some
> surreptitious)
> Cost:
> 3500 Norwegian kroner (= 260 pounds) as part of ICAME CD
> Features:
> NOT standard SGML, includes prosody but no PoS tags.
>
> Bergen Corpus of London Teenage Language (COLT):
> Size:
> 0.5M words, all spontaneous dialogue (volunteer wearing microphone,
> others unaware)
> Cost:
> (part of ICAME CD)
> Features:
> SGML, includes PoS tags (CLAWS), speakers, prosody.
>
> Wellington Corpus of Spoken English (WSC):
> Size:
> 0.5M words conversation (non-surreptitious), small amounts of
> telephone, interviews etc.
> Cost:
> (part of ICAME CD)
> Features:
> SGML including prosody, no PoS tags. Some Maori words.
>
> Edinburgh HCRC Map Task Corpus (MTC):
> Size:
> 128 dialogue texts
> Cost:
> 165 pounds from Edinburgh, or 200 dollars US (136 pounds) from LDC
> Features:
> SGML, includes actual recordings
>
> TRAINS spoken dialog corpus:
> Size:
> 55K words / 98 texts / 1 CD-ROM of task-oriented (goods shipment in
> railway system) dialogues
> Cost:
> 150 dollars US (= 103 pounds) from LDC
> Features:
> Plain text transcription, includes actual recordings