Corpora: Re: English dialogue corpora

Matthew Purver matthew.purver at kcl.ac.uk
Wed Oct 18 12:20:02 UTC 2000


As promised, here's a summary of the information sent to me by helpful
people in response to my query about English dialogue corpora. Thanks to
all who helped.

Matt

--
Matthew Purver  <matthew.purver at kcl.ac.uk>

Computational Linguistics and Natural Language Processing Group
Department of Computer Science
King's College London, Strand, London WC2R 2LS

---------- Forwarded message ----------

> Corpus of Spoken Professional American-English (CSPA):
> Size:
>  1M words academic committee meetings
>  1M words White House press conferences
> Cost:
>  79 dollars US (or 49 without PoS tags) (from Athelstan)
> Features:
>  SGML but little detail (no prosody, overlaps). PoS tags.
>
> Santa Barbara Corpus of Spoken American English (CSAE):
> (forms the US part of the ICE)
> Size:
>  14 texts of 15-30 mins each / 3 CD-ROMs
> Cost:
>  75 dollars US (from LDC)
> Features:
>  Overlaps, timing, prosody. No PoS tags.
>
> CALLHOME:
> Size:
>  230K words / 120 texts of 5 or 10 mins each (telephone conversations)
> Cost:
>  500 dollars US (from LDC)
> Features:
>  Not SGML - transcripts only.
>
> SWITCHBOARD:
> Size:
>  3M words / 2400 texts (telephone conversations) / 1 CD-ROM
> Cost:
>  100 dollars US (from LDC)
> Features:
>  Not SGML, but includes overlaps, pauses, non-speech events, timings.
>
> Verbmobil:
> Size:
>  about 500 dialogues / 3 CD-ROMs
> Cost:
>  255 euro (= 150 pounds) from ELDA
> Features:
>  Most of Verbmobil is German - these 3 CDs are the English part - some
>  German words & "Denglish" though.
>  Not sure of format - probably straight transliteration.
>
> British National Corpus (BNC):
> Size:
>  natural dialogue (volunteer wearing microphone, others unaware):
>  4M words / 153 texts / 85 Mb
>  context-governed (meetings etc.):
>  6M words / 762 texts / 100 Mb
> Cost:
>  220 pounds from OU, or 245 euro (= 150 pounds) from ELDA
> Features:
>  SGML (DTD available), PoS tags (CLAWS), speakers, timing, overlaps,
>  prosody.
>
> International Corpus of English, GB section (ICE-GB):
> Size:
>  about 0.6M words in 180 dialogue texts, of which 100 private
>  conversations, 80 context-governed
> Cost:
>  300 pounds from UCL
> Features:
>  SGML, PoS tags, speakers, timing, overlaps, parse tree,
>  no prosody.
>
> London-Lund Corpus (LLC):
> Size:
>  0.5M words / 100 texts, about 75% spontaneous dialogue (some
>  surreptitious)
> Cost:
>  3500 Norwegian kroner (= 260 pounds) as part of ICAME CD
> Features:
>  NOT standard SGML, includes prosody but no PoS tags.
>
> Bergen Corpus of London Teenage Language (COLT):
> Size:
>  0.5M words, all spontaneous dialogue (volunteer wearing microphone,
>  others unaware)
> Cost:
>  (part of ICAME CD)
> Features:
>  SGML, includes PoS tags (CLAWS), speakers, prosody.
>
> Wellington Corpus of Spoken English (WSC):
> Size:
>  0.5M words conversation (non-surreptitious), small amounts of
>  telephone, interviews etc.
> Cost:
>  (part of ICAME CD)
> Features:
>  SGML including prosody, no PoS tags. Some Maori words.
>
> Edinburgh HCRC Map Task Corpus (MTC):
> Size:
>  128 dialogue texts
> Cost:
>  165 pounds from Edinburgh, or 200 dollars US (= 136 pounds) from LDC
> Features:
>  SGML, includes actual recordings.
>
> TRAINS spoken dialog corpus:
> Size:
>  55K words / 98 texts / 1 CD-ROM of task-oriented (goods shipment in
>  railway system) dialogues
> Cost:
>  150 dollars US (= 103 pounds) from LDC
> Features:
>  Plain text transcription, includes actual recordings.


