data

Brian MacWhinney macw at CMU.EDU
Wed Oct 9 20:07:40 UTC 2002


Bill,

  Thanks for the note on dialog corpora on the web and elsewhere.  The
status of these things is changing so fast that it is hard to keep up, but
let me add a few notes:

1.  The CHILDES database (http://childes.psy.cmu.edu), which you correctly
note as having been available on CD-ROM for many years, is now available with
a lot of associated audio and even, in a few cases, video.  It has now been
converted (1) to Unicode, allowing inclusion of IPA and non-Roman scripts in
the same file, and (2) to XML, allowing lots of new uses.

2.  The Nixon Watergate tapes you mentioned are being retranscribed in CA
format by Gail Jefferson with assistance from Johannes Wagner.  These data
are at http://talkbank.org/data/MOVIN/ along with four other corpora from
English, Danish, German, and Italian.

3.  The Santa Barbara Corpus of Spoken American English is available with
linked audio from http://talkbank.org/data/conversation/, together with
some of the most interesting conversations from the LDC CallFriend database.
LDC would be willing to release further segments of CallFriend if there were
good evidence of demand.

4.  The talkbank.org/data site has a lot of fascinating video linked to
transcripts for those who believe that a full study of discourse requires
not only transcripts and audio, but also video.  Examples include PBL
instruction in med school, clinical interviews, meetings with parolees, talk
shows, classroom discourse, and on and on.  We even have databases on bird
song, macaque calls, and meerkat squeaks.

Three major goals for the near term here are (1) to try to improve the links
between these resources so that users do not have to wander through a
labyrinth of URLs, formats, and special permissions, (2) to link all discourse
to audio, and (3) to broaden coverage across languages and discourse types.
John Haviland's data at http://talkbank.org/data/exploration/Haviland/
illustrate this last direction.

Suggestions for additions to both TalkBank and CHILDES are welcome, as are
requests for new programs and data formats.

Let me also note that TalkBank and CHILDES data are freely downloadable
through the web, but we ask that users follow the ground rules given at the
sites.

--Brian MacWhinney, CMU


