Corpora: Santa Barbara Corpus

Christopher Cieri ccieri at ldc.upenn.edu
Fri Aug 4 20:42:31 UTC 2000


Hi Ute,

I can certainly try to help. Because we expected the Santa Barbara
Corpus of Spoken American English (SBCSAE) to be used in multiple
research communities where different computer platforms and software are
common, we have tried to avoid depending upon any specific set of tools.
The corpus contains only data; there is no software to install. Indeed,
the data is stored on the CDs in uncompressed format so that you can
read the transcripts or listen to the audio directly from CD. To load
the entire corpus onto your hard drive would require between 2 and 3 GB
of storage. However, if you have that much spare space, doing so would
certainly improve the response of whatever software you use.

There are three kinds of files in the /speech directory on each of the
CDs:
    1. The transcripts (.trn) are in plain text format. You can read
them with any editor or word processor (Notepad or WordPad under
Windows, SimpleText under MacOS, Word in either case, emacs or pico
under Unix).
    2. The audio files (.wav) are in Wave format. This is probably the
most commonly supported audio format on PCs and is increasingly
supported on Macs as well. On both PCs and Macs, you should be able to
use a basic sound player (Media Player under Windows) or appropriately
configured WWW browser to play the sound files.
    3. The .flt files indicate which regions of the audio files have
passed through an acoustic filter. The occasional bit of personal
information is marked in the transcripts with a tilde and is filtered to
make it unrecognizable in the corresponding audio files. The .flt files
show where the filtering occurs for folks who care about the acoustic
signal.
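
If you do copy the /speech directory to your hard drive, a few lines of script are enough to check that all three kinds of files arrived. The following is only a rough sketch in Python; the function name group_by_extension and any path you pass to it are illustrative assumptions, not something shipped with the corpus.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(speech_dir):
    """Return a dict mapping lowercase extension -> sorted list of file names.

    speech_dir is wherever you copied (or mounted) the /speech directory;
    the location is an assumption and will differ per machine.
    """
    groups = defaultdict(list)
    for path in sorted(Path(speech_dir).iterdir()):
        if path.is_file():
            # Lowercase the suffix so ".TRN" and ".trn" are grouped together.
            groups[path.suffix.lower()].append(path.name)
    return dict(groups)
```

Calling group_by_extension on your copy of the directory should give you entries for ".trn", ".wav" and ".flt", which you can compare against each other to spot a transcript that is missing its audio.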

Note that the transcription specification for SBCSAE is particularly
rich. You can find a description of it at:
http://linguistics.ucsb.edu/research/sbcorpus/. If you concordance the
transcripts and want to ignore some of the distinctions SBCSAE encodes,
you may need to remove some of the mark-up or train your software to
ignore it. To give one example, = is used in SBCSAE to mark lengthening.
Words have an = inserted in them to show that a segment has been
lengthened. If your work is unconcerned with lengthening, you may want
to remove or ignore all of the = so that "him" and "hi=m" are treated as
the same word. With that caveat, I expect most concordancing software to
work well with the corpus. I just downloaded the demo version of
WordSmith (www.liv.ac.uk/~ms2928/) and as far as I could tell it seemed
to do fine. If any of the software developers on the list anticipate
difficulties using SBCSAE with their software, I'd be interested to
know.
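
For those who would rather strip the mark-up before concordancing, here is a minimal sketch in Python of the lengthening example above. It handles only the = mark; the function names and the two sample lines are made up for illustration, and real transcript lines carry other mark-up that this does not touch.

```python
from collections import Counter

LENGTHENING_MARK = "="  # SBCSAE inserts = into a word to mark lengthening


def strip_lengthening(text):
    """Remove every lengthening mark so that "hi=m" becomes "him"."""
    return text.replace(LENGTHENING_MARK, "")


def word_counts(lines):
    """Count whitespace-separated words after removing lengthening marks."""
    counts = Counter()
    for line in lines:
        counts.update(strip_lengthening(line).lower().split())
    return counts


# Invented sample lines, not taken from the corpus.
sample = ["I saw hi=m yesterday", "and then him again"]
print(word_counts(sample)["him"])  # prints 2
```

With the marks removed, "him" and "hi=m" fall together in the count. If you need to keep the original transcripts intact, run this on a copy or apply it only when reading the files.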

Please let me know if this helps.

Best wishes,
Chris


Ute Römer wrote:

>  Hi linguists! Some days ago a colleague of mine purchased the new
> Santa Barbara Corpus of Spoken American English - Part I. The problem
> is that she only received a three CD package and nothing else (no
> instructions how to install the corpus or how and with what kind of
> concordancer to use it). We tried to install the corpus yesterday but
> it didn't work. Could anybody on the list possibly help us? Thanks a
> lot in advance!
> Best wishes
> Ute
> ute.roemer at uni-koeln.de

--
Christopher Cieri
Executive Director, Linguistic Data Consortium
3615 Market Street, Philadelphia, PA 19104-2608 USA
phone: 215-573-5489, fax: 215-573-2175
mailto:Christopher.Cieri at ldc.upenn.edu
http://www.ldc.upenn.edu
