Corpora: Santa Barbara Corpus

Christopher Cieri ccieri at ldc.upenn.edu
Tue Aug 8 00:12:21 UTC 2000


Lou, Chris,

Thanks for reminding us of a crucial issue in corpus distribution. Implicit in
this discussion is, I think, an acknowledgement that any single format will
appeal to the research communities that have adopted it but not
necessarily to others. We chose the particular formats used for SBCSAE after
consulting with the corpus developers, who felt that those formats would be
most appropriate to the research communities most likely to use the data.
However, I don't want to imply that this closes the discussion. As you know,
LDC is very interested in the issues of standards and tools for access to
shared data and the problems of corpus reuse and reannotation (see
http://www.itl.nist.gov/iaui/894.01/atlas/,
http://www.ldc.upenn.edu/sb/isle.html, http://www.talkbank.org/,
http://www.ldc.upenn.edu/Papers/LREC2000/multiuse.pdf). We welcome suggestions
on ways to make our corpora more useful and would certainly consider any
reasonable request from a research community to provide data in an alternate
format.

Best wishes,
Chris

Lou Burnard wrote:

> On Mon, 7 Aug 2000, Chris Manning wrote:
>
> |On 7 August 2000, Lou Burnard wrote:
> | > Hmm. So instead of using pre-existing standards which at least have a
> | > chance of being implemented across different computer platforms, it's
> | > better to make up an entirely arbitrary set of codes of your own for
> | > which *everyone* has to write their own software?
> |
> |This is a little harsh.  The transcription format used has existed and
> |been developed for many years in the conversational/discourse analysis
> |community -- and versions of it can be found in books such as Edwards'
> |Talking Data: Transcription and Coding in Discourse Research or
> |Schiffrin's Approaches to Discourse.
> |
> |At most the LDC could be faulted for leaving the data in such a format
> |-- one clearly designed more for human observation than easy computer
> |manipulation -- rather than converting it to a more computer friendly
> |standard markup.
>
> Fair point, well made. Thanks Chris! Put the harshness down to my
> general gloom at being confronted with 300 email messages after ten
> days on a beach in Northern Portugal... But the devil in all digital
> affairs is in the detail and it's that phrase "versions of it" that gives
> away why it's a retrograde step for a project with such high visibility,
> importance, and resourcing to distribute such wonderful data in a way
> that makes it REALLY DIFFICULT for a computer to analyse it. If we're only
> concerned to produce data for humans to read, let's print it out on bits
> of paper.
>
> Lou
>
>  ----------------------------------------------------------------
>  Lou Burnard                           http://users.ox.ac.uk/~lou
>  ----------------------------------------------------------------

--
Christopher Cieri
Executive Director, Linguistic Data Consortium
3615 Market Street, Philadelphia, PA 19104-2608 USA
phone: 215-573-5489, fax: 215-573-2175
mailto:Christopher.Cieri at ldc.upenn.edu
http://www.ldc.upenn.edu
