Corpora: Santa Barbara Corpus

Mon Aug 7 15:55:38 UTC 2000

On Mon, 7 Aug 2000, Chris Manning wrote:

|On 7 August 2000, Lou Burnard wrote:
| > Hmm. So instead of using pre-existing standards which at least have a
| > chance of being implemented across different computer platforms, it's
| > better to make up an entirely arbitrary set of codes of your own for
| > which *everyone* has to write their own software?
|
|This is a little harsh.  The transcription format used has existed and
|been developed for many years in the conversational/discourse analysis
|community -- and versions of it can be found in books such as Edwards'
|Talking Data: Transcription and Coding in Discourse Research or
|Schiffrin's Approaches to Discourse.
|
|At most the LDC could be faulted for leaving the data in such a format
|-- one clearly designed more for human observation than easy computer
|manipulation -- rather than converting it to a more computer friendly
|standard markup.

Fair point, well made. Thanks Chris! Put the harshness down to my
general gloom at being confronted with 300 email messages after ten
days on a beach in Northern Portugal... But the devil in all digital
affairs is in the detail, and it's that phrase "versions of it" that gives
away why it's a retrograde step for a project with such high visibility,
importance, and resourcing to distribute such wonderful data in a way
that makes it REALLY DIFFICULT for a computer to analyse. If we're only
concerned with producing data for humans to read, let's print it out on
bits of paper.

Lou

 ----------------------------------------------------------------
 Lou Burnard                           http://users.ox.ac.uk/~lou
 ----------------------------------------------------------------