span?

David Reitter reitter at MLE.MEDIA.MIT.EDU
Tue Aug 26 09:51:57 UTC 2003


Aliza Yahav wrote:

> Could someone explain to me exactly how a 'span' is determined? Is it
> a grammatical unit? A pragmatic unit?

It depends on your approach. People described "ideal" views on this list
yesterday, and I can add a description of the shallow segmentation approach
we took when we collected the Potsdam corpus of RST-annotated German
newspaper texts:

We wanted an inexpensive, automatic heuristic algorithm to create segments
that come close to clauses. We POS-tagged each text and applied the
following heuristics:

- Every text span between sentence-boundary punctuation marks (., !, :, ?)
is a discourse unit.
- Every text span between sentence-boundary punctuation marks and/or commas
that contains a finite verb is a discourse unit.
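
To make the two rules concrete, here is a minimal sketch in Python. It is
only one possible reading of the heuristics, not the code used for the
corpus. It assumes the input is a list of (token, tag) pairs and that
finite verbs carry the STTS tags VVFIN, VAFIN or VMFIN; the tagset, the
function name segment and the constant names are assumptions made for
illustration. A comma closes a unit only if the text collected since the
last boundary contains a finite verb; sentence-boundary punctuation always
closes a unit.

```python
# Sentence-boundary punctuation marks named in the heuristics.
SENTENCE_END = {".", "!", ":", "?"}

# Assumption: STTS tags for finite verbs (full, auxiliary, modal).
FINITE_VERB_TAGS = {"VVFIN", "VAFIN", "VMFIN"}


def segment(tagged):
    """Split a list of (token, pos_tag) pairs into discourse units."""
    units = []
    current = []
    has_finite = False
    for token, tag in tagged:
        current.append(token)
        if tag in FINITE_VERB_TAGS:
            has_finite = True
        # Rule 1: sentence-boundary punctuation always ends a unit.
        # Rule 2: a comma ends a unit only if the span has a finite verb.
        if token in SENTENCE_END or (token == "," and has_finite):
            units.append(current)
            current = []
            has_finite = False
    if current:
        # Keep trailing tokens that lack closing punctuation.
        units.append(current)
    return units


clause_example = [
    ("Er", "PPER"), ("sagte", "VVFIN"), (",", "$,"),
    ("dass", "KOUS"), ("er", "PPER"), ("kommt", "VVFIN"), (".", "$."),
]
list_example = [
    ("Hans", "NE"), (",", "$,"), ("Peter", "NE"), ("und", "KON"),
    ("Maria", "NE"), ("kommen", "VVFIN"), (".", "$."),
]
print(segment(clause_example))
print(segment(list_example))
```

In the first example the comma follows a finite verb, so the sentence is
split into two units. In the second, the comma separates list items before
any finite verb has been seen, so the sentence stays one unit.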

Of course, the results are far from perfect, but they provide a consistent
and cheap segmentation. To my surprise, human annotators who built trees
using this automatic segmentation could work with it fairly well. A natural
next step would be to evaluate these heuristics by comparing their output
to the manual segmentation in Dan Marcu's LDC corpus.

Best
David Reitter


--
David Reitter
Research Fellow. Adaptive Speech Interfaces
MIT Media Lab Europe, Dublin, Ireland.
www.mle.media.mit.edu


