span?
David Reitter
reitter at MLE.MEDIA.MIT.EDU
Tue Aug 26 09:51:57 UTC 2003
Aliza Yahav wrote:
> Could someone explain to me exactly how a 'span' is determined? Is it
> a grammatical unit? A pragmatic unit?
It depends on your approach. People described "ideal" views on this list
yesterday; I can add a description of the shallow segmentation approach we
took when we collected the Potsdam corpus of RST-annotated German newspaper
texts:
We wanted an inexpensive and automatic heuristic algorithm to create
segments that come close to clauses. We POS tagged each text and applied the
following heuristics:
- Every text span between sentence-boundary punctuation marks (., !, :, ?)
is a discourse unit.
- Every text span between sentence-boundary punctuation marks and/or commas
that contains a finite verb is a discourse unit.
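
The two heuristics above can be sketched roughly as follows. This is only a minimal illustration, not the code used for the Potsdam corpus. It assumes the input is already POS tagged as (token, tag) pairs and that finite verbs carry the STTS tags VVFIN, VAFIN or VMFIN; the tagset, the function name segment and the greedy left-to-right handling of comma-delimited fragments without a finite verb (they are merged into the following material) are all assumptions made for the example.

```python
# Sentence-boundary punctuation as listed in the heuristics above.
SENTENCE_END = {".", "!", "?", ":"}

# Assumption: STTS tags for finite verbs; the post does not name a tagset.
FINITE_VERB_TAGS = {"VVFIN", "VAFIN", "VMFIN"}


def segment(tagged_tokens):
    """Split a list of (token, tag) pairs into discourse units.

    A unit is closed at every sentence-boundary punctuation mark, and at
    a comma only if the span collected so far contains a finite verb.
    """
    units = []
    current = []
    has_finite_verb = False
    for token, tag in tagged_tokens:
        current.append(token)
        if tag in FINITE_VERB_TAGS:
            has_finite_verb = True
        if token in SENTENCE_END or (token == "," and has_finite_verb):
            units.append(current)
            current = []
            has_finite_verb = False
    if current:
        # Trailing material without closing punctuation becomes a unit too.
        units.append(current)
    return units


example = [
    ("Er", "PPER"), ("kam", "VVFIN"), (",", "$,"),
    ("weil", "KOUS"), ("es", "PPER"), ("regnete", "VVFIN"), (".", "$."),
]
print(segment(example))
# prints [['Er', 'kam', ','], ['weil', 'es', 'regnete', '.']]
```

A comma-delimited fragment without a finite verb, such as a fronted adverbial, stays attached to what follows, so "Gestern , am Abend , kam er ." comes out as a single unit in this sketch.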
Of course, the results are far from perfect; however, they provide a
consistent and cheap segmentation. To my surprise, the human annotators who
built trees using this automatic segmentation could work with it fairly
well. Now, one could/should conduct an evaluation of these heuristics by
comparing them to the manual segmentation in Dan Marcu's LDC corpus.
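
One possible way to set up such a comparison is to score the automatic unit boundaries against the manual ones. The sketch below is only an assumption about how this might be done: the choice of precision, recall and F1 over boundary positions, the function name boundary_scores, and the representation of boundaries as token offsets are all hypothetical and not taken from any existing evaluation.

```python
def boundary_scores(predicted, gold):
    """Compare two collections of boundary positions (token offsets).

    Returns (precision, recall, f1) of the predicted boundaries, treating
    the gold boundaries as the manual reference segmentation.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up offsets, purely for illustration.
precision, recall, f1 = boundary_scores([3, 7, 12], [3, 12, 15, 20])
```

With these made-up offsets, two of the three predicted boundaries match two of the four manual ones.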
Best
David Reitter
--
David Reitter
Research Fellow. Adaptive Speech Interfaces
MIT Media Lab Europe, Dublin, Ireland.
www.mle.media.mit.edu