Corpora: sgml detagger
Michael Betsch
Michael.Betsch at uni-tuebingen.de
Wed Apr 17 07:44:54 UTC 2002
It will probably be easier to use an existing SGML parser than to
write a script that can really identify _all_ possible tags and
remove them.
In the output format of the (freely available) parser onsgmls, all
data content appears on lines of its own, each prefixed by a "-". So
you can simply run onsgmls on your SGML files and retain only those
lines that start with "-" (using 'grep -e "^-"'); then you can
easily remove the leading "-" with perl or something similar. This
assumes that all data is text you actually want and not, e.g.,
JavaScript, which you will probably not want to include in your
corpus.
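
For illustration, the filtering step could be sketched in Python
roughly as follows. This is only a sketch, not a tested tool: the
function names extract_data and unescape are made up for this example,
and the escape handling rests on my understanding that the parser
writes a record end inside data as "\n" and a literal backslash as
"\\". Other escapes (e.g. octal character codes) are left untouched.

```python
import re


def extract_data(esis_lines):
    """Keep only data lines (those starting with '-') and drop the prefix."""
    chunks = []
    for line in esis_lines:
        line = line.rstrip("\n")      # strip the real line terminator
        if line.startswith("-"):      # "-" marks a data line
            chunks.append(line[1:])   # remove the leading "-"
    return chunks


def unescape(chunk):
    """Turn the escapes \\n and \\\\ into a newline and a backslash.

    Assumption: only these two escapes are handled; anything else
    is passed through unchanged.
    """
    return re.sub(
        r"\\(n|\\)",
        lambda m: "\n" if m.group(1) == "n" else "\\",
        chunk,
    )


if __name__ == "__main__":
    import sys

    # Read parser output from standard input, print the plain text.
    for chunk in extract_data(sys.stdin):
        print(unescape(chunk))
```

Saved under a name of your choice (say detag.py), it would be used by
piping the output of onsgmls into it. Unwanted content such as scripts
still has to be filtered out separately.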
--
_______________________________________________________________________
Dr. Michael Betsch privat:
SFB 441, Projekt B1
Nauklerstraße 35 Rappenberghalde 27
72074 Tübingen 72070 Tübingen
Tel. 07071/29-77161 Tel. 07071/51917
email: Michael.Betsch at uni-tuebingen.de
_______________________________________________________________________