Corpora: sgml detagger

Michael Betsch Michael.Betsch at uni-tuebingen.de
Wed Apr 17 07:44:54 UTC 2002


It will probably be easier to use an existing SGML parser than to
write a script that really identifies _all_ possible tags and
removes them.

The (freely available) parser onsgmls puts all data content on lines
of its own in its output, each prefixed by a "-". So you can simply
run onsgmls on your SGML files, keep only the lines that start with
"-" (e.g. with 'grep -e "^-"'), and then strip the leading "-" with
perl or something similar. This assumes that all data content is
wanted; a page may also contain e.g. JavaScript, which you will
probably not want to include in your corpus.
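
As a rough illustration of the filtering step, here is a minimal
Python sketch that does the job of grep plus the perl clean-up. It
takes the lines that onsgmls printed, keeps the ones starting with
"-", and drops the prefix. The function name extract_data is made up
for this example. The escape handling (\\ for a backslash, \n for a
record end, \| around SDATA entities, \nnn for an octal character
code) is my reading of the onsgmls output format and should be
checked against the documentation of your version.

```python
import re

# Escapes inside data lines, as assumed from the onsgmls output format:
# \\ = backslash, \n = record end, \| = SDATA bracket, \nnn = octal code.
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(match):
    """Translate one escape sequence into the text it stands for."""
    token = match.group(1)
    if token == "\\":
        return "\\"
    if token == "n":
        return "\n"
    if token == "|":
        return ""  # SDATA brackets carry no text of their own
    return chr(int(token, 8))


def extract_data(esis_lines):
    """Return the text of all data lines (those starting with "-").

    esis_lines is any iterable of onsgmls output lines, e.g. an open
    file or sys.stdin. Every other line type (start tags, end tags,
    attributes, ...) is ignored.
    """
    chunks = []
    for line in esis_lines:
        line = line.rstrip("\r\n")
        if line.startswith("-"):
            chunks.append(_ESCAPE.sub(_unescape, line[1:]))
    return "".join(chunks)
```

To use it, pipe the output of onsgmls into a small script that calls
extract_data(sys.stdin) and prints the result. Note that this keeps
the content of script elements as well, so JavaScript still has to be
filtered out separately.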
--

_______________________________________________________________________
Dr. Michael Betsch                                              privat:
SFB 441, Projekt B1
Nauklerstraße 35                                     Rappenberghalde 27
72074 Tübingen                                           72070 Tübingen
Tel. 07071/29-77161                                    Tel. 07071/51917
email: Michael.Betsch at uni-tuebingen.de
_______________________________________________________________________
