Corpora: sgml detagger

David Graff graff at unagi.cis.upenn.edu
Tue Apr 16 20:11:36 UTC 2002


> Does anybody know an easy way to remove [sgml] tags and save the texts
> as 'raw' .txt files?
> Maybe a PERL script?

Perl is very good for this.  If you're confident that _all_ the text
data in the sgml files (i.e. everything that is not an sgml tag) is
usable for down-stream processing, then this perl script would work
(even when sgml tags span multiple lines):

#!/usr/bin/perl

# undefine the input record separator
# (entire content of input file will be fetched in a single read;
# setting it to "" instead would select paragraph mode):

undef $/;

# assume that command line args are file names to be converted;
# for each input file, read it and write "file_name.raw"

foreach $file ( @ARGV ) {
    open( IN, $file ) or do { warn "can't open $file\n"; next; };
    $data = <IN>;
    close IN;
    (defined $data) or do { warn "can't read data from $file\n"; next; };

    $data =~ s/<[^>]+>//g;	# remove tags (strings bounded by "<...>")
    $data =~ s/\n\s+/\n/g;	# remove blank lines and leading whitespace (not essential)

    open( OUT, ">$file.raw" ) or do { warn "can't write $file.raw\n"; next; };
    print OUT $data or warn "can't write data to $file.raw\n";
    close OUT or die "error trying to close $file.raw\n";
}

__END__

However, it is not uncommon for sgml files to contain tags whose data
content is not human language; for example, you might find markup like
the following:

<DOC>
<DOCNO> AP891231-0001 </DOCNO>
<FILEID>AP-NR-12-31-89 2359EDT</FILEID>
<FIRST>r a PM-MonkeyBusiness     12-31 0269</FIRST>
<SECOND>PM-Monkey Business,0276</SECOND>
<HEAD>Yacht That Took Gary Hart On Famous Cruise Suffered From Fame</HEAD>
<DATELINE>DENVER (AP) </DATELINE>
<TEXT>
   Monkey Business, the yacht that helped sink Gary
Hart's presidential aspirations in 1988, is for sale, and its
...

(This example is drawn from an sgml file in the TIPSTER corpus.)  The
point is that you might want to filter out more than just the sgml tags,
if your down-stream process is going to treat everything that remains as
language data.

If the sgml markup makes it easy to identify what portion(s) you want to
keep, then a couple of additions to the Perl script above would suffice --
e.g. for the TIPSTER case, you could add these two lines just before the
line that removes all the tags (note that ".*" is greedy, so if a file
holds several <TEXT> regions, only the last one survives):

   $data =~ s/^.*<TEXT>//s;  # remove everything up to/including <TEXT>
   $data =~ s%</TEXT>.*%%s;  # remove </TEXT> and everything after it
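
If a file holds many <DOC> entries and you want the <TEXT> content of
every one of them, a non-greedy global match is one way to do it.  The
following is only a sketch under that assumption; the helper name
"text_regions" is made up for illustration, and it assumes that <TEXT>
regions are never nested and that the tag is always spelled exactly
"<TEXT>":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the tag-free content of every <TEXT>...</TEXT> region in $data
# (text_regions is a hypothetical helper name, used only for this sketch).
sub text_regions {
    my ($data) = @_;
    my @regions;
    # ".*?" is non-greedy, /s lets "." match newlines, /g finds every region
    while ( $data =~ m{<TEXT>(.*?)</TEXT>}gs ) {
        my $text = $1;
        $text =~ s/<[^>]+>//g;    # remove any tags inside the region
        push @regions, $text;
    }
    return @regions;
}

# as before: command line args are file names; write "file_name.raw"
foreach my $file (@ARGV) {
    my $in;
    unless ( open( $in, '<', $file ) ) { warn "can't open $file\n"; next; }
    my $data = do { local $/; <$in> };    # read the whole file at once
    close $in;
    unless ( defined $data ) { warn "can't read data from $file\n"; next; }

    my $out;
    unless ( open( $out, '>', "$file.raw" ) ) {
        warn "can't write $file.raw\n";
        next;
    }
    # one region per document, separated by a newline
    print {$out} join( "\n", text_regions($data) )
        or warn "can't write data to $file.raw\n";
    close $out or die "error trying to close $file.raw\n";
}
```

Because each region is captured separately, the header fields (DOCNO,
FILEID, etc.) of every document are dropped, not just those of the first
one.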

Depending on where your sgml files came from -- and if you have the DTD
that they are supposed to be based on -- it may be a good idea to
validate the tagging first, using a standard sgml parser, like nsgmls;
it's hard to create any kind of useful sgml filter when there are
mistakes in the tagging.  

For that matter, it's probably easier/safer to write a filter that works
on the output of an sgml parser, rather than on the raw sgml file.
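
As a rough illustration of that last idea, here is a sketch of a filter
for the line-oriented (ESIS) output that nsgmls writes: a line starting
with "(" opens an element, ")" closes it, and "-" carries character
data, with "\n", "\\", "\|" and "\nnn" (octal) used as escapes.  The sub
names "esis_text" and "unescape" are made up for this example, and it
assumes the parser reports the element name in upper case (e.g. "TEXT"),
which depends on the SGML declaration in use:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode the escapes used in ESIS data lines
# (unescape is a hypothetical helper name).
sub unescape {
    my ($s) = @_;
    $s =~ s{\\(\\|n|\||[0-7]{3})}{
          $1 eq '\\' ? '\\'           # "\\"   -> literal backslash
        : $1 eq 'n'  ? "\n"           # "\n"   -> record end (newline)
        : $1 eq '|'  ? ''             # "\|"   -> SDATA bracket, dropped
        :              chr( oct($1) ) # "\nnn" -> character with that octal code
    }ge;
    return $s;
}

# Return the character data found inside the named element
# (esis_text is a hypothetical helper name).
sub esis_text {
    my ( $element, @lines ) = @_;
    my $depth = 0;     # > 0 while we are inside the wanted element
    my $out   = '';
    foreach my $line (@lines) {
        chomp $line;
        if ( $line eq "($element" ) {
            $depth++;
        }
        elsif ( $line eq ")$element" ) {
            $depth-- if $depth > 0;
        }
        elsif ( $depth > 0 && $line =~ /^-(.*)$/ ) {
            $out .= unescape($1);
        }
    }
    return $out;
}

# usage: first command line arg is the element to keep; ESIS on stdin
if (@ARGV) {
    my $element = shift @ARGV;
    print esis_text( $element, <STDIN> );
}
```

Assuming the script is saved as esis_text.pl, it would be used along the
lines of "nsgmls file.sgml | perl esis_text.pl TEXT".  Since the parser
has already resolved the markup, the filter never has to guess where a
tag starts or ends.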

Best regards,

	Dave Graff


