Corpora: sgml detagger

Vlado Keselj vkeselj at cs.uwaterloo.ca
Tue Apr 16 20:40:28 UTC 2002


On Tue, 16 Apr 2002, David Graff wrote:

>
> > Does anybody know an easy way to remove [sgml] tags and save the texts
> > as 'raw' .txt files?
> > Maybe a PERL script?
>
> Perl is very good for this.  If you're confident that _all_ the text
> data in the sgml files (i.e. everything that is not an sgml tag) is
> usable for down-stream processing, then this perl script would work
> (even when sgml tags span multiple lines):

You also have to assume that:
 - no quoted string inside a tag contains a > sign (e.g., <div id="><">),
 - no comment contains a > sign, and
 - each file is small enough to fit in memory.
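
If the first two assumptions do not hold, the substitution can be made somewhat more tolerant. The following is only a minimal sketch: strip_tags is a made-up name, quotes inside tags are assumed to be balanced, and comments are assumed to have the simple <!-- ... --> form (full SGML comment syntax is more involved).

```perl
use strict;
use warnings;

# Sketch only: strip_tags() is a hypothetical helper, not part of the
# script quoted below.  It assumes quotes inside tags are balanced.
sub strip_tags {
    my ($data) = @_;

    # 1. remove comments first: "<!--", anything (shortest match), "-->"
    $data =~ s/<!--.*?-->//gs;

    # 2. remove tags; inside a tag a quoted string is consumed as a
    #    unit, so a > sign between quotes does not end the tag
    $data =~ s/<(?:"[^"]*"|'[^']*'|[^>"'])*>//g;

    return $data;
}

print strip_tags(qq{<div id="><">text</div><!-- a > b -->more\n});
```

The three alternatives inside the tag pattern start with different characters, so the match does not backtrack between them. A tag with an unbalanced quote would still confuse it.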

Vlado

>
> #!/usr/bin/perl
>
> # undefine the input record separator ("slurp" mode), so that the
> # entire content of each input file is fetched in a single read
> # (setting $/ to "" would select paragraph mode instead):
>
> undef $/;
>
> # assume that command line args are file names to be converted;
> # for each input file, read it and write "file_name.raw"
>
> foreach my $file ( @ARGV ) {
>     open( my $in, '<', $file ) or do { warn "can't open $file: $!\n"; next; };
>     my $data = <$in>;
>     close $in;
>     (defined $data) or do { warn "can't read data from $file\n"; next; };
>
>     $data =~ s/<[^>]+>//g;	# remove tags (strings bounded by "<...>")
>     $data =~ s/\n\s+/\n/g;	# remove blank lines and leading whitespace (not essential)
>
>     open( my $out, '>', "$file.raw" ) or do { warn "can't write $file.raw: $!\n"; next; };
>     print $out $data or warn "can't write data to $file.raw: $!\n";
>     close $out or die "error trying to close $file.raw: $!\n";
> }
>
> __END__
>
> However, it is not uncommon for sgml files to contain tags whose data
> content is not human language; for example, you might find markup like
> the following:
>
> <DOC>
> <DOCNO> AP891231-0001 </DOCNO>
> <FILEID>AP-NR-12-31-89 2359EDT</FILEID>
> <FIRST>r a PM-MonkeyBusiness     12-31 0269</FIRST>
> <SECOND>PM-Monkey Business,0276</SECOND>
> <HEAD>Yacht That Took Gary Hart On Famous Cruise Suffered From Fame</HEAD>
> <DATELINE>DENVER (AP) </DATELINE>
> <TEXT>
>    Monkey Business, the yacht that helped sink Gary
> Hart's presidential aspirations in 1988, is for sale, and its
> ...
>
> (This example is drawn from an sgml file in the TIPSTER corpus.)  The
> point is that you might want to filter out more than just the sgml tags,
> if your down-stream process is going to treat everything that remains as
> language data.
>
> If the sgml markup makes it easy to identify what portion(s) you want to
> keep, then a couple additions to the Perl script above would suffice --
> e.g. for the TIPSTER case, you could add these two lines just before the
> line that removes all the tags:
>
>    $data =~ s/^.*<TEXT>//s;  # remove everything up to/including <TEXT>
>    $data =~ s%</TEXT>.*%%s;  # remove </TEXT> and everything after it
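
Note that .* is greedy, so the two substitutions above keep a single text region per file (the one after the last <TEXT>). If a file holds several documents, each with its own <TEXT> element, something along the following lines may be closer to what is wanted. It is only a sketch: extract_text is a made-up name, and it assumes that <TEXT> elements are never nested and carry no attributes.

```perl
use strict;
use warnings;

# Sketch only: extract_text() is a hypothetical helper.  It assumes
# <TEXT> elements are never nested and carry no attributes.
sub extract_text {
    my ($data) = @_;

    # non-greedy match, so each <TEXT>...</TEXT> pair is captured separately
    my @texts = $data =~ m{<TEXT>(.*?)</TEXT>}gs;

    # remove any tags left inside the captured regions
    s/<[^>]+>//g for @texts;

    return join "\n", @texts;
}

my $sample = "<DOC><DOCNO> 1 </DOCNO><TEXT>first <P>story</TEXT></DOC>\n"
           . "<DOC><DOCNO> 2 </DOCNO><TEXT>second story</TEXT></DOC>\n";
print extract_text($sample), "\n";
```

Everything outside the captured regions (DOCNO, FILEID and so on) is simply never copied, so no separate step is needed to discard it.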
>
> Depending on where your sgml files came from -- and if you have the DTD
> that they are supposed to be based on -- it may be a good idea to
> validate the tagging first, using a standard sgml parser, like nsgmls;
> it's hard to create any kind of useful sgml filter when there are
> mistakes in the tagging.
>
> For that matter, it's probably easier/safer to write a filter that works
> on the output of an sgml parser, rather than the sgml file.
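
A rough sketch of such a filter follows, based on my understanding of the line-oriented output of nsgmls: "(NAME" starts an element, ")NAME" ends it, and "-..." carries character data with backslash escapes. The name esis_text is made up, TEXT is the element from the TIPSTER example, only the most common escapes are decoded, and element names are assumed to be reported in upper case.

```perl
use strict;
use warnings;

# Sketch only: esis_text() is a hypothetical helper.  It returns the
# character data found inside elements named $wanted.
sub esis_text {
    my ($wanted, @lines) = @_;
    my ($depth, $out) = (0, '');
    for my $line (@lines) {
        chomp $line;
        if    ($line eq "($wanted") { $depth++ }
        elsif ($line eq ")$wanted") { $depth-- if $depth > 0 }
        elsif ($depth > 0 and $line =~ /^-(.*)/s) {
            my $text = $1;
            # decode in one pass: \n is a record end, \\ a backslash,
            # \| delimits SDATA entity text, \ooo is an octal character code
            $text =~ s{\\(n|\\|\||[0-7]{3})}
                      { $1 eq 'n' ? "\n" : $1 eq '\\' ? '\\' : $1 eq '|' ? '' : chr oct $1 }ge;
            $out .= $text;
        }
    }
    return $out;
}

# sample parser output, written by hand for illustration
my @esis = ('(DOC', '(DOCNO', '- AP891231-0001 ', ')DOCNO',
            '(TEXT', '-Monkey Business, the yacht\nthat helped', ')TEXT', ')DOC');
print esis_text('TEXT', @esis), "\n";
```

In practice the lines would come from the parser rather than a hand-written list, e.g. by piping the output of nsgmls into the script and passing <STDIN> in place of @esis.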
>
> Best regards,
>
> 	Dave Graff
>
>
>


