Corpora: sgml detagger

Vlado Keselj vkeselj at cs.uwaterloo.ca
Tue Apr 16 20:21:47 UTC 2002


On Tue, 16 Apr 2002, Alexander S. Yeh wrote:

> The script below will work for most tags, but may fail in the following
> more complicated cases:
>
> 1. A tag is spread out over more than 1 line (usual cases: comment tags,
> tags with attribute/value pairs).
>
> 2. A tag has an attribute value that has a ">" in it.
>
> 3. A comment tag has a ">" embedded in it.
>
> I have encountered these in html files of journal articles gotten off
> the web. Thanks.
>
> -Alex Yeh

True.
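
To make the three cases concrete, here is a minimal sketch.  It assumes
the simple approach is a per-line substitution such as s/<[^>]*>//g; that
is only an assumption for illustration, since the script Alex refers to is
not reproduced in this message, and naive_detag is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive detagger (illustration only): removes anything that
# looks like <...> within a single line.
sub naive_detag {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;
    return $line;
}

# Case 1: a tag spread over two lines, processed one line at a time.
# The first line has no ">" so nothing is removed from it, and the
# second line keeps the tail of the tag.
my @case1 = (qq{<a href="x.html"\n}, qq{   title="t">link</a>\n});
print "case 1: ", join('', map { naive_detag($_) } @case1);

# Case 2: an attribute value containing ">".
# The match stops at the ">" inside the quotes, leaving: b">text
print "case 2: ", naive_detag(qq{<img alt="a > b">text\n});

# Case 3: a comment containing ">".
# The match stops at the ">" inside the comment, leaving: y -->text
print "case 3: ", naive_detag("<!-- if x > y -->text\n");
```

In all three cases some markup is left in the output, which is why the
tag, the quoted attribute value, and the comment each need their own
state.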

Actually, writing a correct and general SGML detagger would be a *very*
difficult task.  The actual document processing depends on a DTD, which
can define a very flexible syntax.  The difficulty of writing a general
SGML parser was one of the main reasons for coming up with XML.

However, removing comments and tags from an HTML, XML, or typical SGML
document should not be such a difficult task.  I just wrote a script to do
it, and it is appended below.  Please report any bugs that you find.

Note that it follows the strict rules for HTML (SGML) comments, which may
be counter-intuitive, and I would not bet that all browsers (not to
mention users) observe them.  The rules say that a comment is either <!>
or starts with <!--.  If it starts with <!--, the comment text ends at the
next --.  After that -- and possibly some whitespace, we can either close
the comment tag with > or start a new comment with --.
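
The rules above can also be written as a single regular expression.  This
is only a sketch for checking one complete comment held in a string; the
name is_strict_comment is made up for the example, and it is not how the
appended script works (the script reads line by line and keeps a state).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if $s is one complete comment under the strict rules:
# "<!", then zero or more "--text--" parts, each optionally followed by
# whitespace, then ">".  The text of a part may not contain "--".
sub is_strict_comment {
    my ($s) = @_;
    return $s =~ /\A<!(?:--(?:[^-]|-[^-])*--\s*)*>\z/ ? 1 : 0;
}

# The examples from the header of the script.
my @examples = (
    '<!>',
    '<!-- cm -->',
    '<!-- comment 1 ---- comment2 -- -- c3 -- >',
    '<!-- comment 1 -- ERR -->',   # text outside a comment part
    '<!-- comment 1 -- -->',       # second part is never closed
);

for my $c (@examples) {
    printf "%-45s %s\n", $c, is_strict_comment($c) ? 'valid' : 'invalid';
}
```

The first three examples are reported as valid and the last two as
invalid.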

Vlado

#!/usr/bin/perl
# 2002 Vlado Keselj <vkeselj at cs.uwaterloo.ca>
# Version: 0.1
# The newest version can be found at:
# http://vlado.keselj.net/srcperl/
#
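# Removes HTML tags and comments from the files named on the command line
# (or from standard input) and prints the remaining text.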
# Warning: Follows strict HTML syntax for comments (which may be
# counter-intuitive), e.g., valid comments are:
# <!> <!-- cm --> <!-- comment 1 ---- comment2 -- -- c3 -- >
# and invalid comments are:
# <!-- comment 1 -- ERR --> <!-- comment 1 -- --> NOT FINISHED

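# Parser state, kept across input lines so that tags and comments may
# span several lines: normal, comment, betweencomments, tag, or quote.
$state = 'normal';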

while (<>) {
    # Test the length, not the truth value: a remainder of "0" is false.
    while (length $_) {
	if ($state eq 'normal') {
	    if (/^([^<]*)<!>/)  { print $1; $_ = $'; }
	    elsif (/^([^<]*)<!--/) {
		print $1; $_ = $'; $state = 'comment';
	    }
	    elsif (/^([^<]*)</) {
		print $1; $_ = $'; $state = 'tag';
	    }
	    else { print; $_ = ''; }
	}
	elsif ($state eq 'comment') {
	    if (/--/) { $_ = $'; $state = 'betweencomments'; }
	    else { $_ = '' }
	}
	elsif ($state eq 'betweencomments') {
	    if (/^\s*>/) { $_ = $'; $state = 'normal' }
	    elsif (/^\s*--/) { $_= $'; $state = 'comment'; }
	    elsif (/^\s*$/) { $_ = '' }
	    else { die "IMPROPER HTML COMMENT" }
	}
	elsif ($state eq 'tag') {
	    if (/^[^>\"\']*([>\'\"])/) {
		$_ = $';
		if ($1 eq '>') { $state = 'normal' }
		else { $state = 'quote'; $quote = $1; }
	    }
	    else { $_ = '' }
	}
	elsif ($state eq 'quote') {
	    if (/$quote/) { $_ = $'; $state = 'tag' }
	    else { $_ = '' }
	}
	else { die "UNKNOWN STATE ($state)" }
    }
}
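
For trying the script on short strings instead of files, the same states
can be wrapped in a function.  This is a sketch only: detag_lines is a
made-up name, it collects the output in a string instead of printing, and
it has been checked only against the three cases from Alex's message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: runs the same state machine as the script above
# over a list of lines and returns the remaining text as one string.
sub detag_lines {
    my @lines = @_;
    my ($state, $quote, $out) = ('normal', '', '');
    for my $line (@lines) {
        my $rest = $line;
        # Test the length so that a remainder of "0" is still processed.
        while (length $rest) {
            if ($state eq 'normal') {
                if    ($rest =~ /^([^<]*)<!>/)  { $out .= $1; $rest = $'; }
                elsif ($rest =~ /^([^<]*)<!--/) { $out .= $1; $rest = $'; $state = 'comment'; }
                elsif ($rest =~ /^([^<]*)</)    { $out .= $1; $rest = $'; $state = 'tag'; }
                else                            { $out .= $rest; $rest = ''; }
            }
            elsif ($state eq 'comment') {
                if ($rest =~ /--/) { $rest = $'; $state = 'betweencomments'; }
                else               { $rest = ''; }
            }
            elsif ($state eq 'betweencomments') {
                if    ($rest =~ /^\s*>/)  { $rest = $'; $state = 'normal'; }
                elsif ($rest =~ /^\s*--/) { $rest = $'; $state = 'comment'; }
                elsif ($rest =~ /^\s*$/)  { $rest = ''; }
                else                      { die "IMPROPER HTML COMMENT\n"; }
            }
            elsif ($state eq 'tag') {
                if ($rest =~ /^[^>"']*([>'"])/) {
                    my ($found, $after) = ($1, $');
                    $rest = $after;
                    if ($found eq '>') { $state = 'normal'; }
                    else               { $state = 'quote'; $quote = $found; }
                }
                else { $rest = ''; }
            }
            elsif ($state eq 'quote') {
                if ($rest =~ /\Q$quote\E/) { $rest = $'; $state = 'tag'; }
                else                       { $rest = ''; }
            }
        }
    }
    return $out;
}

# 1. Tag spread over two lines.
print detag_lines(qq{<a href="x.html"\n}, qq{   title="t">link</a>\n});
# 2. Attribute value containing ">".
print detag_lines(qq{<img alt="a > b">text\n});
# 3. Comment containing ">".
print detag_lines("<!-- if x > y -->text\n");
```

Each of the three calls leaves only the text outside the markup.  Note
that a line break inside a tag is dropped together with the tag.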


