[Corpora-List] Summary: Custom tagging validator

Przemek Kaszubski przemka at amu.edu.pl
Mon Nov 21 20:52:28 UTC 2005


On 5 November I posted the following request:

"I'm looking for a flexible tool that would validate files tagged by my 
students. The tags follow the <tag>tagged_text</tag> convention but are 
not linked to any DTD, and entirely my own. I'd like to be able to test 
quickly if my students spelled the tag names correctly, closed the tags, 
applied the < and > symbols etc. The tagging scheme is simple (sth like 
10-12 tags in all), with no embedding or special properties."

Well, it turned out that what I really needed was a tool that checks 
mostly well-formedness and some validity, given our very simple tagging 
scheme with only two levels of nesting. As the student project I am 
coordinating expands, we may need to put some of the more robust 
validation tools in place. Meanwhile we have settled on the simple 
solution of combining a browser's (Firefox) XML parsing facility with 
simple file editing, as suggested by Rafał L. Górski and Mark P. Line.
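
The browser check can also be scripted. What follows is a minimal sketch 
in Python, not something any respondent supplied. It assumes the student 
files have no single root element, so the text is wrapped in a 
hypothetical <doc> element before it is handed to the standard library's 
XML parser; the names check_well_formed and doc are made up for 
illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import ErrorString


def check_well_formed(text, root="doc"):
    """Return None if text parses as XML once wrapped in a root element,
    otherwise a (line, column, message) tuple for the first error."""
    # The opening wrapper tag occupies a line of its own, so the parser's
    # line numbers are one higher than those in the student's file.
    wrapped = "<%s>\n%s\n</%s>" % (root, text, root)
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as err:
        line, column = err.position
        return (line - 1, column, ErrorString(err.code))
    return None


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as handle:
        problem = check_well_formed(handle.read())
    print("well-formed" if problem is None else "line %d, column %d: %s" % problem)
```

Like the browser, this stops at the first error, and a bare & or < in 
the running text counts as an error too. An unclosed tag is only noticed 
when the wrapper's closing tag is reached, so it is reported just past 
the end of the file.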

The other suggestions, which I may need to consider in the near future, 
are briefly reported below in chronological order. I have added short 
comments, labelled PK, for those looking for information on each tool's 
applicability to my immediate purpose.

Many thanks to all who replied!

-----------

Lou Burnard suggested any XML validator, such as xsltproc

PK: available with the Cygwin packages libxml2 and libxslt; however, it 
apparently assumes that the input is already well-formed

-----------

Kiril Simov suggested CLaRK system 
(http://www.bultreebank.org/clark/index.html)

PK: I thought this too complex for the task, especially for my 
less computer-savvy students

-----------

David Graff: provided me with a Perl script/filter, and generally 
advised developing a customized editing tool for students as a 
more reliable solution

PK: the Perl script supports neither nesting nor attributes, both of 
which I, sadly, added to the files after the original post:

#!/usr/bin/perl

# Simple script to check for certain error conditions involving
# strings enclosed within angle brackets:
#  - for each "<tag>", the next angle-bracketed string must be "</tag>"
#  - tag names are purely alphanumeric, with no attributes
#  - tags do not embed

# For a given input of tagged text, the output is a listing of tags found
# and their frequency of occurrence, along with any warnings about
# violations of the above conditions.

use strict;

die "Usage:  $0  tagged_file.txt\n" if ( @ARGV == 0 and -t STDIN );

my $text = do { local $/; <> };  # read entire file into $text
my @segs = split( m{(</?\w+>)}, $text );  # split into data and tags

my $linenum = 1;
my $expect = '';
my %taghist;

for ( @segs ) {
    if ( /^<(\w+)>$/ ) {  # this is an open-tag
        my $tag = uc $1;
        $taghist{"$tag Open"}++;
        if ( $expect ) {  # true if we're expecting a close-tag
            warn "found <$tag>, expecting </$expect> at line $linenum\n";
        }
        $expect = $tag;
    }
    elsif ( m{^</(\w+)>$} ) { # this is a close-tag
        my $tag = uc $1;
        $taghist{"$tag close"}++;
        if ( $tag ne $expect ) {  # this close tag is wrong
            my $wanted = ( $expect ) ? "</$expect>" : "an open tag";
            warn "found </$tag>, expecting $wanted at line $linenum\n";
        }
        $expect = '';
    }
    elsif ( 0 == tr/<>// ) { # text with no angle-brackets
        $linenum += tr/\n//;
    }
    else {  # angle bracket(s) that are not part of a valid tag
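        my @lines = split /\n/, $_, -1;  # -1 keeps trailing empty fields
        for my $i ( 0 .. $#lines ) {
            warn "bad angle bracket(s) at line $linenum\n" if ( $lines[$i] =~ /[<>]/ );
            $linenum++ if ( $i < $#lines );  # count newlines, not pieces
        }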
    }
}

printf( "%5d  %s\n", $taghist{$_}, $_ ) for ( sort keys %taghist );

__END__
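
Nesting and attributes could be handled by replacing the single $expect 
variable with a stack of open tags. A rough sketch of that idea in 
Python follows. It is not David Graff's code; the function name 
check_nesting, the two-level limit and the double-quoted attribute 
syntax are assumptions based on the scheme described above.

```python
import re

# Matches <tag>, </tag> and <tag attr="value">.
# Group 1 is "/" for a close tag, group 2 is the tag name.
TAG = re.compile(r'<(/?)(\w+)(?:\s+\w+\s*=\s*"[^"]*")*\s*>')


def check_nesting(text, max_depth=2):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    stack = []  # (tag name, line where it was opened)
    for match in TAG.finditer(text):
        line = text.count("\n", 0, match.start()) + 1
        closing = match.group(1)
        name = match.group(2).upper()  # case-insensitive, as in the Perl script
        if not closing:
            stack.append((name, line))
            if len(stack) > max_depth:
                problems.append(
                    (line, "<%s> nested deeper than %d levels" % (name, max_depth)))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            wanted = "</%s>" % stack[-1][0] if stack else "an open tag"
            problems.append((line, "found </%s>, expecting %s" % (name, wanted)))
    # Anything still on the stack was opened but never closed.
    for name, line in stack:
        problems.append((line, "<%s> is never closed" % name))
    return problems
```

A mismatched close tag leaves the stack untouched, so one slip can 
produce two messages (the mismatch itself and the tag left open). Stray 
angle brackets that do not form a tag are not reported here; the Perl 
script's last branch covers that case.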

-----------

Valentin Jijkoun: suggested xmllint (Linux) and xmlvalid (web-based, 
http://www.stg.brown.edu/service/xmlvalid/)

PK: the latter requires a DTD, which I wanted to avoid

----------

Ken Beesley: suggested RELAX NG with the Jing validator 
(http://www.thaiopensource.com/relaxng/jing.html), which requires a 
schema (http://relaxng.org/compact-tutorial-20030326.html), and provided 
a short tutorial on the system

PK: requires a schema, the RELAX NG counterpart of a DTD

----------

Ken Litkowski: kindly sent me his XML tools!

PK: they can do so much more...

-------

Jin-Dong: well-formedness: xmlwf; validity: xmllint (both available with 
Cygwin; open-source implementations are also available)

-----------

Mario Barcala: suggested rxp (Linux)

-------------

Rafał L. Górski: use IE or Mozilla, alternatively Altova XMLSpy (home 
edition free, http://www.altova.com/download_spy_home.html)

PK: the latter tool may be overkill, but is supposedly efficient

----------

Chinedu Uchechukwu (Bamberg): use Butterfly XML (open-source XML editor, 
written in Java)

PK: editor + parser, looks promising

-----------

Steven Bird: provided a Python script

PK: one supplies the list of permitted tags, and the script looks for 
errors (attributes are probably unsupported)

----snip----
# Simple Python script to check a text file containing embedded XML tags
# Errors detected:
# - unbalanced tags: <a>afd</a> lakjf<a>
# - mismatched tags: <a>lakf</b>
# - illegal tags:    <a>kafsd</a> lajf <x>lawq</x>

import sys, re

# check usage
if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0])
    sys.exit(1)

# read file into string
with open(sys.argv[1]) as f:
    text = f.read()

# the permissible tags, associated regexps
tags = ("a", "b")
legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
any_tag = re.compile(r"</?.*?>")

# get the sequence of legal tags, ignoring everything else
tag_seq = legal_tag.findall(text)

# check this sequence consists of paired begin-end tags
if len(tag_seq) % 2 != 0:
    print("Unbalanced tags")
    sys.exit(1)
# step through the sequence two tags at a time
for i in range(0, len(tag_seq), 2):
    begin, end = tag_seq[i], tag_seq[i+1]
    # "<a>"[1:] and "</a>"[2:] are both "a>" when the pair matches
    if begin[1:] != end[2:]:
        print("Mismatched tags: %s, %s" % (begin, end))
        sys.exit(1)

# remove all legal tags and report any others
residue = legal_tag.sub("", text)
tag_seq = any_tag.findall(residue)
if tag_seq:
    print("Illegal tags:", " ".join(tag_seq))
    sys.exit(1)

print("Correct use of tags:", " ".join(tags))
----snip----
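
Since the original request was largely about misspelled tag names, the 
"illegal tags" step could also suggest the intended name. The following 
is a small sketch along those lines, going beyond Steven Bird's script: 
the tag set shown is hypothetical, the function name unknown_tags is 
made up, and the regexp simply skips over any attributes.

```python
import difflib
import re

# Hypothetical tag set, for illustration only.
ALLOWED = ("noun", "verb", "adj")

# Tag name in group 1; attributes, if any, are skipped over.
TAG_NAME = re.compile(r"</?\s*(\w+)[^<>]*>")


def unknown_tags(text, allowed=ALLOWED):
    """Map each tag name not in `allowed` to a list holding its closest
    allowed name (an empty list if nothing is similar enough)."""
    report = {}
    for name in TAG_NAME.findall(text):
        if name not in allowed and name not in report:
            report[name] = difflib.get_close_matches(name, allowed, n=1)
    return report
```

Names are compared case-sensitively here; lowercasing both sides first 
would make the check as forgiving as the Perl script above.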




-- 
Dr Przemyslaw Kaszubski
+48 61 8293515
http://elex.amu.edu.pl/ifa/staff/kaszubski.html

PICLE LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html

COMPREHENSIVE CORPORA BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka

MY SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm

ACADEMIC WRITING PAGE (FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing

=======================================
School of English (IFA)
Adam Mickiewicz University
http://elex.amu.edu.pl/ifa
=======================================


