[Corpora-List] Custom tagging validator

Steven Bird sb at csse.unimelb.edu.au
Tue Nov 8 11:16:34 UTC 2005


Here's a simple Python script to quickly check that a file contains
balanced, matched XML-style tags taken from a fixed set.  It avoids
the need for a DTD and an artificial root element.  For more on Python
for NLP, see nltk.sourceforge.net.  -Steven Bird

----snip----
# Simple Python script to check a text file containing embedded XML tags
# Errors detected:
# - unbalanced tags: <a>afd</a> lakjf<a>
# - mismatched tags: <a>lakf</b>
# - illegal tags:    <a>kafsd</a> lajf <x>lawq</x>

import sys, re

# check usage
if len(sys.argv) != 2:
    print "Usage: %s filename" % sys.argv[0]
    sys.exit(1)

# read file into string
text = open(sys.argv[1]).read()

# the permissible tags, associated regexps
tags = ("a", "b")
legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
any_tag = re.compile(r"</?.*?>")

# get the sequence of legal tags, ignoring everything else
tag_seq = legal_tag.findall(text)

# check this sequence consists of paired begin-end tags
if len(tag_seq) % 2 != 0:
    print "Unbalanced tags"
    sys.exit(1)
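# step through the sequence two tags at a time (start, stop, step)
for i in range(0, len(tag_seq), 2):
    begin, end = tag_seq[i], tag_seq[i+1]
    # "<a>"[1:] and "</a>"[2:] are both "a>" when the pair matches
    if begin[1:] != end[2:]:
        print "Mismatched tags: %s, %s" % (begin, end)
        sys.exit(1)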

# remove all legal tags and report any others
residue = legal_tag.sub("", text)
tag_seq = any_tag.findall(residue)
if tag_seq:
    print "Illegal tags:", " ".join(tag_seq)
    sys.exit(1)

print "Correct use of tags:", " ".join(tags)
----snip----
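
The same checks could also be sketched as a reusable function for Python 3.
The function name check_tags and its list-of-messages return value are
illustrative assumptions, not part of the script above. Unlike the script,
this sketch collects every error instead of exiting at the first one. It
still assumes no nesting, and it does not catch a stray "<" that has no
closing ">".

```python
import re


def check_tags(text, tags=("a", "b")):
    """Return a list of error messages; an empty list means no errors found."""
    # Same two patterns as the script: the fixed tag set, and anything tag-like.
    legal_tag = re.compile(r"</?(?:%s)>" % "|".join(map(re.escape, tags)))
    any_tag = re.compile(r"</?.*?>")
    errors = []

    # Sequence of legal tags, ignoring everything else.
    tag_seq = legal_tag.findall(text)
    if len(tag_seq) % 2 != 0:
        errors.append("Unbalanced tags")

    # Walk the tags two at a time; zip stops at the last complete pair.
    for begin, end in zip(tag_seq[0::2], tag_seq[1::2]):
        # The first of each pair must be an opening tag, and the second
        # must be its closing form.
        if begin.startswith("</") or end != "</" + begin[1:]:
            errors.append("Mismatched tags: %s, %s" % (begin, end))

    # Remove all legal tags and report whatever tag-like text remains.
    residue = legal_tag.sub("", text)
    illegal = any_tag.findall(residue)
    if illegal:
        errors.append("Illegal tags: " + " ".join(illegal))
    return errors


if __name__ == "__main__":
    # The three error cases listed in the script's header comments.
    for sample in ("<a>afd</a> lakjf<a>",
                   "<a>lakf</b>",
                   "<a>kafsd</a> lajf <x>lawq</x>"):
        print(sample, "->", check_tags(sample))
```

For the 10-12 tags mentioned in the original question, the tag set would be
passed in explicitly, e.g. check_tags(text, tags=("np", "vp")), where the
tag names here are placeholders.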


On 11/8/05, neduchi at netscape.net <neduchi at netscape.net> wrote:
> Hallo,
> Please have a look at this free XML editor: http://www.butterflyxml.org/
>
> Maybe it will be of help.
>
> Chinedu Uchechukwu
> Otto-Friedrich-Universität, Bamberg
>
>
> -----Original Message-----
> From: Przemek Kaszubski <przemka at amu.edu.pl>
> To: CORPORA at uib.no
> Sent: Sat, 05 Nov 2005 17:32:21 +0100
> Subject: [Corpora-List] Custom tagging validator
>
>   Dear Members,
>
>   I'm looking for a flexible tool that would validate files tagged by my
> students. The tags follow the <tag>tagged_text</tag> convention but are
> not linked to any DTD, and entirely my own. I'd like to be able to test
> quickly if my students spelled the tag names correctly, closed the
> tags, applied the < and > symbols etc. The tagging scheme is simple
> (something like 10-12 tags in all), with no embedding or special properties.
>
>   Does anyone know of a tool or script of this kind, or perhaps
> developed one?
>
>  Thank you for any help,
>
>  Przemek
>
>  -- Dr Przemyslaw Kaszubski
>  +48 61 8293515
>  http://elex.amu.edu.pl/ifa/staff/kaszubski.html
>
>  PICLE LEARNER CORPUS ONLINE:
>  http://www.staff.amu.edu.pl/~przemka/picle.html
>
>  COMPREHENSIVE CORPORA BIBLIOGRAPHY:
>  http://www.staff.amu.edu.pl/~przemka
>
>  MY SEMINARS:
>  http://www.staff.amu.edu.pl/~przemka/seminars.htm
>
>  ACADEMIC WRITING PAGE (FULL-TIME PROGRAMME):
>  http://www.staff.amu.edu.pl/~przemka/IFA_writing
>
>  =======================================
>  School of English (IFA)
>  Adam Mickiewicz University
>  http://elex.amu.edu.pl/ifa
>  =======================================
>


