[Corpora-List] Custom tagging validator
Steven Bird
sb at csse.unimelb.edu.au
Tue Nov 8 11:16:34 UTC 2005
Here's a simple Python script to quickly check that a file contains
balanced, matched XML-style tags taken from a fixed set. It avoids
the need for a DTD and an artificial root element. For more on Python
for NLP, see nltk.sourceforge.net. -Steven Bird
----snip----
# Simple Python script to check a text file containing embedded XML tags
# Errors detected:
# - unbalanced tags: <a>afd</a> lakjf<a>
# - mismatched tags: <a>lakf</b>
# - illegal tags: <a>kafsd</a> lajf <x>lawq</x>
import sys, re

# check usage
if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0])
    sys.exit(1)

# read file into string
text = open(sys.argv[1]).read()

# the permissible tags, associated regexps
tags = ("a", "b")
legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
any_tag = re.compile(r"</?.*?>")

# get the sequence of legal tags, ignoring everything else
tag_seq = legal_tag.findall(text)

# check this sequence consists of paired begin-end tags
if len(tag_seq) % 2 != 0:
    print("Unbalanced tags")
    sys.exit(1)
# step through the sequence two tags at a time: (begin, end), (begin, end), ...
for i in range(0, len(tag_seq), 2):
    begin, end = tag_seq[i], tag_seq[i+1]
    # "<a>"[1:] and "</a>"[2:] are both "a>" when the pair matches
    if begin[1:] != end[2:]:
        print("Mismatched tags: %s, %s" % (begin, end))
        sys.exit(1)

# remove all legal tags and report any others
residue = legal_tag.sub("", text)
tag_seq = any_tag.findall(residue)
if tag_seq:
    print("Illegal tags: " + " ".join(tag_seq))
    sys.exit(1)

print("Correct use of tags: " + " ".join(tags))
----snip----
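
The same three checks can also be packaged as a function that collects every problem, with its line number, instead of stopping at the first one. What follows is only a rough sketch under the same assumption as the script above (tags are never nested). The name check_tags, the (line, message) return format, and the message wording are illustrative choices, not part of the original script.

```python
import re

def check_tags(text, tags):
    """Return a list of (line_number, message) problems; empty if none found.

    Assumes tags are never nested, as in the script above.
    """
    legal = set(tags)
    # anything that looks like <name> or </name>, whether legal or not
    tag_re = re.compile(r"<(/?)([^<>]*)>")
    problems = []
    open_tag = None  # (name, line) of the tag currently open, if any
    for m in tag_re.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2)
        # 1-based line number of the start of this tag
        line = text.count("\n", 0, m.start()) + 1
        if name not in legal:
            problems.append((line, "illegal tag: %s" % m.group(0)))
        elif not closing:
            if open_tag is not None:
                problems.append((line, "<%s> opened before <%s> from line %d was closed"
                                 % (name, open_tag[0], open_tag[1])))
            open_tag = (name, line)
        elif open_tag is None:
            problems.append((line, "</%s> has no matching opening tag" % name))
        else:
            if open_tag[0] != name:
                problems.append((line, "mismatched tags: <%s>, </%s>"
                                 % (open_tag[0], name)))
            open_tag = None
    if open_tag is not None:
        problems.append((open_tag[1], "<%s> is never closed" % open_tag[0]))
    return problems

if __name__ == "__main__":
    sample = "ok\n<a>lakf</b>\n<x>lawq</x>"
    for line, message in check_tags(sample, ("a", "b")):
        print("line %d: %s" % (line, message))
```

Like the any_tag pattern in the script, this treats any "<...>" stretch that is not a legal tag as an illegal tag, so stray < and > characters in running text are reported too.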
On 11/8/05, neduchi at netscape.net <neduchi at netscape.net> wrote:
> Hallo,
> Please have a look at this free XML editor: http://www.butterflyxml.org/
>
> Maybe it will be of help.
>
> Chinedu Uchechukwu
> Otto-Friedrich-Universität, Bamberg
>
>
> -----Original Message-----
> From: Przemek Kaszubski <przemka at amu.edu.pl>
> To: CORPORA at uib.no
> Sent: Sat, 05 Nov 2005 17:32:21 +0100
> Subject: [Corpora-List] Custom tagging validator
>
> Dear Members,
>
> I'm looking for a flexible tool that would validate files tagged by my
> students. The tags follow the <tag>tagged_text</tag> convention but are
> not linked to any DTD and are entirely my own. I'd like to be able to test
> quickly whether my students spelled the tag names correctly, closed the
> tags, applied the < and > symbols, etc. The tagging scheme is simple
> (something like 10-12 tags in all), with no embedding or special properties.
>
> Does anyone know of a tool or script of this kind, or perhaps
> developed one?
>
> Thank you for any help,
>
> Przemek
>
> -- Dr Przemyslaw Kaszubski
> +48 61 8293515
> http://elex.amu.edu.pl/ifa/staff/kaszubski.html
>
> PICLE LEARNER CORPUS ONLINE:
> http://www.staff.amu.edu.pl/~przemka/picle.html
>
> COMPREHENSIVE CORPORA BIBLIOGRAPHY:
> http://www.staff.amu.edu.pl/~przemka
>
> MY SEMINARS:
> http://www.staff.amu.edu.pl/~przemka/seminars.htm
>
> ACADEMIC WRITING PAGE (FULL-TIME PROGRAMME):
> http://www.staff.amu.edu.pl/~przemka/IFA_writing
>
> =======================================
> School of English (IFA)
> Adam Mickiewicz University
> http://elex.amu.edu.pl/ifa
> =======================================
>