[Corpora-List] XML parsers vs regex

Yannick Versley versley at cl.uni-heidelberg.de
Tue Jul 1 07:11:06 UTC 2014


>
> Sure. No doubt that if you're trying to write a general purpose script to
> extract whatever from arbitrary XML, you should, *no doubt*, use a DOM or
> SAX parser and properly treat the XML source as an XML document.
>
With ElementTree's iterparse (for Python) and StAX (for Java), I don't think
anyone should be forced, or even feel forced, to use SAX.
Both are also far superior to regexps when it comes to medium-complexity
tasks that go beyond Perl one-liners.
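
To make the iterparse point concrete, here is a minimal sketch of pull-style
processing in Python. The <corpus>/<s>/<w> layout and the iter_sentences
helper are made up for illustration; only xml.etree.ElementTree.iterparse
itself comes from the standard library.

```python
import io
import xml.etree.ElementTree as ET


def iter_sentences(source):
    """Yield (id, tokens) for each <s> element without keeping the whole tree.

    The <s>/<w> element names are a hypothetical corpus layout.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "s":
            # At the "end" event the element and all its children are complete.
            tokens = [w.text or "" for w in elem.iter("w")]
            yield elem.get("id"), tokens
            # Drop the children we have already processed to keep memory low.
            elem.clear()


SAMPLE = b"""<corpus>
  <s id="s1"><w>Hello</w><w>world</w></s>
  <s id="s2"><w>Regexps</w><w>are</w><w>fragile</w></s>
</corpus>"""

sentences = list(iter_sentences(io.BytesIO(SAMPLE)))
```

For a real corpus you would pass a file name or an open binary file instead
of the BytesIO object; the loop itself stays the same.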

Even better, the humongous multi-hundred-megabyte XML blobs that used to be
unprocessable now easily fit into memory as a multi-gigabyte DOM tree (for
Java, where the overhead used to be the greatest), so you're perfectly fine
in cases where you need to jump back and forth in the structure.

To amend someone else's response: writing XML by using print statements is
actually OK as long as you remember to
use xml.sax.saxutils' escape (or any equivalent) and are clear about
encodings.
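
As a rough sketch of what that looks like in practice (the <s>/<w> elements,
the pos attribute and the write_tokens helper are invented for the example;
escape and quoteattr are the real functions from xml.sax.saxutils):

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def write_tokens(out, tokens):
    """Write (form, pos) pairs as XML using plain string output."""
    # The declared encoding must match how `out` is actually encoded,
    # e.g. open(path, "w", encoding="utf-8") for a real file.
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write("<s>\n")
    for form, pos in tokens:
        # escape() handles &, < and > in text content;
        # quoteattr() escapes the value and adds the surrounding quotes.
        out.write("  <w pos=%s>%s</w>\n" % (quoteattr(pos), escape(form)))
    out.write("</s>\n")


buf = io.StringIO()
write_tokens(buf, [("AT&T", "NE"), ("<3", "SYM")])
xml_text = buf.getvalue()

# Sanity check: a real parser reads the result back unchanged.
root = ET.fromstring(xml_text.encode("utf-8"))
```

The round trip through a parser at the end is a cheap way to catch a
forgotten escape before anyone else has to read the file.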

@Angus: MUC's SGML and other original-text-with-tags file formats are a
pain to process with XML tools because many XML tools
assume that you can throw out the whitespace around the tags. The only
solution that I found to work in that case is writing your own
parser for that file format, because that's what the people who defined
this tastes-like-XML format also used.
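
A hand-written parser of that kind can be quite small. The following is only
a sketch under my own assumptions, not the MUC tools' actual logic: it
assumes properly nested, explicitly closed tags and no entity references,
and the COREF example string is made up. What matters is that every
whitespace character of the original text survives and that the annotations
come out as character offsets.

```python
import re

# Matches an opening or closing tag; attributes are kept as a raw string.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*)>")


def parse_inline_markup(raw):
    """Return (text, spans) for text with inline tags.

    text has the tags removed but the whitespace untouched; spans is a list
    of (tag, attr_string, start, end) character offsets into text.
    """
    text_parts, spans, stack = [], [], []
    pos = 0      # read position in raw
    offset = 0   # write position in the tag-free text
    for m in TAG.finditer(raw):
        chunk = raw[pos:m.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        closing, name, attrs = m.groups()
        if closing:
            if not stack or stack[-1][0] != name:
                raise ValueError("unexpected </%s>" % name)
            _, open_attrs, start = stack.pop()
            spans.append((name, open_attrs.strip(), start, offset))
        else:
            stack.append((name, attrs, offset))
        pos = m.end()
    text_parts.append(raw[pos:])
    if stack:
        raise ValueError("unclosed <%s>" % stack[-1][0])
    return "".join(text_parts), spans


raw = 'The  <COREF ID="1">old  man</COREF>\nsaw <COREF ID="2" REF="1">himself</COREF> .'
text, spans = parse_inline_markup(raw)
```

Because the offsets point into the untouched text, you can tokenize or
re-annotate the text separately and still map everything back to the
original file.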

Cheers,
Yannick

-- 
Dr. Yannick Versley

Institut für Computerlinguistik
Universität Heidelberg
Im Neuenheimer Feld 325
69120 Heidelberg

Tel.: +49-6221 54 3591