<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Sure. No doubt that if you're trying to write a general purpose script to extract whatever from arbitrary XML, you should, *no doubt*, use a DOM or SAX parser and properly treat the XML source as an XML document.<br></blockquote>
<div>With elementtree.iterparse (for Python) and StAX, I don't think anyone should be forced, or even feel forced, to use SAX.</div><div>They're also both much superior to regexps when it comes to medium-complexity tasks that go beyond Perl one-liners.</div>
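<div><br></div><div>To make the iterparse point concrete, here is a minimal sketch. It assumes the standard-library spelling xml.etree.ElementTree.iterparse and a made-up document whose sentences live in &lt;s&gt; elements; the function name iter_sentences and the tag name are only for illustration.</div>

```python
import io
import xml.etree.ElementTree as ET


def iter_sentences(source, tag="s"):
    """Stream over `source` and yield the text of every <tag> element.

    `tag` is a hypothetical element name; adjust it to the real schema.
    """
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # itertext() also collects text from nested inline markup.
            yield "".join(elem.itertext())
            # Drop the element's children and text to keep memory use low.
            elem.clear()


# Toy input; a real call would pass a file name or an open binary file.
sample = io.BytesIO(b"<doc><s>Hello <b>world</b></s><s>Bye</s></doc>")
sentences = list(iter_sentences(sample))
```

<div>The loop reads like ordinary tree code, with none of the handler-class bookkeeping that SAX requires, and the elem.clear() call is what keeps it usable on very large files.</div>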
<div><br></div><div>Even better, yesterday's humongous multi-hundred-megabyte XML blobs that used to be unprocessable</div><div>now easily fit into a multi-gigabyte DOM tree (for Java, where the overhead used to be the greatest), so you're perfectly fine</div>
<div>in cases where you need to jump back and forth in the structure.</div><div><br></div><div>To amend someone else's response: writing XML by using print statements is actually ok as long as you remember to</div><div>
use xml.sax.saxutils.escape (or any equivalent) and are clear about which encoding you are writing.</div><div><br></div><div>@Angus: MUC's SGML and other original-text-with-tags file formats are a pain to process with XML tools because many XML tools</div>
<div>assume that you can throw out the whitespace around the tags. The only solution that I found to work in that case is writing your own</div><div>parser for that file format, because that's what the people who defined this tastes-like-XML format also used.</div>
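<div><br></div><div>A rough sketch of what such a hand-written parser can look like. It assumes a hypothetical format in which tags are simple &lt;NAME&gt; / &lt;/NAME&gt; markers and the text itself never contains a complete &lt;...&gt; sequence; the real MUC files may well need more care. The names tokenize and strip_tags are made up for this example.</div>

```python
import re

# A tag is anything between "<" and the next ">"; the capturing group makes
# re.split() keep the tags in its output instead of discarding them.
TAG = re.compile(r"(<[^>]+>)")


def tokenize(raw):
    """Split `raw` into (kind, chunk) pairs without touching whitespace.

    Concatenating all chunks gives back `raw` exactly.
    """
    pieces = []
    for chunk in TAG.split(raw):
        if not chunk:
            continue  # split() yields empty strings next to adjacent tags
        kind = "tag" if TAG.fullmatch(chunk) else "text"
        pieces.append((kind, chunk))
    return pieces


def strip_tags(pieces):
    """Return the untagged text plus (tag, offset) pairs pointing into it."""
    text, marks, pos = [], [], 0
    for kind, chunk in pieces:
        if kind == "tag":
            marks.append((chunk, pos))
        else:
            text.append(chunk)
            pos += len(chunk)
    return "".join(text), marks


raw = "<S> The  cat\n</S> sat."
pieces = tokenize(raw)
plain, marks = strip_tags(pieces)
```

<div>Since no whitespace is ever dropped, the offsets in marks stay valid against the original text, which is the property that whitespace-normalizing XML tools tend to break.</div>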
<div> </div><div>Cheers,</div><div>Yannick</div><div><br></div><div>-- <br><div dir="ltr">Dr. Yannick Versley<div><br><div>Institut für Computerlinguistik</div></div><div>Universität Heidelberg</div><div>Im Neuenheimer Feld 325</div>
<div>69120 Heidelberg</div><div><br></div><div>Tel.: +49-6221 54 3591</div></div></div></div></div></div>