<div dir="ltr">Regex can be useful as a fallback if you are processing XML from an unknown source or of dubious quality and you just want to extract a few fields. <div><br></div><div>E.g. if you're pulling in TEI XML that fails to parse, you could catch the parse exception and then fall back to a regex to extract a title, author, abstract, etc.</div>
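<div><br></div><div>A minimal sketch of that fallback in Python, using only the standard library. The function name <code>extract_fields</code>, the <code>FALLBACK_PATTERNS</code> table and the choice of fields are hypothetical, picked for illustration; the patterns assume unprefixed element names and take the first match only, so treat them as a starting point rather than a robust extractor.</div>

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; documents using another namespace need a different map.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fallback patterns: first occurrence of each element, non-greedy.
# They do not handle namespace prefixes or nested elements of the same name.
FALLBACK_PATTERNS = {
    "title": re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL),
    "author": re.compile(r"<author\b[^>]*>(.*?)</author>", re.DOTALL),
    "abstract": re.compile(r"<abstract\b[^>]*>(.*?)</abstract>", re.DOTALL),
}

TAG = re.compile(r"<[^>]+>")


def _clean(text):
    """Drop any inline markup and collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", text).split())


def extract_fields(xml_text):
    """Return (fields, method), where method is 'parser' or 'regex'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The document is not well-formed: fall back to regex extraction.
        fields = {}
        for name, pattern in FALLBACK_PATTERNS.items():
            match = pattern.search(xml_text)
            fields[name] = _clean(match.group(1)) if match else None
        return fields, "regex"

    # The document parsed: use the tree, which respects nesting and namespaces.
    fields = {}
    for name in FALLBACK_PATTERNS:
        element = root.find(f".//tei:{name}", TEI_NS)
        fields[name] = _clean("".join(element.itertext())) if element is not None else None
    return fields, "parser"
```

<div>Returning the method alongside the fields makes it easy to log or review the records that only the regex path could handle.</div>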
<div><br></div><div>Phil</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jun 30, 2014 at 5:44 PM, Piotr Bański <span dir="ltr"><<a href="mailto:bansp@o2.pl" target="_blank">bansp@o2.pl</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Matías,<br>
<br>
This topic sometimes briefly surfaces on the xml-dev list, before it<br>
goes up in flames. You might want to check the archives at:<br>
<br>
<a href="http://lists.xml.org/archives/xml-dev/" target="_blank">http://lists.xml.org/archives/xml-dev/</a><br>
<br>
In most cases, the reply, naturally, begins with "it depends": for<br>
trivial cases addressing simple embedded markup, why not regex, but for<br>
more complex cases, you may want to consider both the<br>
complexity of the source (see Mike Maxwell's reply for starters) and the<br>
complexity of what you want to retrieve, and then please do not forget<br>
to think about making your queries portable and verifiable/readable for<br>
others, especially those of us who aren't regex-geeks.<br>
<br>
Best regards,<br>
<br>
Piotr<br>
<br>
On 30/06/14 13:55, Matías Guzmán Naranjo wrote:<br>
> Dear all,<br>
><br>
> When working with xml tagged corpora I have always used regex to extract<br>
> the information I need, I have never used xml parsers like nltk's or any<br>
> other. Is there an advantage to using parsers vs using regex? Which?<br>
> what do you personally use?<br>
><br>
> Best,<br>
><br>
> Matías<br>
><br>
><br>
> _______________________________________________<br>
> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
> Corpora mailing list<br>
> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
><br>
</blockquote></div><br></div>