<div dir="ltr">Regex can be useful as a fallback if you are processing XML from an unknown source or of dubious quality and you just want to extract a few fields. <div><br></div><div>E.g. if you're pulling in TEI XML that fails to parse, maybe catch the exception and then fall back to a regex to extract a title, author, abstract etc.</div>
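<div><br></div><div>A rough sketch of that fallback in Python, using only the standard library. The names extract_title, TITLE_RE and TEI_NS are made up for illustration, and it assumes that well-formed input uses the standard TEI namespace. The regex just grabs the first &lt;title&gt; element it sees, so treat it as a last resort rather than a substitute for parsing.</div>

```python
import re
import xml.etree.ElementTree as ET

# Assumption: well-formed documents declare the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Crude fallback: the first <title ...>...</title>, whatever its attributes.
# The \b keeps it from matching elements such as <titleStmt>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_title(xml_text):
    """Return the first title, via the parser if possible, else via regex."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The document is not well-formed: fall back to the regex
        # and strip any markup nested inside the title.
        match = TITLE_RE.search(xml_text)
        if match is None:
            return None
        return re.sub(r"<[^>]+>", "", match.group(1)).strip()
    node = root.find(".//tei:title", TEI_NS)
    if node is None:
        return None
    # itertext() also collects text inside child elements such as <hi>.
    return "".join(node.itertext()).strip()
```

<div>The same pattern would extend to author, abstract and so on, with one XPath expression and one fallback regex per field.</div>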


<div><br></div><div>Phil</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jun 30, 2014 at 5:44 PM, Piotr Bański <span dir="ltr">&lt;<a href="mailto:bansp@o2.pl" target="_blank">bansp@o2.pl</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Matías,<br>

<br>

This topic sometimes briefly surfaces on the xml-dev list, before it<br>

goes down in flames. You might want to check the archives at:<br>

<br>

<a href="http://lists.xml.org/archives/xml-dev/" target="_blank">http://lists.xml.org/archives/xml-dev/</a><br>

<br>

In most cases, the reply, naturally, begins with "it depends": for<br>

trivial cases involving simple embedded markup, why not regex? For<br>

more complex cases, you may want to consider both the<br>

complexity of the source (see Mike Maxwell's reply for starters) and the<br>

complexity of what you want to retrieve. And please do not forget<br>

to think about making your queries portable and verifiable/readable for<br>

others, especially those of us who aren't regex geeks.<br>

<br>

Best regards,<br>

<br>

  Piotr<br>

<br>

On 30/06/14 13:55, Matías Guzmán Naranjo wrote:<br>

> Dear all,<br>

><br>

> When working with XML-tagged corpora I have always used regex to extract<br>

> the information I need; I have never used XML parsers like NLTK's or any<br>

> other. Is there an advantage to using parsers over regex? If so, which?<br>

> What do you personally use?<br>

><br>

> Best,<br>

><br>

> Matías<br>

><br>

><br>

> _______________________________________________<br>

> UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

> Corpora mailing list<br>

> <a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

> <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

><br>

<br>

<br>


</blockquote></div><br></div>