[Corpora-List] XML parsers vs regex

Mark A. Greenwood m.greenwood at dcs.shef.ac.uk
Mon Jun 30 16:10:40 UTC 2014


On 30/06/14 17:02, Milos Jakubicek wrote:
> 2014-06-30 16:13 GMT+02:00 Darren Cook <darren at dcook.org>:
>
>> But apart from that, using regexes is normally fine for corpora,
> Exactly. Though XML-aware tools (like XPath) look like "the right
> thing", you should avoid them as far as you possibly can. A regexp
> will always be faster, simpler, and easier for others to understand.
I'd strongly disagree with this.

Using regexps is fine when you know that your data is all from the same
source with *exactly* the same formatting, but as soon as you have
multiple people/groups/tools producing data then, even if they are
supposed to be producing identical output, differences will creep in and
the regexps will fail. Someone only has to add an attribute to a tag to
break some of these regex examples. Treating XML as "strings with angle
brackets" (for generation or parsing) is never a good idea. If the file
is supposed to be XML, use an XML parser; it will be easier/faster in the
long run.
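
To make the "add an attribute" point concrete, here is a minimal sketch
in Python. The <s>/<w> markup, the pos and lemma attributes, and the
function names are all made up for illustration; they are not taken from
any particular corpus or from the regexes posted earlier in the thread.
Both documents are well-formed XML carrying the same tokens, but the
second one has an extra attribute and uses single quotes.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical producers that are "supposed" to emit identical markup.
DOC_A = '<s><w pos="NN">corpus</w><w pos="VBZ">grows</w></s>'
# Same content, but with an added attribute and single-quoted values.
DOC_B = '<s><w pos="NN" lemma="corpus">corpus</w><w pos=\'VBZ\'>grows</w></s>'

# A typical "strings with angle brackets" pattern, written against DOC_A.
NAIVE_W = re.compile(r'<w pos="([^"]*)">([^<]*)</w>')


def tokens_regex(xml_text):
    """Extract (word, pos) pairs with the naive regex."""
    return [(m.group(2), m.group(1)) for m in NAIVE_W.finditer(xml_text)]


def tokens_parser(xml_text):
    """Extract (word, pos) pairs with a real XML parser."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


if __name__ == "__main__":
    for name, doc in (("A", DOC_A), ("B", DOC_B)):
        print(name, "regex: ", tokens_regex(doc))
        print(name, "parser:", tokens_parser(doc))
```

On DOC_A both functions agree. On DOC_B the regex silently returns
nothing, while the parser still returns both tokens. You could of course
patch the regex for each new variation, but that is exactly the
maintenance burden I mean.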

Mark

>
> Note that the W3C is developing a somewhat less known set of tools
> called HTML-XML-Utils (http://www.w3.org/Tools/HTML-XML-utils/) which
> includes a number of utilities that make processing XML files with
> line-based unix programs (grep, sed, awk, ...) a lot easier, e.g.
> hxpipe and hxnormalize. I would certainly recommend having a look at
> these.
>
> Best,
> Milos
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora




