[Corpora-List] XML parsers vs regex

Milos Jakubicek jak at fi.muni.cz
Mon Jun 30 16:02:56 UTC 2014


2014-06-30 16:13 GMT+02:00 Darren Cook <darren at dcook.org>:

> But apart from that, using regexes is normally fine for corpora,

Exactly. Though XML-aware tools (like XPath) look like "the right
thing", you should avoid them as far as you can. For corpus data, a
regexp will almost always be faster, simpler, and easier for others
to understand.
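
To illustrate the point, here is a minimal sketch in Python. It assumes a
hypothetical corpus with one token per line, marked up as
<w pos="...">word</w>; the tag name "w", the attribute "pos" and the helper
name tokens() are made up for this example and do not come from any
particular corpus format.

```python
import re

# Assumed markup: one <w pos="...">word</w> element per line.
# The tag name "w" and the attribute "pos" are hypothetical.
TOKEN_RE = re.compile(r'<w pos="([^"]*)">([^<]*)</w>')

def tokens(lines):
    """Yield (word, pos) pairs from every line that contains a <w> element."""
    for line in lines:
        for pos, word in TOKEN_RE.findall(line):
            yield word, pos

sample = [
    '<s id="1">',
    '<w pos="DT">The</w>',
    '<w pos="NN">corpus</w>',
    '</s>',
]
result = list(tokens(sample))
print(result)
```

This only works because the assumed markup is completely regular (fixed
attribute order, no nesting inside <w>), which is typically the case for
machine-generated corpus files.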

Note that the W3C is developing a lesser-known set of tools called
HTML-XML-utils (http://www.w3.org/Tools/HTML-XML-utils/), which
includes a number of utilities that make processing XML files with
line-based Unix programs (grep, sed, awk, ...) a lot easier, e.g.
hxpipe and hxnormalize. I would certainly recommend having a look at
these.
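
As a rough sketch of why a line-based representation helps: the Python
below assumes the input has already been converted to one event per line,
with a leading "(" for an element start, ")" for an element end and "-" for
text. That is my understanding of the general shape of hxpipe output, but
please check the hxpipe man page for the exact format. The sample lines are
written by hand for illustration, not captured from the tool, and the
function name text_by_element() is made up.

```python
def text_by_element(lines):
    """Collect text chunks, keyed by the name of the innermost open element.

    Assumed line format (check the hxpipe documentation):
      "(name" starts an element, ")name" ends it, "-text" is character data.
    Lines of any other type are ignored.
    """
    stack = []   # names of the currently open elements
    out = {}     # element name -> list of text chunks
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(":
            stack.append(rest)
        elif kind == ")":
            if stack:
                stack.pop()
        elif kind == "-" and stack:
            out.setdefault(stack[-1], []).append(rest)
    return out

# Hand-written sample in the assumed format, not real tool output.
sample = ["(s", "(w", "-The", ")w", "(w", "-corpus", ")w", ")s"]
by_element = text_by_element(sample)
print(by_element)
```

Since every event sits on its own line, the same job is just as easy with
grep or awk.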

Best,
Milos
