[Corpora-List] XML parsers vs regex
Milos Jakubicek
jak at fi.muni.cz
Mon Jun 30 16:02:56 UTC 2014
2014-06-30 16:13 GMT+02:00 Darren Cook <darren at dcook.org>:
> But apart from that, using regexes is normally fine for corpora,
Exactly. Though XML-aware tools (like XPath) look like "the right
thing", you should avoid them wherever you can. A regexp will always
be faster, simpler, and easier for others to understand.
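
To illustrate the regexp approach, here is a minimal sketch in Python.
It assumes a machine-generated corpus with one <w> element per line and
a single, uniformly quoted "pos" attribute; the sample data, the element
and attribute names, and the tokens() helper are all made up for the
example. On arbitrary hand-written XML these assumptions would not hold.

```python
import re

# Hypothetical sample: one <w> element per line, as many tokenised
# corpora are laid out. Not taken from any real corpus.
SAMPLE = """<s id="1">
<w pos="DT">The</w>
<w pos="NN">cat</w>
<w pos="VBZ">sleeps</w>
</s>"""

# Assumes exactly one attribute, always double-quoted, and no nested markup.
W_RE = re.compile(r'<w pos="([^"]*)">([^<]*)</w>')


def tokens(text):
    """Yield (word, pos) pairs for every line that contains a <w> element."""
    for line in text.splitlines():
        match = W_RE.search(line)
        if match:
            yield match.group(2), match.group(1)


pairs = list(tokens(SAMPLE))
```

Lines that do not match (here the <s> and </s> lines) are simply
skipped, which is usually what you want when scanning a large corpus.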
Note that the W3C is developing a lesser-known set of tools called
HTML-XML-utils (http://www.w3.org/Tools/HTML-XML-utils/), which
includes a number of utilities that make processing XML files with
line-based Unix programs (grep, sed, awk, ...) a lot easier, e.g.
hxpipe and hxnormalize. I would certainly recommend having a look at
these.
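
The general idea of turning XML into something line-based can be
sketched in Python as follows. This is only an illustration of the
concept: the START/ATTR/TEXT/END line format and the to_event_lines()
helper are invented for this example and are not the actual output
format of hxpipe or any other HTML-XML-utils program.

```python
import xml.parsers.expat


def to_event_lines(xml_text):
    """Return one line per parse event, in an invented, illustrative format."""
    lines = []

    def start(name, attrs):
        lines.append("START " + name)
        for key, value in attrs.items():
            lines.append("ATTR %s %s" % (key, value))

    def end(name):
        lines.append("END " + name)

    def text(data):
        # Skip whitespace-only character data between elements.
        if data.strip():
            lines.append("TEXT " + data.strip())

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous text as a single chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)
    return lines


event_lines = to_event_lines('<s id="1"><w pos="DT">The</w><w pos="NN">cat</w></s>')

# Once every event is on its own line, grep-style filtering is trivial.
words = [line[len("TEXT "):] for line in event_lines if line.startswith("TEXT ")]
```

After such a conversion, each piece of information sits on its own
line, so the usual line-based tools (or a one-line filter like the one
above) can pick out what they need.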
Best,
Milos