[Corpora-List] XML parsers vs regex

maxwell maxwell at umiacs.umd.edu
Mon Jun 30 21:32:01 UTC 2014


On 2014-06-30 16:40, Kilian Evang wrote:
> On 06/30/2014 06:02 PM, Milos Jakubicek wrote:
>> Exactly. Though XML-aware tools (like XPath) look like "the right
>> thing", you should try to avoid them as far as you only can. A regexp
>> will be always faster, simpler, easier to understand for others.
> 
> Easier to understand? I'd say once you have a basic understanding of
> XPath, it is way more readable than regexes. For example:
> 
> Regex: <word pos="([^"]+)
> XPath: //word/@pos
> 
> Plus of course, the regex will break without you noticing and when you
> least expect it.

Indeed.  This regex assumes that all instances of the @pos attr in the 
document being searched use double quote marks; if any of them use 
single quotes, like
     <word pos='Noun'>
the above regex won't capture them.  (And good luck trying to allow 
either single *or* double quotes, if you have to use quotes around the 
entire regex.)
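
For what it's worth, in a language with more than one string-literal syntax the quoting problem can be sidestepped. Below is a minimal Python sketch, assuming a triple-quoted raw string and a backreference so that the closing quote must match the opening one. The names EITHER_QUOTE and pos_value are made up for illustration, and the pattern still only handles the simple '<word pos=...' layout.

```python
import re

# Triple-quoted raw string, so both quote characters can appear unescaped.
# Group 1 remembers which quote opened the value; \1 requires the same
# character to close it.  Group 2 is the attribute value itself.
# Assumption: exactly one space after 'word' and no other attributes.
EITHER_QUOTE = re.compile(r'''<word pos=(["'])(.*?)\1''', re.DOTALL)

def pos_value(tag):
    """Return the value of pos, or None if the pattern does not match."""
    m = EITHER_QUOTE.search(tag)
    return m.group(2) if m else None

if __name__ == "__main__":
    print(pos_value('<word pos="Noun">'))   # double quotes
    print(pos_value("<word pos='Noun'>"))   # single quotes
    print(pos_value('<word pos="Noun\'>'))  # mismatched quotes: no match
```

On a shell command line, where the whole regex has to sit inside one pair of quotes, this gets considerably uglier.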

This regex also assumes that no other attrs ever intervene between 
'word' and 'pos', i.e. it won't work with
     <word script="Arabic" pos="Noun">

And then there are whitespace issues, like spaces before or after the 
equal sign, or multiple spaces between 'word' and 'pos' (or tabs or 
newlines--yes, we have one XML editor that frequently inserts newlines 
in the middle of tags).
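
The failure modes listed so far are easy to check by running the quoted regex over a few spellings of the same tag. This is only a sketch: the sample tags are illustrative (two come from this message, the whitespace ones are my guesses at what such an editor might produce), and NAIVE, SAMPLES and naive_matches are hypothetical names.

```python
import re

# The regex quoted above, exactly as given (no closing quote).
NAIVE = re.compile(r'<word pos="([^"]+)')

# Illustrative sample tags; to an XML parser they all carry pos="Noun".
SAMPLES = {
    "double quotes": '<word pos="Noun">',
    "single quotes": "<word pos='Noun'>",
    "other attr first": '<word script="Arabic" pos="Noun">',
    "spaces around =": '<word pos = "Noun">',
    "newline in tag": '<word\n    pos="Noun">',
}

def naive_matches(samples=SAMPLES):
    """Return {label: captured value or None} for each sample tag."""
    results = {}
    for label, tag in samples.items():
        m = NAIVE.search(tag)
        results[label] = m.group(1) if m else None
    return results

if __name__ == "__main__":
    for label, value in naive_matches().items():
        print(f"{label:18} -> {value}")
```

Only the first sample is captured; the other four are skipped silently, which is the "break without you noticing" part.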

So the "correct" regex would be something like
    <word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"
(not taking into account the possibility of single quoted attributes).  
Which looks pretty unreadable...  And no guarantees that this will 
actually work.

OK, it would probably be better to use \s for
    [ \n\t]
since there are technically other whitespace characters (XML also 
allows a carriage return, for one).  So maybe
    <word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"
Or you could use the word-boundary special character \b:
    <word\b[^>]*\bpos\s*=\s*"([^"]+)"
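
To see how the three candidates differ in practice, here is a small Python sketch that runs each one over the same tags. The patterns are copied unchanged from this message; the test tags, the dictionary keys and the compare helper are made up for illustration, and nothing here is a claim that any of the patterns is safe on real data.

```python
import re

# The three candidate patterns from the text, unchanged.
CANDIDATES = {
    "explicit": re.compile(
        r'<word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"'),
    "backslash-s": re.compile(r'<word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"'),
    "word-boundary": re.compile(r'<word\b[^>]*\bpos\s*=\s*"([^"]+)"'),
}

def compare(tag):
    """Return {pattern name: captured attribute value or None} for one tag."""
    out = {}
    for name, pattern in CANDIDATES.items():
        m = pattern.search(tag)
        # The value is always the last capture group of each pattern.
        out[name] = m.group(pattern.groups) if m else None
    return out

if __name__ == "__main__":
    for tag in ('<word script="Arabic" pos="Noun">',
                '<word\r\n pos="Noun">',    # CRLF line ending inside the tag
                "<word pos='Noun'>",        # single quotes
                '<word data-pos="x">'):     # a different attribute entirely
        print(repr(tag), compare(tag))
```

Two things show up: the explicit character class misses the tag broken by a CRLF, and the \b version reports "x" for data-pos, because the hyphen counts as a word boundary. So the shorter pattern is also the looser one.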

The trailing double quote in my "correct" regex is technically not 
required, since a greedy matcher matching against [^"]+ will capture 
everything up to but not including the "close" quote mark.  But I 
usually include it to make the intent clearer to humans...
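
A quick Python check of that claim, as a sketch only: WITH_CLOSE and WITHOUT_CLOSE are hypothetical names for the \b pattern with and without its final quote, and the truncated tag is a made-up input (say, a tag cut off by line-by-line reading).

```python
import re

# Same pattern, with and without the trailing double quote.
WITH_CLOSE = re.compile(r'<word\b[^>]*\bpos\s*=\s*"([^"]+)"')
WITHOUT_CLOSE = re.compile(r'<word\b[^>]*\bpos\s*=\s*"([^"]+)')

def capture(pattern, text):
    """Return the captured attribute value, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    complete = '<word pos="Noun" script="Arabic">'
    truncated = '<word pos="Noun'   # closing quote never arrives
    for text in (complete, truncated):
        print(repr(text),
              capture(WITH_CLOSE, text),
              capture(WITHOUT_CLOSE, text))
```

On the complete tag both versions capture the same value. They only disagree on the truncated one, where the version without the closing quote still reports a value and the version with it reports nothing.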

Bottom line: regexes are only simpler than XPath if you ignore the 
variability that's possible in XML documents (or can somehow prevent 
it).
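
For comparison, here is a minimal sketch of the parser-based route using Python's standard xml.etree.ElementTree. The sample document and the pos_values helper are made up for illustration. One caveat: ElementTree implements only a subset of XPath and cannot select attribute nodes, so //word/@pos is written as an element query plus .get(); a full XPath engine (lxml's, for instance) can evaluate the original expression directly.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; each <word> spells its pos attribute differently.
SAMPLE = """<sentence>
  <word pos="Noun">w1</word>
  <word pos='Verb'>w2</word>
  <word script="Arabic" pos="Adj">w3</word>
  <word
      pos = "Prep">w4</word>
  <word>w5</word>
</sentence>"""

def pos_values(xml_text):
    """Return the pos attribute of every <word> element that has one."""
    root = ET.fromstring(xml_text)
    # './/word[@pos]' selects descendant <word> elements carrying a pos
    # attribute; the parser has already normalised quotes and whitespace.
    return [w.get("pos") for w in root.iterfind(".//word[@pos]")]

if __name__ == "__main__":
    print(pos_values(SAMPLE))  # ['Noun', 'Verb', 'Adj', 'Prep']
```

Quote style, attribute order and whitespace inside the tag never show up in the query, which is the whole point.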

    Mike Maxwell
