[Corpora-List] XML parsers vs regex

Mon Jun 30 23:54:43 UTC 2014

Sure. If you're trying to write a general-purpose script to extract 
whatever from arbitrary XML, you should, *no doubt*, use a DOM or SAX 
parser and properly treat the XML source as an XML document.

*However*, there are plenty of cases where you're dealing with a limited 
set of XML docs that may or may not use double and single quotes 
interchangeably. In that case it can be worthwhile to cut some corners: 
use (e|f)grep, perl, or whatever, throw out pretty much everything you 
don't care about, and remap the unwelcome quotes to whichever kind you 
prefer, while still extracting what you're interested in. In my 
experience this is the most common application, and blanking out 
newlines (or replacing them with a single space) has been way simpler 
than employing a DOM parser to extract the tiny bits of information that 
I was interested in.

The above point is what I meant to make in my previous reply about 
blanking out newlines.
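
To make that concrete, here is a minimal sketch of the corner-cutting, in 
Python rather than grep/perl. The function name extract_pos and the 
<word pos=...> markup (borrowed from the thread quoted below) are only 
for illustration, and it assumes you know your limited set of docs well 
enough that flattening whitespace and remapping every single quote does 
no harm:

```python
import re

def extract_pos(xml_text):
    """Quick-and-dirty: pull pos values out of <word> tags without a parser."""
    # Blank out newlines and tabs so that no tag spans several lines.
    flat = re.sub(r"[\r\n\t]+", " ", xml_text)
    # Remap the unwelcome quotes. This also rewrites apostrophes in the
    # running text, which is fine only because we throw that text away.
    flat = flat.replace("'", '"')
    # Extract the one attribute we care about, allowing other attributes
    # before it and spaces around the equals sign.
    return re.findall(r'<word\b[^>]*?\spos\s*=\s*"([^"]*)"', flat)

sample = (
    '<s>\n'
    '<word pos="Noun">dog</word>\n'
    '<word script="Latin"\n'
    "   pos = 'Verb'>runs</word>\n"
    '</s>'
)
print(extract_pos(sample))  # prints ['Noun', 'Verb']
```

This will still go wrong on things like a '>' inside an attribute value 
or a double quote inside a single-quoted value, which is exactly the 
kind of corner being cut.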

Obviously, if you intend to do anything more elaborate than that, then you 
should totally consider employing an XML parser (DOM or SAX, depending on 
your needs). This typically requires a fairly decent understanding of the 
underlying XML schema, though, which may or may not be all that 
obvious/accessible.
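
For comparison, a rough sketch of the parser route, using only Python's 
standard library (xml.dom.minidom for DOM, xml.sax for SAX). The 
function and class names are mine, and the <word pos=...> markup is again 
assumed from the thread quoted below; real documents may need namespace 
or DTD handling that is not shown here:

```python
import xml.sax
from xml.dom import minidom

def pos_values_dom(xml_text):
    """DOM: build the whole tree in memory, then query it."""
    doc = minidom.parseString(xml_text)
    return [w.getAttribute("pos")
            for w in doc.getElementsByTagName("word")
            if w.hasAttribute("pos")]

class PosHandler(xml.sax.ContentHandler):
    """SAX: react to start tags as they stream past, keeping no tree."""
    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        if name == "word" and "pos" in attrs:
            self.values.append(attrs["pos"])

def pos_values_sax(xml_text):
    handler = PosHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.values

doc_text = (
    '<s><word pos="Noun">dog</word>'
    '<word script="Latin"\n  pos = \'Verb\'>runs</word>'
    '<word>x</word></s>'
)
print(pos_values_dom(doc_text))  # prints ['Noun', 'Verb']
print(pos_values_sax(doc_text))  # prints ['Noun', 'Verb']
```

Quote style, attribute order and whitespace inside tags are all handled 
by the parser here; the price is that the input must be well-formed.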

Just my two (euro)cents. Cheers,
anders

On Mon, 30 Jun 2014, maxwell wrote:

> On 2014-06-30 16:40, Kilian Evang wrote:
>> On 06/30/2014 06:02 PM, Milos Jakubicek wrote:
>>> Exactly. Though XML-aware tools (like XPath) look like "the right
>>> thing", you should try to avoid them as far as you only can. A regexp
>>> will be always faster, simpler, easier to understand for others.
>> 
>> Easier to understand? I'd say once you have a basic understanding of
>> XPath, it is way more readable than regexes. For example:
>> 
>> Regex: <word pos="([^"]+)
>> XPath: //word/@pos
>> 
>> Plus of course, the regex will break without you noticing and when you
>> least expect it.
>
> Indeed.  This regex assumes that all instances of the @pos attr in the 
> document being searched use double quote marks; if any of them use single 
> quotes, like
>    <word pos='Noun'>
> the above regex won't capture them.  (And good luck trying to allow either 
> single *or* double quotes, if you have to use quotes around the entire 
> regex.)
>
> This regex also assumes that no other attrs ever intervene between 'word' and 
> 'pos', i.e. it won't work with
>    <word script="Arabic" pos="Noun">
>
> And then there are whitespace issues, like spaces before or after the equal 
> sign, or multiple spaces between 'word' and 'pos' (or tabs or newlines--yes, 
> we have one XML editor that frequently inserts newlines in the middle of 
> tags).
>
> So the "correct" regex would be something like
>   <word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"
> (not taking into account the possibility of single quoted attributes).  Which 
> looks pretty unreadable...  And no guarantees that this will actually work.
>
> Ok, it would probably be better to use \s for
>   [ \n\t]
> since there are technically other whitespace characters.  So maybe
>   <word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"
> Or you could use the word-boundary special character \b:
>   <word\b[^>]*\bpos\s*=\s*"([^"]+)"
>
> The trailing double quote in my "correct" regex is technically not required, 
> since a greedy matcher matching against [^"]+ will capture everything up to 
> but not including the "close" quote mark.  But I usually include it to make 
> the intent clearer to humans...
>
> Bottom line: regexes are only simpler than xpath if you ignore the 
> variability that's possible in XML documents (or can somehow prevent it).
>
>   Mike Maxwell
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
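
For what it's worth, the failure modes Mike lists are easy to check. 
Below is a small sketch in Python that runs the original regex and the 
\b variant from his message against a few sample tags. The sample tags 
and the helper name check are made up for the test; the data-pos tag in 
particular is my own addition, not something from the thread:

```python
import re

# The two patterns as given in the quoted message.
NAIVE = re.compile(r'<word pos="([^"]+)')
WITH_B = re.compile(r'<word\b[^>]*\bpos\s*=\s*"([^"]+)"')

SAMPLES = {
    "plain": '<word pos="Noun">',
    "single quotes": "<word pos='Noun'>",
    "extra attribute": '<word script="Arabic" pos="Noun">',
    "newline and spaces": '<word\n   pos = "Noun">',
}

def check(pattern):
    """Map each sample label to whether the pattern finds a match."""
    return {label: bool(pattern.search(tag)) for label, tag in SAMPLES.items()}

print(check(NAIVE))   # only "plain" matches
print(check(WITH_B))  # everything except "single quotes" matches

# A false positive: \b also sits between '-' and 'p', so an attribute
# named data-pos is picked up as if it were pos.
print(WITH_B.search('<word data-pos="X">').group(1))  # prints X
```

So the longer regex copes with the intervening attribute and the 
whitespace, still misses single quotes, and can match things it should 
not.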
