[Corpora-List] XML parsers vs regex
anders bjorkelund
anders at ims.uni-stuttgart.de
Mon Jun 30 23:54:43 UTC 2014
Sure. If you're trying to write a general-purpose script to extract
whatever from arbitrary XML, you should, *no doubt*, use a DOM or SAX
parser and properly treat the XML source as an XML document.
*However*, there are plenty of cases where you're dealing with a limited
set of XML docs that may or may not use double and single quotes
interchangeably. In that case it might be worthwhile to cut some corners
by just using (e|f)grep, perl, or whatever: throw out pretty much
everything you don't care about and remap the unwelcome quotes to
whatever you prefer, while still extracting what you're interested in. In
my experience this is the most common application, and blanking out
newlines (or replacing them with a single space) is way simpler than
employing a DOM parser to extract the tiny bits of information that I was
interested in.
That is the point I meant to make in my previous reply about blanking out
newlines.
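To make the corner-cutting concrete, here is a minimal sketch in Python
rather than grep/perl. The sample text, the name extract_pos and the
pattern are made up for illustration; only the <word pos=...> shape comes
from this thread. It flattens newlines first, then accepts either quote
style by remembering which quote opened the value:

```python
import re

# Hypothetical sample: one tag is split across lines and the two tags
# use different quote styles on purpose.
SAMPLE = """<doc>
<word pos="Noun">house</word>
<word
   pos='Verb'>run</word>
</doc>"""

# Group 1 captures the opening quote (" or '), and \1 requires the same
# character to close the value. Group 2 is the attribute value itself.
POS_RE = re.compile(r"""<word\b[^>]*?\bpos\s*=\s*(["'])(.*?)\1""")

def extract_pos(text):
    # Blank out newlines so that every tag sits on one line.
    flat = re.sub(r"[\r\n]+", " ", text)
    return [m.group(2) for m in POS_RE.finditer(flat)]

print(extract_pos(SAMPLE))
```

The flattening step matters most for line-oriented tools like grep; in
Python it mainly guards against a newline inside the attribute value.
This still cuts corners: it would also pick up an attribute such as
data-pos, and it knows nothing about comments or CDATA sections.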
Obviously, if you intend to do anything more elaborate than that, then you
should totally consider employing an XML parser (DOM or SAX, depending on
your needs). This typically requires a fairly decent understanding of the
underlying XML schema though, which may or may not be all that
obvious/accessible.
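For comparison, a rough sketch of the parser route, using Python's
standard xml.etree.ElementTree (a tree-based parser in the DOM spirit,
not a W3C DOM implementation). The sample document is invented; the one
piece of schema knowledge assumed is that the values live in a pos
attribute on word elements:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the variations discussed in this thread:
# an intervening attribute, a newline inside a tag, spaces around '=',
# and single quotes.
SAMPLE = """<doc>
<word script="Arabic" pos="Noun">house</word>
<word
   pos = 'Verb'>run</word>
</doc>"""

root = ET.fromstring(SAMPLE)

# iter("word") walks every <word> element at any depth, roughly what
# the XPath //word/@pos asks for; elements without a pos attribute
# are skipped.
pos_values = [w.get("pos") for w in root.iter("word")
              if w.get("pos") is not None]

print(pos_values)
```

The quoting and whitespace variations are the parser's problem here, not
yours, but the input has to be well-formed XML or fromstring will raise
a ParseError.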
Just my two (euro)cents. Cheers,
anders
On Mon, 30 Jun 2014, maxwell wrote:
> On 2014-06-30 16:40, Kilian Evang wrote:
>> On 06/30/2014 06:02 PM, Milos Jakubicek wrote:
>>> Exactly. Though XML-aware tools (like XPath) look like "the right
>>> thing", you should try to avoid them as far as you only can. A regexp
>>> will be always faster, simpler, easier to understand for others.
>>
>> Easier to understand? I'd say once you have a basic understanding of
>> XPath, it is way more readable than regexes. For example:
>>
>> Regex: <word pos="([^"]+)
>> XPath: //word/@pos
>>
>> Plus of course, the regex will break without you noticing and when you
>> least expect it.
>
> Indeed. This regex assumes that all instances of the @pos attr in the
> document being searched use double quote marks; if any of them use single
> quotes, like
> <word pos='Noun'>
> the above regex won't capture them. (And good luck trying to allow either
> single *or* double quotes, if you have to use quotes around the entire
> regex.)
>
> This regex also assumes that no other attrs ever intervene between 'word' and
> 'pos', i.e. it won't work with
> <word script="Arabic" pos="Noun">
>
> And then there are whitespace issues, like spaces before or after the equal
> sign, or multiple spaces between 'word' and 'pos' (or tabs or newlines--yes,
> we have one XML editor that frequently inserts newlines in the middle of
> tags).
>
> So the "correct" regex would be something like
> <word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"
> (not taking into account the possibility of single quoted attributes). Which
> looks pretty unreadable... And no guarantees that this will actually work.
>
> Ok, it would probably be better to use \s for
> [ \n\t]
> since there are technically other whitespace characters. So maybe
> <word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"
> Or you could use the word-boundary special character \b:
> <word\b[^>]*\bpos\s*=\s*"([^"]+)"
>
> The trailing double quote in my "correct" regex is technically not required,
> since a greedy matcher matching against [^"]+ will capture everything up to
> but not including the "close" quote mark. But I usually include it to make
> the intent clearer to humans...
>
> Bottom line: regex's are only simpler than xpath if you ignore the
> variability that's possible in XML documents (or can somehow prevent it).
>
> Mike Maxwell
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>