[Corpora-List] XML parsers vs regex

Angus Grieve-Smith grvsmth at panix.com
Tue Jul 1 01:57:28 UTC 2014


That's it exactly.  Should you use an XML parser or regexps?  Yes!

I am currently writing scripts to convert from inline (HTML) to offset 
markup and back.  For the offset-to-inline conversion, I know exactly 
what the XML looks like because it's generated by the inline-to-offset 
scripts, so I can use an XML parser.  But the XML parser I tried for the 
inline-to-offset conversion (Python's ElementTree) didn't seem very 
helpful for calculating offsets, so on some advice from a friend I 
switched to regexps.
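
For what it's worth, here is a minimal sketch of what a regexp-based inline-to-offset pass could look like. It is not my actual script: the names `TAG` and `inline_to_offsets` are made up for illustration, and it assumes well-nested tags, no `>` inside attribute values, and no entity decoding (entities like `&amp;` would be counted at their escaped length).

```python
import re

# Hypothetical pattern: an opening, closing, or self-closing tag.
# Assumes attribute values never contain '>'.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

def inline_to_offsets(markup):
    """Return (plain_text, spans); each span is (tag, start, end),
    with offsets counted in characters of plain_text."""
    text_parts = []
    spans = []
    open_tags = []   # stack of (tag name, start offset)
    pos = 0          # current position in the markup string
    length = 0       # length of the plain text emitted so far
    for m in TAG.finditer(markup):
        chunk = markup[pos:m.start()]      # text between two tags
        text_parts.append(chunk)
        length += len(chunk)
        pos = m.end()
        closing, name, self_closing = m.groups()
        if closing:
            # Assumes well-nested markup: close the most recent open tag.
            open_name, start = open_tags.pop()
            spans.append((open_name, start, length))
        elif self_closing:
            spans.append((name, length, length))   # empty element
        else:
            open_tags.append((name, length))
    text_parts.append(markup[pos:])        # trailing text after the last tag
    return "".join(text_parts), spans

text, spans = inline_to_offsets("The <b>quick</b> brown <i>fox</i>.")
# text  == "The quick brown fox."
# spans == [("b", 4, 9), ("i", 16, 19)]
```

The point of using `finditer` here is that each match object reports where the tag sits in the source string, which is the information a tree-building parser tends to hide.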

PS: Perl, PHP and Python all have flags (/s or DOTALL) that make /./ 
match a newline.
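
A small Python illustration of that flag, using a made-up tag with a newline in the middle (the sample string and variable names are only for the example):

```python
import re

# A tag with a newline inside it, as some XML editors produce.
tag = '<word\n   pos="Noun">'
pattern = r'<word.*pos="([^"]+)"'

# By default '.' does not match '\n', so the search fails.
without_flag = re.search(pattern, tag)

# With re.DOTALL, '.' matches any character, including '\n'.
with_flag = re.search(pattern, tag, flags=re.DOTALL)

# The inline form (?s) at the start of the pattern does the same thing.
inline_flag = re.search(r'(?s)<word.*pos="([^"]+)"', tag)

print(without_flag)          # None
print(with_flag.group(1))    # Noun
print(inline_flag.group(1))  # Noun
```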

On 6/30/2014 7:54 PM, anders bjorkelund wrote:
> Sure. No doubt that if you're trying to write a general purpose script 
> to extract whatever from arbitrary XML, you should, *no doubt*, use a 
> DOM or SAX parser and properly treat the XML source as an XML document.
>
> *However* there are plenty of cases where you're dealing with a 
> limited set of XML docs, that may, or may not use double and single 
> quotes interchangeably. In that case it might be worthwhile to cut 
> some corners by just using (e|f)grep, perl, or whatever, and throw out 
> pretty much everything you don't care about, remapping the unwelcome 
> quotes to whatever you prefer. While still extracting what you're 
> interested in. In my experience this is the most common application, 
> and blanking out newlines (or replacing them with a single space) is 
> way simpler than employing a DOM parser to extract the tiny bits of 
> information that I was interested in.
>
> The above point is what I meant to make in my previous reply about 
> blanking out newlines.
>
>
> Obviously, if you intend to do anything more elaborate than that, then 
> you should totally consider employing an XML parser (DOM or SAX, 
> depending on your needs). This typically involves a fairly decent 
> understanding of the underlying XML schema though, which may or may 
> not be all that obvious/accessible.
>
> Just my two (euro)cents. Cheers,
> anders
>
> On Mon, 30 Jun 2014, maxwell wrote:
>
>> On 2014-06-30 16:40, Kilian Evang wrote:
>>> On 06/30/2014 06:02 PM, Milos Jakubicek wrote:
>>>> Exactly. Though XML-aware tools (like XPath) look like "the right
>>>> thing", you should avoid them as far as you can. A regexp will
>>>> always be faster, simpler, and easier for others to understand.
>>>
>>> Easier to understand? I'd say once you have a basic understanding of
>>> XPath, it is way more readable than regexes. For example:
>>>
>>> Regex: <word pos="([^"]+)
>>> XPath: //word/@pos
>>>
>>> Plus of course, the regex will break without you noticing and when you
>>> least expect it.
>>
>> Indeed.  This regex assumes that all instances of the @pos attr in 
>> the document being searched use double quote marks; if any of them 
>> use single quotes, like
>>    <word pos='Noun'>
>> the above regex won't capture them.  (And good luck trying to allow 
>> either single *or* double quotes, if you have to use quotes around 
>> the entire regex.)
>>
>> This regex also assumes that no other attrs ever intervene between 
>> 'word' and 'pos', i.e. it won't work with
>>    <word script="Arabic" pos="Noun">
>>
>> And then there are whitespace issues, like spaces before or after the 
>> equal sign, or multiple spaces between 'word' and 'pos' (or tabs or 
>> newlines--yes, we have one XML editor that frequently inserts 
>> newlines in the middle of tags).
>>
>> So the "correct" regex would be something like
>>   <word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"
>> (not taking into account the possibility of single quoted 
>> attributes).  Which looks pretty unreadable...  And no guarantees 
>> that this will actually work.
>>
>> Ok, it would probably be better to use \s for
>>   [ \n\t]
>> since there are technically other whitespace characters.  So maybe
>>   <word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"
>> Or you could use the word-boundary special character \b:
>>   <word\b[^>]*\bpos\s*=\s*"([^"]+)"
>>
>> The trailing double quote in my "correct" regex is technically not 
>> required, since a greedy matcher matching against [^"]+ will capture 
>> everything up to but not including the "close" quote mark.  But I 
>> usually include it to make the intent clearer to humans...
>>
>> Bottom line: regexes are only simpler than xpath if you ignore the 
>> variability that's possible in XML documents (or can somehow prevent 
>> it).
>>
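
To make the comparison concrete, here is a rough sketch that runs the simple regex, a more tolerant regex, and a parser over the same small document. The sample document and the variable names are invented for illustration. The tolerant pattern is one possible way to accept either quote style (a backreference to whichever quote opened the value), not a claim that it covers every legal XML variation.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample covering the variations discussed above: single quotes,
# an intervening attribute, and whitespace/newlines inside the tag.
doc = """<s>
  <word pos="Det">the</word>
  <word pos='Noun'>cat</word>
  <word script="Latin" pos="Verb">sat</word>
  <word
     pos = "Prep">on</word>
</s>"""

# The original one-liner from the thread.
simple = re.compile(r'<word pos="([^"]+)')

# A more tolerant variant: (["']) captures the opening quote and \1
# requires the same character to close the value.
robust = re.compile(r'''<word\b[^>]*\bpos\s*=\s*(["'])(.*?)\1''', re.DOTALL)

simple_hits = simple.findall(doc)
robust_hits = [m.group(2) for m in robust.finditer(doc)]

# The parser normalizes quoting and whitespace before we see the attribute.
parser_hits = [w.get("pos") for w in ET.fromstring(doc).iter("word")]

print(simple_hits)   # ['Det']
print(robust_hits)   # ['Det', 'Noun', 'Verb', 'Prep']
print(parser_hits)   # ['Det', 'Noun', 'Verb', 'Prep']
```

The simple pattern silently drops three of the four words, which is the "break without you noticing" failure described earlier in the thread.
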
>>   Mike Maxwell
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>


-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com

