[Corpora-List] XML parsers vs regex
Angus Grieve-Smith
grvsmth at panix.com
Tue Jul 1 01:57:28 UTC 2014
That's it exactly. Should you use an XML parser or regexps? Yes!
I am currently writing scripts to convert from inline (HTML) to offset
markup and back. For the offset-to-inline conversion, I know exactly
what the XML looks like because it's generated by the inline-to-offset
scripts, so I can use an XML parser. But the XML parser I tried for the
inline-to-offset conversion (Python's ElementTree) didn't seem very
helpful for calculating offsets, so on some advice from a friend I
switched to regexps.
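To make the inline-to-offset step concrete, here is a minimal sketch of the regexp approach. It is not my actual script: the names `TAG` and `inline_to_offset` and the `(tag, start, end)` output format are made up for illustration, and it assumes well-nested tags with no self-closing elements, comments, or entities.

```python
import re

# Matches an opening or closing tag; group 1 is "/" for closing tags,
# group 2 is the element name. Attributes are skipped, not parsed.
TAG = re.compile(r"<(/?)(\w+)[^>]*>")

def inline_to_offset(marked_up):
    """Return (plain_text, spans); spans are (tag, start, end) offsets into plain_text."""
    plain = []       # text pieces with the tags removed
    spans = []       # completed (tag, start, end) triples
    open_tags = []   # stack of (tag, start_offset) for tags not yet closed
    pos = 0          # read position in the marked-up string
    offset = 0       # current length of the plain text
    for m in TAG.finditer(marked_up):
        text = marked_up[pos:m.start()]
        plain.append(text)
        offset += len(text)
        pos = m.end()
        if m.group(1):
            # Closing tag: assumed to match the most recently opened tag.
            name, start = open_tags.pop()
            spans.append((name, start, offset))
        else:
            open_tags.append((m.group(2), offset))
    plain.append(marked_up[pos:])
    return "".join(plain), spans

plain, spans = inline_to_offset("The <b>quick <i>brown</i></b> fox")
# plain == "The quick brown fox"
# spans == [("i", 10, 15), ("b", 4, 15)]
```

The offsets count Python characters (code points), not bytes, so they would need adjusting for a tool that expects byte offsets.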
PS: Perl, PHP and Python all have flags (/s or DOTALL) that make /./
match a newline.
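A quick illustration of that flag in Python, using a made-up tag with a newline in the middle of it (the sample string and the pattern are only for demonstration):

```python
import re

# A tag broken across two lines, as some XML editors produce.
tag = '<word script="Arabic"\n      pos="Noun">'

pattern = r'<word\b.*?\bpos\s*=\s*"([^"]+)"'

# Without the flag, "." stops at the newline, so the pattern cannot
# get from "<word" to "pos" and the search fails.
without_flag = re.search(pattern, tag)

# With re.DOTALL (the equivalent of Perl's /s), "." also matches "\n".
with_flag = re.search(pattern, tag, re.DOTALL)

print(without_flag)        # None
print(with_flag.group(1))  # Noun
```

One caveat: with DOTALL, `.*?` can run past the end of the tag into later elements. A negated class such as `[^>]*` matches newlines without any flag and stops at the closing `>`, so it is probably the safer choice here.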
On 6/30/2014 7:54 PM, anders bjorkelund wrote:
> Sure. No doubt that if you're trying to write a general purpose script
> to extract whatever from arbitrary XML, you should, *no doubt*, use a
> DOM or SAX parser and properly treat the XML source as an XML document.
>
> *However* there are plenty of cases where you're dealing with a
> limited set of XML docs that may or may not use double and single
> quotes interchangeably. In that case it might be worthwhile to cut
> some corners by just using (e|f)grep, perl, or whatever, and throw out
> pretty much everything you don't care about, remapping the unwelcome
> quotes to whatever you prefer, while still extracting what you're
> interested in. In my experience this is the most common application,
> and blanking out newlines (or replacing them with a single space) is
> way simpler than employing a DOM parser to extract the tiny bits of
> information that I was interested in.
>
> The above point is what I meant to make in my previous reply about
> blanking out newlines.
>
>
> Obviously, if you intend to do anything more elaborate than that, then
> you should totally consider employing an XML parser (DOM or SAX,
> depending on your needs). This typically involves a fairly decent
> understanding of the underlying XML schema though, which may or may
> not be all that obvious/accessible.
>
> Just my two (euro)cents. Cheers,
> anders
>
> On Mon, 30 Jun 2014, maxwell wrote:
>
>> On 2014-06-30 16:40, Kilian Evang wrote:
>>> On 06/30/2014 06:02 PM, Milos Jakubicek wrote:
>>>> Exactly. Though XML-aware tools (like XPath) look like "the right
>>>> thing", you should avoid them as far as you can. A regexp will
>>>> always be faster, simpler, and easier for others to understand.
>>>
>>> Easier to understand? I'd say once you have a basic understanding of
>>> XPath, it is way more readable than regexes. For example:
>>>
>>> Regex: <word pos="([^"]+)
>>> XPath: //word/@pos
>>>
>>> Plus of course, the regex will break without you noticing and when you
>>> least expect it.
>>
>> Indeed. This regex assumes that all instances of the @pos attr in
>> the document being searched use double quote marks; if any of them
>> use single quotes, like
>> <word pos='Noun'>
>> the above regex won't capture them. (And good luck trying to allow
>> either single *or* double quotes, if you have to use quotes around
>> the entire regex.)
>>
>> This regex also assumes that no other attrs ever intervene between
>> 'word' and 'pos', i.e. it won't work with
>> <word script="Arabic" pos="Noun">
>>
>> And then there are whitespace issues, like spaces before or after the
>> equal sign, or multiple spaces between 'word' and 'pos' (or tabs or
>> newlines--yes, we have one XML editor that frequently inserts
>> newlines in the middle of tags).
>>
>> So the "correct" regex would be something like
>> <word([ \n\t]+[^>]+)*[ \n\t]+pos[ \n\t]*=[ \n\t]*"([^"]+)"
>> (not taking into account the possibility of single quoted
>> attributes). Which looks pretty unreadable... And no guarantees
>> that this will actually work.
>>
>> Ok, it would probably be better to use \s for
>> [ \n\t]
>> since there are technically other whitespace characters. So maybe
>> <word(\s+[^>]+)*\s+pos\s*=\s*"([^"]+)"
>> Or you could use the word-boundary special character \b:
>> <word\b[^>]*\bpos\s*=\s*"([^"]+)"
>>
>> The trailing double quote in my "correct" regex is technically not
>> required, since a greedy matcher matching against [^"]+ will capture
>> everything up to but not including the "close" quote mark. But I
>> usually include it to make the intent clearer to humans...
>>
>> Bottom line: regexes are only simpler than XPath if you ignore the
>> variability that's possible in XML documents (or can somehow prevent
>> it).
>>
>> Mike Maxwell
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
--
-Angus B. Grieve-Smith
grvsmth at panix.com