[Corpora-List] XML parsers vs regex

Mon Jun 30 18:40:29 UTC 2014

Whatever they are currently at, let's up all labs by 1 and lectures by 2.

Mary Elaine Califf, PhD
Director/Associate Professor
School of Information Technology
Illinois State University
mecalif at ilstu.edu

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Mark A. Greenwood
Sent: Monday, June 30, 2014 11:11 AM
To: Milos Jakubicek; corpora at uib.no
Subject: Re: [Corpora-List] XML parsers vs regex

On 30/06/14 17:02, Milos Jakubicek wrote:
> 2014-06-30 16:13 GMT+02:00 Darren Cook <darren at dcook.org>:
>
>> But apart from that, using regexes is normally fine for corpora,
> Exactly. Though XML-aware tools (like XPath) look like "the right 
> thing", you should try to avoid them as far as you can. A regexp 
> will always be faster, simpler, and easier for others to understand.
I'd strongly disagree with this.

Using regexps is fine when you know that your data is all from the same source with *exactly* the same formatting, but as soon as you have multiple people/groups/tools producing data, differences will creep in, even if they are supposed to be producing identical output, and the regexps will fail. Someone only has to add an attribute to a tag to break some of these regex examples. Treating XML as "strings with angle brackets" (for generation or parsing) is never a good idea. If the file is supposed to be XML, use an XML parser; it will be easier and faster in the long run.
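
To make the attribute point concrete, here is a minimal sketch in Python. The <w pos="..."> markup, the sample sentences and the function names are all made up for illustration (they are not taken from anyone's corpus in this thread); only the standard-library re and xml.etree.ElementTree modules are used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical corpus fragments. The second is what a different tool might
# emit: the same data, but with an extra "lemma" attribute on the first <w>.
SAMPLES = [
    '<s><w pos="NN">corpus</w><w pos="VBZ">grows</w></s>',
    '<s><w pos="NN" lemma="corpus">corpus</w><w pos="VBZ">grows</w></s>',
]

# A regex written against the first format only: it assumes that "pos" is
# the one and only attribute on every <w> element.
WORD_RE = re.compile(r'<w pos="(\w+)">([^<]*)</w>')


def words_by_regex(xml_text):
    """Extract (word, pos) pairs by treating the XML as a plain string."""
    return [(text, pos) for pos, text in WORD_RE.findall(xml_text)]


def words_by_parser(xml_text):
    """Extract (word, pos) pairs with a real XML parser."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


if __name__ == "__main__":
    for sample in SAMPLES:
        print("regex :", words_by_regex(sample))
        print("parser:", words_by_parser(sample))
```

On the second sample the regex does not raise an error; it just quietly skips the word whose tag gained an attribute, while the parser returns both words for both samples. That silent loss is the failure mode I am worried about.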

Mark

>
> Note that the W3C is developing a somewhat less known set of tools 
> called HTML-XML-Utils (http://www.w3.org/Tools/HTML-XML-utils/) which 
> includes a number of utilities that make processing XML files with 
> line-based unix programs (grep, sed, awk, ...) a lot easier, e.g.
> hxpipe and hxnormalize. I would certainly recommend having a look at 
> these.
>
> Best,
> Milos
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora