[Corpora-List] XML parsers vs regex

maxwell maxwell at umiacs.umd.edu
Mon Jun 30 20:25:09 UTC 2014


On 2014-06-30 16:13, Mark A. Greenwood wrote:
> hmm, not sure that's sensible. In this example you might end up with
> extracting a date of "weekafter" if you removed the line breaks.
> 
>  The moral of the story is ... if you have XML use an XML parser to
> extract the data. Yes it might take slightly longer to start with, but
> the time spent will be re-paid by not having to worry about weird edge
> cases and formatting issues,

I cannot be accused of supporting the use of regular expressions for 
finding things in XML, so I'll agree with you, Mark.  (My normal tool 
for that kind of search is xml_grep, with Python SAX and DOM libraries 
for anything that's not one-off.)

That said, I have been known to do
      cat [input file(s)] \
    | tr '\n' ' ' \
    | sed -e 's%</[^>]*>%&\n%g' \
    | grep ...
when I'm desperate. The tr replaces all newlines with a space char
(to prevent "weekafter"; sed reads a line at a time, so it never sees
the newlines itself), the sed expr puts a newline after every close
tag (since otherwise grep would just output its input--one very
l o n g line; the \n in the replacement assumes GNU sed), and then
you can grep.

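This is only usable on very simple things. For example, if you're
looking for "week after next" and the input is
"...week <emphasis>after</emphasis> next", this won't work: the tags
sit in the middle of the phrase, and the line gets split right after
the close tag.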
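
For comparison, a parser-based search copes with that inline-markup
case. What follows is only a minimal sketch, assuming Python's standard
xml.etree.ElementTree (a swap-in for the SAX and DOM libraries mentioned
above, chosen for brevity). The sample document, the <p> element name,
and the helper names flat_text and matching_paragraphs are made up for
illustration.

```python
import xml.etree.ElementTree as ET


def flat_text(elem):
    """Text of elem and all its descendants, with whitespace runs collapsed."""
    # itertext() yields every text fragment inside elem in document order,
    # so text wrapped in inline tags such as <emphasis> is included.
    raw = "".join(elem.itertext())
    # split()/join turns newlines and repeated spaces into single spaces,
    # so a line break between words cannot produce "weekafter".
    return " ".join(raw.split())


def matching_paragraphs(xml_text, phrase, tag="p"):
    """Return the flattened text of each <tag> element containing phrase."""
    root = ET.fromstring(xml_text)
    return [flat_text(e) for e in root.iter(tag) if phrase in flat_text(e)]


# Hypothetical input: the phrase is broken by a newline and by inline markup.
sample = """<doc>
<p>We will meet the week
<emphasis>after</emphasis> next.</p>
<p>Nothing to see here.</p>
</doc>"""

print(matching_paragraphs(sample, "week after next"))
```

The search runs on each element's text content rather than on the raw
markup, so neither the line break nor the <emphasis> tags get in the way.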

I'm not desperate today.

    Mike Maxwell
