[Corpora-List] XML parsers vs regex
maxwell
maxwell at umiacs.umd.edu
Mon Jun 30 20:25:09 UTC 2014
On 2014-06-30 16:13, Mark A. Greenwood wrote:
> hmm, not sure that's sensible. In this example you might end up with
> extracting a date of "weekafter" if you removed the line breaks.
>
> The moral of the story is ... if you have XML use an XML parser to
> extract the data. Yes it might take slightly longer to start with, but
> the time spent will be re-paid by not having to worry about weird edge
> cases and formatting issues,
I cannot be accused of supporting the use of regular expressions for
finding things in XML, so I'll agree with you, Mark. (My normal tool
for that kind of search is xml_grep, with Python SAX and DOM libraries
for anything that's not one-off.)
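
For the parser route, a minimal sketch with Python's standard xml.sax
module might look like the following. The element name "date", the
DateHandler class, and the extract_dates helper are hypothetical, chosen
only to match the date-extraction example under discussion; they are not
from any real corpus schema.

```python
import xml.sax


class DateHandler(xml.sax.ContentHandler):
    """Collect the text of every <date> element (hypothetical element name)."""

    def __init__(self):
        super().__init__()
        self.dates = []
        self._buf = None  # None while we are outside a <date> element

    def startElement(self, name, attrs):
        if name == "date":
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "date" and self._buf is not None:
            # Collapse whitespace runs (including line breaks) to one space.
            self.dates.append(" ".join("".join(self._buf).split()))
            self._buf = None


def extract_dates(xml_text):
    handler = DateHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.dates


sample = "<doc><date>the week\nafter next</date></doc>"
print(extract_dates(sample))
```

Because the whitespace is normalized after the parser has handed over
the text, a line break inside the element becomes a space rather than
vanishing. The sketch does not handle a <date> nested inside another
<date>.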
That said, I have been known to do
    cat [input file(s)] \
      | tr '\n' ' ' \
      | sed -e 's%</[^>]*>%&\n%g' \
      | grep ...
when I'm desperate. The tr replaces all newlines with a space char (to
prevent "weekafter"); sed can't do that with a plain s/\n/ /, since it
reads its input a line at a time and never sees the newline. The sed
expr then puts a newline after every close tag (since otherwise grep
would just output its input--one very l o n g line), and then you can
grep. (The \n in the replacement assumes GNU sed.)
This is only useable on very simple things, e.g. if you're looking for
"week after next" and the input is "...week <emphasis>after</emphasis>
next", this won't work.
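
By contrast, a parser-based search can look through inline markup. The
following is a rough sketch using Python's standard
xml.etree.ElementTree; the helper names element_text and find_phrase
are made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """All text inside elem, ignoring child tags, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def find_phrase(xml_text, phrase):
    """Return the tags of all elements whose combined text contains phrase."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if phrase in element_text(el)]


sample = "<p>See you the week <emphasis>after</emphasis>\nnext.</p>"
print(find_phrase(sample, "week after next"))
```

Note that every ancestor of a matching element matches too, so on a
real document you would probably want to restrict the search to
particular element types.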
I'm not desperate today.
Mike Maxwell