[Corpora-List] XML parsers vs regex
anders bjorkelund
anders at ims.uni-stuttgart.de
Mon Jun 30 20:01:44 UTC 2014
The only purpose of line breaks in XML is to increase human readability
anyway. The first thing I do when I extract stuff from XML with regexps is
to get rid of all \r and \n (by substituting them with the empty string),
so you don't have to think about line breaks at all. Might be somewhat
suboptimal, but typically speed isn't an issue.
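
A minimal sketch of the two approaches discussed in this thread, in Python
(the language Matías mentions). The sample string and the function names are
made up for illustration; only the <date> element and the two patterns come
from the thread.

```python
import re

# Hypothetical sample: one <date> element spans two lines.
xml = "<doc>\n<date>week\r\n after</date>\n<date>today</date>\n</doc>"

def dates_after_stripping(text):
    # Remove every \r and \n first, then match lazily on the single long line.
    flat = re.sub(r"[\r\n]", "", text)
    return re.findall(r"<date>(.*?)</date>", flat)

def dates_with_negated_class(text):
    # [^<] matches any character except '<', including line breaks,
    # so no preprocessing is needed (assumes no nested markup in <date>).
    return re.findall(r"<date>([^<]+)</date>", text)

print(dates_after_stripping(xml))
print(dates_with_negated_class(xml))
```

Without the stripping step, <date>.*?</date> skips the first element here,
because '.' does not match \n in Python unless re.DOTALL is set. Also note
that substituting the empty string fuses two words that are separated only
by a line break; substituting a single space instead avoids that.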
anders
On Mon, 30 Jun 2014, Matías Guzmán Naranjo wrote:
> [^<] works for me in Python
>
>
> 2014-06-30 21:44 GMT+02:00 maxwell <maxwell at umiacs.umd.edu>:
> On 2014-06-30 15:33, Phil Gooch wrote:
> On Mon, Jun 30, 2014 at 7:08 PM, Matías Guzmán Naranjo
> <mortem.dei at gmail.com> wrote:
>
> wouldn't just writing <date>.*?</date> get me 'week after'?
>
>
> I'd go for
>
> <date>[^<]+</date>
>
> which will consume line breaks. Of course, this assumes that date only
> contains text and no other markup.
>
>
> Again, my knowledge of grep is probably dated. But I just tried the above, and it didn't work (it did not consume
> line breaks, so it couldn't find things that were on two successive lines). Are you using some command line
> parameter on grep that allows it to search across successive lines?
>
> Mike Maxwell
>
>
>
>