[Corpora-List] XML parsers vs regex

anders bjorkelund anders at ims.uni-stuttgart.de
Mon Jun 30 20:01:44 UTC 2014


The only purpose of line breaks in XML is to increase human readability 
anyway. The first thing I do when I extract stuff from XML with regexps is 
to get rid of all \r and \n (by substituting them with the empty string), 
so that I don't have to think about them at all. Might be somewhat 
suboptimal, but typically speed isn't an issue.
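
A minimal Python sketch of what I mean, next to the [^<] variant from the 
quoted thread. The sample string and variable names are made up for 
illustration, and it assumes <date> contains only text and no nested markup:

```python
import re

# Hypothetical sample: one <date> element is split across two lines.
xml = "<doc>\n<date>week\nafter</date>\n<date>Monday</date>\n</doc>"

# Strip all \r and \n first, then a plain non-greedy match is enough.
flat = xml.replace("\r", "").replace("\n", "")
dates_flat = re.findall(r"<date>(.*?)</date>", flat)

# Alternative from the thread: a negated character class also matches
# newlines, so the input can be left untouched.
dates_neg = re.findall(r"<date>([^<]+)</date>", xml)

# For comparison: without re.DOTALL, '.' does not match '\n', so the
# element that spans two lines is missed on the unflattened input.
dates_dot = re.findall(r"<date>(.*?)</date>", xml)

print(dates_flat)  # ['weekafter', 'Monday']
print(dates_neg)   # ['week\nafter', 'Monday']
print(dates_dot)   # ['Monday']
```

Note that substituting with the empty string fuses "week" and "after" into 
one token. If the break separates words, substituting a single space 
instead is probably the safer choice.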

anders

On Mon, 30 Jun 2014, Matías Guzmán Naranjo wrote:

> [^<] works for me in Python
> 
> 
> 2014-06-30 21:44 GMT+02:00 maxwell <maxwell at umiacs.umd.edu>:
>       On 2014-06-30 15:33, Phil Gooch wrote:
>             On Mon, Jun 30, 2014 at 7:08 PM, Matías Guzmán Naranjo
>             <mortem.dei at gmail.com> wrote:
>
>                   wouldn't just writing <date>.*?</date> get me 'week after'?
> 
>
>             I'd go for
>
>             <date>[^<]+</date>
>
>             which will consume line breaks. Of course, this assumes that date only
>             contains text and no other markup.
> 
> 
> Again, my knowledge of grep is probably dated.  But I just tried the above, and it didn't work (it did not consume
> line breaks, so it couldn't find things that were on two successive lines).  Are you using some command line
> parameter on grep that allows it to search across successive lines?
> 
>    Mike Maxwell
> 
> 
> 
>