[Corpora-List] XML parsers vs regex

Mark A. Greenwood m.greenwood at dcs.shef.ac.uk
Mon Jun 30 20:13:25 UTC 2014


Hmm, I'm not sure that's sensible. In this example, removing the line
breaks could leave you extracting a date of "weekafter".
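
To make that concrete, here is a minimal Python sketch. The snippet is a
hypothetical stand-in for the example from earlier in the thread (assumed
here to be a <date> element whose text is split across two lines), and
the variable names are only for illustration.

```python
import re

# Hypothetical stand-in for the example discussed in the thread:
# a <date> element whose text is split across two lines.
snippet = "<date>week\nafter</date>"

# Deleting line breaks outright (replacing them with the empty string)
# glues the two words together.
stripped = re.sub(r"[\r\n]", "", snippet)
glued = re.search(r"<date>([^<]+)</date>", stripped).group(1)

# Replacing each run of whitespace with a single space keeps the
# word boundary intact.
normalised = re.sub(r"\s+", " ", snippet)
spaced = re.search(r"<date>([^<]+)</date>", normalised).group(1)

print(repr(glued))   # 'weekafter'
print(repr(spaced))  # 'week after'
```

So if you do go the regex route, substituting a space rather than the
empty string at least avoids this particular problem.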

The moral of the story is: if you have XML, use an XML parser to
extract the data. Yes, it might take slightly longer to start with, but
the time spent will be repaid by not having to worry about weird edge
cases and formatting issues.
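
As a rough sketch of what that can look like in Python, using the
standard library's xml.etree.ElementTree: the sample document, the
element names other than <date>, and the helper name date_texts are all
assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the <date> element comes from the thread.
doc = """<events>
  <event><date>week
after</date></event>
  <event><date>next <b>Monday</b></date></event>
</events>"""


def date_texts(root):
    """Return the text of every <date> element, whitespace-normalised."""
    results = []
    for date in root.iter("date"):
        # itertext() also picks up text inside nested markup,
        # which a pattern like [^<]+ would not match.
        raw = "".join(date.itertext())
        # split() with no arguments splits on any whitespace run,
        # so line breaks become single spaces.
        results.append(" ".join(raw.split()))
    return results


root = ET.fromstring(doc)
dates = date_texts(root)
print(dates)  # ['week after', 'next Monday']
```

Line breaks and nested markup inside the element are both handled
without any special-casing in a pattern.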

Mark

On 30/06/14 21:01, anders bjorkelund wrote:
> The only purpose of line breaks in XML is to increase human
> readability. The first thing I do when I extract stuff from XML
> with regexps is to get rid of all \r and \n by substituting them with
> the empty string, so you don't have to think about them.
> Might be somewhat suboptimal, but typically speed isn't an issue.
>
> anders
>
> On Mon, 30 Jun 2014, Matías Guzmán Naranjo wrote:
>
>> [^<] works for me in Python
>>
>>
>> 2014-06-30 21:44 GMT+02:00 maxwell <maxwell at umiacs.umd.edu>:
>>       On 2014-06-30 15:33, Phil Gooch wrote:
>>             On Mon, Jun 30, 2014 at 7:08 PM, Matías Guzmán Naranjo
>>             <mortem.dei at gmail.com> wrote:
>>
>>                   wouldn't just writing <date>.*?</date> get me 'week after'?
>>
>>
>>             I'd go for
>>
>>             <date>[^<]+</date>
>>
>>             which will consume line breaks. Of course, this assumes that
>>             date only contains text and no other markup.
>>
>>
>> Again, my knowledge of grep is probably dated.  But I just tried the
>> above, and it didn't work (it did not consume
>> line breaks, so it couldn't find things that were on two successive
>> lines).  Are you using some command line
>> parameter on grep that allows it to search across successive lines?
>>
>>    Mike Maxwell
>>
>>
>>
>>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
