[Corpora-List] XML parsers vs regex

Matías Guzmán Naranjo mortem.dei at gmail.com
Mon Jun 30 18:08:39 UTC 2014


Wouldn't just writing <date>.*?</date> get me 'week after'?
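
A quick check suggests the answer depends on one flag. This is only a sketch in Python (my choice of language here, nothing in the thread fixes one): by default `.` does not match a line break, so the pattern misses the wrapped element unless re.DOTALL is set. The sample text is Darren's example quoted below, with the line breaks kept as shown.

```python
import re

# Darren's example from the quoted message, line breaks kept as shown.
text = (
    "I am off to <place>London</place> <date>tomorrow</date>, and then\n"
    "<place>Cambridge</place> with <person>Mary</person> the <date>week\n"
    "after</date>.\n"
)

pattern = r"<date>(.*?)</date>"

# Default: '.' does not match '\n', so the element split across lines is missed.
default_hits = re.findall(pattern, text)

# With re.DOTALL, '.' also matches '\n', so both elements are found.
dotall_hits = re.findall(pattern, text, flags=re.DOTALL)

print(default_hits)  # ['tomorrow']
print(dotall_hits)   # ['tomorrow', 'week\nafter']
```

Note that the second hit still contains the line break, so some whitespace clean-up is needed either way.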

I really can do everything I need with regular expressions. The question is
more about what is easier in the long run. Sometimes I feel I'm writing
too many 'for's and 'if's...
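
For comparison, a parser-based version of the same extraction might look roughly like the sketch below. It assumes Python's standard xml.etree.ElementTree module and wraps the fragment in a hypothetical <s> root element, because the quoted example has no single root of its own; both choices are mine for illustration.

```python
import xml.etree.ElementTree as ET

# Darren's example from the quoted message, line breaks kept as shown.
fragment = (
    "I am off to <place>London</place> <date>tomorrow</date>, and then\n"
    "<place>Cambridge</place> with <person>Mary</person> the <date>week\n"
    "after</date>.\n"
)

# The fragment has no root element, so wrap it in a hypothetical <s> root
# (an assumption for this sketch; a real corpus file would have its own root).
root = ET.fromstring("<s>" + fragment + "</s>")

# iter("date") visits every <date> element in document order;
# split() + join() collapses the line break inside 'week\nafter'.
dates = [" ".join(el.text.split()) for el in root.iter("date")]
places = [el.text for el in root.iter("place")]

print(dates)   # ['tomorrow', 'week after']
print(places)  # ['London', 'Cambridge']
```

Switching from dates to places is a one-word change here, which is the kind of thing that might cut down on the 'for's and 'if's.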


2014-06-30 18:16 GMT+02:00 maxwell <maxwell at umiacs.umd.edu>:

> On 2014-06-30 10:13, Darren Cook wrote:
>
>> E.g. if your document looks like this, I'd rather use a regex to find
>> the proper nouns:
>>
>>   I am off to <place>London</place> <date>tomorrow</date>, and then
>> <place>Cambridge</place> with <person>Mary</person> the <date>week
>> after</date>.
>>
>
> But if you wanted to find all the <date>...</date> elements, and the line
> breaks are as shown, a regex by itself isn't going to work (in particular,
> it won't find 'week after').  You need a parser, or else you need to do
> some normalization of the XML (making sure line breaks don't occur inside
> the XML elements of interest).  And if you're going to normalize the XML
> anyway, you might be better off using an XML parser in the first place.
>
>    Mike Maxwell
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

