[Corpora-List] XML parsers vs regex
Mark A. Greenwood
m.greenwood at dcs.shef.ac.uk
Mon Jun 30 18:16:13 UTC 2014
On 30/06/14 19:08, Matías Guzmán Naranjo wrote:
> wouldn't just writing <date>.*?</date> get me 'week after'?
That would depend on what options your regexp engine was using. By
default many of them don't let . match newline characters.
Mark
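
For illustration, here is a minimal sketch in Python of the behaviour
discussed in this thread. It assumes Python's standard re and
xml.etree.ElementTree modules; the sample text is the example quoted
below, with its line breaks kept, and the wrapper element name "s" is
a hypothetical addition needed only to give the parser a single root.
Other regex engines name the equivalent option differently.

```python
import re
import xml.etree.ElementTree as ET

# The example sentence from the thread, with the line breaks as shown.
text = (
    "I am off to <place>London</place> <date>tomorrow</date>, and\n"
    "then\n"
    "<place>Cambridge</place> with <person>Mary</person> the <date>week\n"
    "after</date>."
)

# Default options: '.' does not match a newline, so the second
# <date> element, which spans a line break, is missed.
default_hits = re.findall(r"<date>(.*?)</date>", text)

# With re.DOTALL, '.' also matches newlines, so both elements are found.
dotall_hits = re.findall(r"<date>(.*?)</date>", text, flags=re.DOTALL)

# Collapse internal whitespace so 'week\nafter' reads as 'week after'.
normalized = [" ".join(hit.split()) for hit in dotall_hits]

# Parser alternative: wrap the fragment in a root element ("s" is an
# assumed name, not from the thread) and walk the <date> elements.
root = ET.fromstring("<s>" + text + "</s>")
parsed = [" ".join(el.text.split()) for el in root.iter("date")]

print(default_hits)
print(normalized)
print(parsed)
```

Either route recovers both dates on this sample; the parser version
does not depend on where the line breaks happen to fall.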
>
> I really can do everything I need with regular expressions. The
> question is more about what is easier in the long run. Sometimes I
> feel I'm writing too many 'for's and 'if's...
>
>
> 2014-06-30 18:16 GMT+02:00 maxwell <maxwell at umiacs.umd.edu
> <mailto:maxwell at umiacs.umd.edu>>:
>
> On 2014-06-30 10:13, Darren Cook wrote:
>
> E.g. if your document looks like this, I'd rather use a regex
> to find
> the proper nouns:
>
> I am off to <place>London</place> <date>tomorrow</date>, and
> then
> <place>Cambridge</place> with <person>Mary</person> the <date>week
> after</date>.
>
>
> But if you wanted to find all the <date>...</date> elements, and
> the line breaks are as shown, a regex by itself isn't going to
> work (in particular, it won't find 'week after'). You need a
> parser, or else you need to do some normalization of the XML
> (making sure line breaks don't occur inside the XML elements of
> interest). And if you're going to normalize the XML anyway, you
> might be better off using an XML parser in the first place.
>
> Mike Maxwell
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no <mailto:Corpora at uib.no>
> http://mailman.uib.no/listinfo/corpora