[Corpora-List] XML parsers vs regex

Mark A. Greenwood m.greenwood at dcs.shef.ac.uk
Mon Jun 30 18:16:13 UTC 2014


On 30/06/14 19:08, Matías Guzmán Naranjo wrote:
> wouldn't just writing <date>.*?</date> get me 'week after'?
That would depend on what options your regexp engine was using. By
default many of them don't let . match newline characters, so the
pattern would miss 'week after' where it is split across two lines.
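
As a minimal sketch of the difference, assuming Python's re module (the
thread doesn't say which regex engine is in use) and Darren's example
sentence with line breaks roughly as quoted below. The variable names
are purely illustrative.

```python
import re

# Darren's example sentence; the second <date> element is split
# across a line break (the exact break positions are an assumption).
text = (
    "I am off to <place>London</place> <date>tomorrow</date>, and then\n"
    "<place>Cambridge</place> with <person>Mary</person> the <date>week\n"
    "after</date>."
)

pattern = r"<date>(.*?)</date>"

# Default behaviour: '.' does not match '\n', so the element that
# spans two lines is not found.
default_dates = re.findall(pattern, text)

# With re.DOTALL, '.' also matches '\n', so both elements are found.
dotall_dates = re.findall(pattern, text, flags=re.DOTALL)

# Collapse the embedded line break into a single space.
normalized_dates = [" ".join(d.split()) for d in dotall_dates]

print(default_dates)
print(normalized_dates)
```

Other engines have an equivalent switch under a different name (often
called "dotall" or "single-line" mode), so check the documentation for
whichever one you are using.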

Mark

>
> I really can do everything I need with regular expressions. The
> question is more about what is easier in the long run. Sometimes I
> feel I'm writing too many 'for's and 'if's...
>
>
> 2014-06-30 18:16 GMT+02:00 maxwell <maxwell at umiacs.umd.edu>:
>
>     On 2014-06-30 10:13, Darren Cook wrote:
>
>         E.g. if your document looks like this, I'd rather use a regex
>         to find
>         the proper nouns:
>
>           I am off to <place>London</place> <date>tomorrow</date>, and
>         then
>         <place>Cambridge</place> with <person>Mary</person> the <date>week
>         after</date>.
>
>
>     But if you wanted to find all the <date>...</date> elements, and
>     the line breaks are as shown, a regex by itself isn't going to
>     work (in particular, it won't find 'week after').  You need a
>     parser, or else you need to do some normalization of the XML
>     (making sure line breaks don't occur inside the XML elements of
>     interest).  And if you're going to normalize the XML anyway, you
>     might be better off using an XML parser in the first place.
>
>        Mike Maxwell
>
>
>     _______________________________________________
>     UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>     Corpora mailing list
>     Corpora at uib.no <mailto:Corpora at uib.no>
>     http://mailman.uib.no/listinfo/corpora
>
