[Corpora-List] XML parsers vs regex

maxwell maxwell at umiacs.umd.edu
Mon Jun 30 16:16:21 UTC 2014


On 2014-06-30 10:13, Darren Cook wrote:
> E.g. if your document looks like this, I'd rather use a regex to find
> the proper nouns:
> 
>   I am off to <place>London</place> <date>tomorrow</date>, and then
> <place>Cambridge</place> with <person>Mary</person> the <date>week
> after</date>.

But if you wanted to find all the <date>...</date> elements, and the
line breaks are as shown, a regex applied line by line isn't going to
work (in particular, it won't find 'week after', because that element
spans two lines).  You need a parser, or else you need to normalize the
XML first (making sure line breaks don't occur inside the XML elements
of interest).  And if you're going to normalize the XML anyway, you
might be better off using an XML parser in the first place.
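
To make the point concrete, here is a minimal Python sketch, not a
definitive implementation.  It assumes the text is read one line at a
time, and it wraps the sentence in a hypothetical <s> root element
(not in the original example) because a parser needs a single root.
The names date_re, regex_dates and parser_dates are purely illustrative.

```python
import re
import xml.etree.ElementTree as ET

# The sentence from the quoted message, with the same line breaks.
# Assumption: the <s> wrapper is added here only to give the parser a root.
text = (
    "<s>I am off to <place>London</place> <date>tomorrow</date>, and then\n"
    "<place>Cambridge</place> with <person>Mary</person> the <date>week\n"
    "after</date>.</s>"
)

date_re = re.compile(r"<date>(.*?)</date>")

# Line-by-line regex: an element split across two lines never matches.
regex_dates = [m for line in text.splitlines() for m in date_re.findall(line)]

# XML parser: a line break inside an element is just character data,
# so both elements are found; whitespace is collapsed afterwards.
root = ET.fromstring(text)
parser_dates = [" ".join(el.text.split()) for el in root.iter("date")]

print(regex_dates)   # ['tomorrow']
print(parser_dates)  # ['tomorrow', 'week after']
```

A regex run over the whole string with a pattern that can cross
newlines would also find both elements here, but that is already a
step toward treating the input as a document rather than as lines.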

    Mike Maxwell
