[Corpora-List] XML parsers vs regex

Phil Gooch philgooch at gmail.com
Mon Jun 30 17:05:08 UTC 2014


Regex can be useful as a fallback if you are processing XML from an unknown
source or of dubious quality and you just want to extract a few fields.

E.g. if you're pulling in TEI XML that fails to parse, maybe catch the
exception and then fall back to a regex to extract a title, author,
abstract etc.
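
A minimal sketch of that fallback in Python, assuming the standard-library
xml.etree.ElementTree parser and the usual TEI namespace. The function names,
the FIELDS tuple and the regex pattern are illustrative assumptions, not an
established recipe. The regex path is deliberately crude: it takes the first
matching element and does not decode entities.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: documents use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# Hypothetical choice of fields, taken from the examples in the message.
FIELDS = ("title", "author", "abstract")


def _clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_with_parser(xml_text):
    """Preferred path: a real XML parser (raises ET.ParseError if malformed)."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in FIELDS:
        node = root.find(f".//tei:{name}", TEI_NS)
        result[name] = _clean("".join(node.itertext())) if node is not None else None
    return result


def extract_with_regex(xml_text):
    """Fallback path: grab the first <name>...</name> span and strip inner tags."""
    result = {}
    for name in FIELDS:
        match = re.search(rf"<{name}\b[^>]*>(.*?)</{name}\s*>", xml_text, re.DOTALL)
        if match:
            inner = re.sub(r"<[^>]+>", " ", match.group(1))  # drop nested markup
            result[name] = _clean(inner)
        else:
            result[name] = None
    return result


def extract_fields(xml_text):
    """Try the parser first; fall back to regex only when parsing fails."""
    try:
        return extract_with_parser(xml_text)
    except ET.ParseError:
        return extract_with_regex(xml_text)
```

Calling extract_fields() on a well-formed TEI string goes through the parser;
a string with, say, an unclosed tag raises ParseError internally and is
handled by the regex path instead, with None for any field it cannot find.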

Phil


On Mon, Jun 30, 2014 at 5:44 PM, Piotr Bański <bansp at o2.pl> wrote:

> Dear Matías,
>
> This topic sometimes briefly surfaces on the xml-dev list, before it
> goes up in flames. You might want to check the archives at:
>
> http://lists.xml.org/archives/xml-dev/
>
> In most cases, the reply, naturally, begins with "it depends". For
> trivial cases involving simple embedded markup, why not use regex? For
> more complex cases, consider both the complexity of the source (see
> Mike Maxwell's reply for starters) and the complexity of what you want
> to retrieve. And please do not forget to make your queries portable
> and verifiable/readable for others, especially those of us who aren't
> regex geeks.
>
> Best regards,
>
>   Piotr
>
> On 30/06/14 13:55, Matías Guzmán Naranjo wrote:
> > Dear all,
> >
> > When working with XML-tagged corpora I have always used regex to extract
> > the information I need; I have never used XML parsers like NLTK's or any
> > other. Is there an advantage to using parsers over regex? If so, which?
> > What do you personally use?
> >
> > Best,
> >
> > Matías
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >