[Corpora-List] XML parsers vs regex

Darren Cook darren at dcook.org
Mon Jun 30 14:13:30 UTC 2014


> I'm just going to leave this here.
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

(Thanks, I'd not seen that before.)

But apart from that, using regexes is normally fine for corpora,
*unless* the markup is nested. Regexes get hard for XML/HTML when you
want to say things like: find all text in <b> tags that sit inside a
<div> of class="this". CSS selectors and XPath cope with that just fine.
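
A minimal sketch of that kind of query, using the limited XPath subset in
Python's standard xml.etree.ElementTree. The sample markup is made up for
illustration, and it assumes well-formed input and a class attribute that
is exactly "this" (real-world HTML often needs a more forgiving parser):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; not taken from any real corpus.
doc = """<body>
  <div class="this"><p>Some <b>bold</b> and <i>nested <b>deeper</b></i></p></div>
  <div class="other"><b>ignored</b></div>
  <b>outside</b>
</body>"""

root = ET.fromstring(doc)

# Any <b> at any depth below a <div> whose class attribute equals "this".
hits = ["".join(b.itertext())
        for b in root.findall(".//div[@class='this']//b")]

print(hits)
```

The path expression states the "inside a div" condition directly, which is
the part that gets awkward in a regex once the depth is not fixed.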

E.g. if your document looks like this, I'd rather use a regex to find
the proper nouns:

  I am off to <place>London</place> <date>tomorrow</date>, and then
  <place>Cambridge</place> with <person>Mary</person> the <date>week
  after</date>.
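
For flat markup like that sentence, a regex along these lines is enough. It
is only a sketch: it assumes the tags never nest and carry no attributes,
and treating <place> and <person> as the proper nouns is my reading of the
example.

```python
import re

text = ("I am off to <place>London</place> <date>tomorrow</date>, and then "
        "<place>Cambridge</place> with <person>Mary</person> the <date>week "
        "after</date>.")

# The backreference \1 forces the closing tag to match the opening one.
# DOTALL lets the tagged text span a line break.
TAGGED = re.compile(r"<(place|person)>(.*?)</\1>", re.DOTALL)

proper_nouns = [(m.group(1), m.group(2)) for m in TAGGED.finditer(text)]

print(proper_nouns)
```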

If, on the other hand, *every* word were tagged, the tags used
attributes for part of speech, and clauses and sentences were each
wrapped in their own tags, I'd use an XML parser.
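
A rough sketch of that second case, again with xml.etree.ElementTree. The
element names (<s>, <cl>, <w>), the pos attribute and the tag values are
hypothetical, chosen just to show the shape of the code:

```python
import xml.etree.ElementTree as ET

# Hypothetical fully tagged corpus: sentence > clause > word, POS in an attribute.
corpus = """<text>
  <s id="1">
    <cl><w pos="PRP">I</w> <w pos="VBP">am</w> <w pos="RP">off</w> <w pos="TO">to</w> <w pos="NNP">London</w></cl>
    <cl><w pos="CC">and</w> <w pos="RB">then</w> <w pos="NNP">Cambridge</w> <w pos="IN">with</w> <w pos="NNP">Mary</w></cl>
  </s>
</text>"""

root = ET.fromstring(corpus)

# Collect (sentence id, clause number, word) for every proper noun.
proper_nouns = []
for s in root.iter("s"):
    for clause_no, cl in enumerate(s.iter("cl"), start=1):
        for w in cl.findall("w[@pos='NNP']"):
            proper_nouns.append((s.get("id"), clause_no, w.text))

print(proper_nouns)
```

The parser hands you the sentence and clause context for free, which is
what you would otherwise be reconstructing by hand with regexes.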

Darren