Corpora: learning regular expressions: responses

Chapman, Wendy chapman at cbmi.upmc.edu
Tue Dec 12 14:27:39 UTC 2000


Dear Corpora members,

Thank you for the responses to the question about learning regular
expressions that I posted a few weeks ago. I have included below all the
responses I received on the subject.

Wendy Webber Chapman


____________________________________________________________________________


Stephen Soderland's system WHISK learns regular expressions for
information extraction.  It's implemented in Perl.  He's published an
article on it in Machine Learning and had a paper at KDD.  You could get
more information from either of those sources.

Mary Elaine Califf

_________________________________________________________________________
  You can download a Unix version of Brill's original POS tagger, which
uses the same TBL algorithm that he modified in his paper at EMNLP. I was
very excited by this most recent version of the TBL algorithm, so I can
understand your interest in it.
 Try:

http://www.cs.jhu.edu/~brill/code.html
 or if you prefer ftp:

ftp://ftp.cs.jhu.edu/pub/brill/Programs/
  There is a port of the Brill tagger to Windows, done by some French
folks, but it never worked well for me, and it doesn't come with source
code. The original, on the other hand, is open-source software, written in
C, and it depends pretty heavily on the Unix OS for memory management.
 Most folks using this and similar algorithms are looking for high
precision and recall rates for POS tagging or parsing, and usually aren't
very eager to take on ambiguities, except insofar as a parser will give
good rankings of ambiguous parses.
  Good luck!

-Mike

   vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   >< Michael O'Connell                   ><
   >< http://ucsu.colorado.edu/~oconnelm  ><
   >< University of Colorado - Boulder    ><
   >< CB 295           Boulder, CO 80309  ><
   >< Hellems 285      303.492.1623       ><
   vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

____________________________________________________________________________

Regular expression pattern learning has been fairly well-mined by the
'formal' machine learning community.  You might want to look, e.g., at
the (old!) paper by Sam Pilato and me in the J. Machine Learning, 1985,
where we used a method developed by Dana Angluin at Yale - it essentially
does what's called k-tail merger of the finite-state equivalence classes.
Alas, our very old Lisp implementation is no longer around, though the
paper has pseudo-code that should suffice.  You might want to track down Sam
Pilato.
This is a restrictive variant of an approach that was, to the best of my
knowledge, employed by (even older!) work in the 60s by Solomonoff and many
others to learn regular expressions.  Angluin has some nice formal results
on the difference in computational complexity between learning regular
expressions vs. FSAs, etc.
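
For readers unfamiliar with the term, the k-tail merging mentioned above can
be illustrated with a minimal Python sketch. It follows the classic textbook
form of k-tails (build a prefix-tree acceptor from positive examples, then
merge states that share the same set of accepted suffixes of length at most
k). It is an assumption-laden illustration, not the method or the pseudo-code
from the paper, and all function names here are hypothetical.

```python
from collections import defaultdict


def build_prefix_tree(samples):
    """Prefix-tree acceptor: state 0 is the root, one state per distinct prefix."""
    trans = [{}]          # trans[state][symbol] -> next state
    accepting = set()
    for word in samples:
        state = 0
        for sym in word:
            if sym not in trans[state]:
                trans.append({})
                trans[state][sym] = len(trans) - 1
            state = trans[state][sym]
        accepting.add(state)
    return trans, accepting


def k_tail(trans, accepting, state, k):
    """Suffixes of length <= k that lead from `state` to an accepting state."""
    tails = set()
    if state in accepting:
        tails.add("")
    if k > 0:
        for sym, nxt in trans[state].items():
            for rest in k_tail(trans, accepting, nxt, k - 1):
                tails.add(sym + rest)
    return frozenset(tails)


def merge_by_k_tails(samples, k):
    """Merge prefix-tree states with identical k-tails; the result may be an NFA."""
    trans, accepting = build_prefix_tree(samples)
    block_ids = {}        # k-tail set -> merged state id
    block_of = {}         # original state -> merged state id
    for s in range(len(trans)):
        key = k_tail(trans, accepting, s, k)
        block_of[s] = block_ids.setdefault(key, len(block_ids))
    delta = defaultdict(set)   # (merged state, symbol) -> set of merged states
    for s, edges in enumerate(trans):
        for sym, nxt in edges.items():
            delta[(block_of[s], sym)].add(block_of[nxt])
    return {
        "start": block_of[0],
        "accepting": {block_of[s] for s in accepting},
        "delta": dict(delta),
        "n_states": len(block_ids),
    }


def accepts(automaton, word):
    """Simulate the (possibly nondeterministic) merged automaton on `word`."""
    current = {automaton["start"]}
    for sym in word:
        nxt = set()
        for state in current:
            nxt |= automaton["delta"].get((state, sym), set())
        if not nxt:
            return False
        current = nxt
    return bool(current & automaton["accepting"])


# Three positive examples of the pattern (ab)+, merged with k = 1.
model = merge_by_k_tails(["ab", "abab", "ababab"], k=1)
print(model["n_states"])            # 3
print(accepts(model, "abababab"))   # True
print(accepts(model, "aba"))        # False
```

With k = 1 the seven prefix-tree states collapse into three, and the merged
automaton generalizes from the three examples to the whole pattern (ab)+.
Larger k merges fewer states and generalizes more cautiously.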
Hope this is of some help,
Best regards, Prof. Bob Berwick
Professor Robert C. Berwick
[berwick at ai.mit.edu]

____________________________________________________________________________

If I understand what you need, maybe we have something useful for you.
We have an algorithm (LocalMaxs) that extracts multi-word units from
text of any language. For example: Human Rights, Universal Declaration
of Human Rights, as soon as possible, plus au moins, raining cats and
dogs, Yasser Arafat, Issac Rabin, etc.

Joaquim Ferreira da Silva
jfs at di.fct.unl.pt


