Corpora: Tgrep2

Wed May 23 16:20:38 UTC 2001

The readers of this list may be interested in a new tool, tgrep2, that I
have developed for searching parsed corpora such as those included in
the Penn Treebank.

As the name might suggest, tgrep2 is based on tgrep and is largely
backward compatible.  However, tgrep2 adds a number of new features,
including the following major enhancements:

 * Rather than simply having a set of required relationships and a set
   of prohibited relationships, nodes can have full boolean expressions
   of relationships to other nodes.
 * Nodes can be given unique labels and may then be referred to by those
   labels in the pattern specification or in selecting trees for
   printing.
 * Patterns are no longer restricted to simple tree architectures. The
   use of node labels and segmented patterns allows links in a pattern
   to form back-edges as well, permitting cycles of links.
 * Customizable output formats allow a variety of information to be
   reported in a flexible manner.
 * Multiple search patterns may be specified and one can retrieve the
   first subtree matching any pattern, the first subtree matching each
   pattern, or all subtrees matching all patterns.
 * Subtrees can be reported using a code rather than by printing the
   whole structure. The trees themselves can later be retrieved using
   the codes.
 * A variety of new links have been added and the immediately-precedes
   link now has a more conventional meaning.
 * Tgrep2 corpus files are substantially smaller than tgrep corpora.
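
To make the first enhancement concrete, here is a rough Python sketch of
what a full boolean expression of relationships buys over separate
required and prohibited sets. It is not tgrep2 syntax and not how tgrep2
is implemented; the Node class and the helpers (has_child, dominates,
all_of, any_of, negate, find) are hypothetical names made up for this
illustration, and the tree is a hand-built toy rather than Treebank data.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """A toy parse-tree node: a category label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def descendants(self) -> Iterator["Node"]:
        # Pre-order walk over everything below this node.
        for child in self.children:
            yield child
            yield from child.descendants()

    def subtrees(self) -> Iterator["Node"]:
        # This node followed by all of its descendants.
        yield self
        yield from self.descendants()


# A relationship is modelled as a predicate on a node (illustration only).
Pred = Callable[[Node], bool]


def has_child(label: str) -> Pred:
    return lambda n: any(c.label == label for c in n.children)


def dominates(label: str) -> Pred:
    return lambda n: any(d.label == label for d in n.descendants())


def all_of(*preds: Pred) -> Pred:
    return lambda n: all(p(n) for p in preds)


def any_of(*preds: Pred) -> Pred:
    return lambda n: any(p(n) for p in preds)


def negate(pred: Pred) -> Pred:
    return lambda n: not pred(n)


def find(root: Node, label: str, pred: Pred) -> List[Node]:
    """Return every subtree with the given label that satisfies pred."""
    return [n for n in root.subtrees() if n.label == label and pred(n)]


# (S (NP DT NN) (VP VBD (NP (NP DT NN) (PP IN (NP DT NN)))))
tree = Node("S", [
    Node("NP", [Node("DT"), Node("NN")]),
    Node("VP", [
        Node("VBD"),
        Node("NP", [
            Node("NP", [Node("DT"), Node("NN")]),
            Node("PP", [Node("IN"), Node("NP", [Node("DT"), Node("NN")])]),
        ]),
    ]),
])

# "An NP that has a PP child or a DT child, and does not dominate IN."
# The nested OR cannot be written as one required set plus one
# prohibited set.
pattern = all_of(any_of(has_child("PP"), has_child("DT")),
                 negate(dominates("IN")))
matches = find(tree, "NP", pattern)
```

With this toy tree, the three simple NPs satisfy the pattern, while the
complex NP containing the PP is rejected because it dominates IN.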

More information and the tgrep2 software can be found at the following
site:

http://www.cs.cmu.edu/~dr/Tgrep2/

Doug Rohde
Carnegie Mellon University