Corpora: Tgrep2
Douglas Rohde
dr+ at cs.cmu.edu
Wed May 23 16:20:38 UTC 2001
The readers of this list may be interested in a new tool, tgrep2, that I
have developed for searching parsed corpora such as those included in
the Penn Treebank.
As the name might suggest, tgrep2 is based on tgrep and is largely
backward compatible. However, tgrep2 adds a number of new features,
including the following major enhancements:
* Rather than simply having a set of required relationships and a set of
prohibited relationships, nodes can have full boolean expressions of
relationships to other nodes.
* Nodes can be given unique labels and may then be referred to by those
labels in the pattern specification or in selecting trees for
printing.
* Patterns are no longer restricted to simple tree architectures. The use
of node labels and segmented patterns allows links in a pattern to form
back-edges as well, permitting cycles of links.
* Customizable output formats allow a variety of information to be
reported in a flexible manner.
* Multiple search patterns may be specified and one can retrieve the
first subtree matching any pattern, the first subtree matching each
pattern, or all subtrees matching all patterns.
* Subtrees can be reported using a code rather than by printing the whole
structure. The trees themselves can later be retrieved using the codes.
* A variety of new links have been added and the immediately-precedes
link now has a more conventional meaning.
* Tgrep2 corpus files are substantially smaller than tgrep corpora.
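To make the first enhancement concrete, here is a rough Python sketch of what
"full boolean expressions of relationships" means for a single node. It is not
tgrep2 pattern syntax and not how tgrep2 is implemented; the names Node, parse,
has_child, any_of, all_of and negate are hypothetical helpers made up for this
illustration, and the example sentence is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def parse(text):
    """Parse one bracketed tree, e.g. "(NP (DT the) (NN dog))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            node = Node(tokens[pos])      # constituent label
            pos += 1
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1                      # skip ")"
            return node
        leaf = Node(tokens[pos])          # a bare word
        pos += 1
        return leaf

    return read()


def subtrees(node):
    """Yield the node and every node below it, in pre-order."""
    yield node
    for child in node.children:
        yield from subtrees(child)


# One relationship ("immediately dominates a node labelled X") ...
def has_child(label):
    return lambda node: any(c.label == label for c in node.children)


# ... and boolean combinators over relationships (hypothetical helpers).
def any_of(*preds):
    return lambda node: any(p(node) for p in preds)


def all_of(*preds):
    return lambda node: all(p(node) for p in preds)


def negate(pred):
    return lambda node: not pred(node)


tree = parse("(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (NNS cats))))")

# NP nodes that have an NN or NNS child AND do not have a DT child.
wanted = all_of(any_of(has_child("NN"), has_child("NNS")),
                negate(has_child("DT")))
matches = [n for n in subtrees(tree) if n.label == "NP" and wanted(n)]
```

With only one list of required and one list of prohibited relationships, the
"NN or NNS" alternative above could not be written as a single condition on the
node; nesting "or" inside "and" is what a full boolean expression allows.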
More information and the tgrep2 software can be found at the following
site:
http://www.cs.cmu.edu/~dr/Tgrep2/
Doug Rohde
Carnegie Mellon University