[Corpora-List] Treebank annotation tools
Yannick Versley
versley at sfs.uni-tuebingen.de
Wed Aug 29 13:57:41 UTC 2007
Hello Joakim,
I'm not sure if the Xcdg tool created in Wolfgang Menzel's group in Hamburg
would fit your bill, since it has always been a bit cumbersome to install and
is tied to a specific formalism for implementing dependency grammars (WCDG).
But it has some unique features that I feel are worth mentioning, especially
the really nice integration with the grammar.
> * Search: The tool should enable annotators to search for (complex)
> patterns involving both the primary linguistic data and various levels
> of annotation. (Examples: NPs within NPs, NPs headed by a proper name,
> NPs headed by "picture", NPs as subjects, verbs with two subjects.)
As far as I know, Xcdg does not support search directly, but there was a
separate search engine that would use CDG formulae (i.e., specifications over
edges and lexical entries, with existential operators thrown in), and its
results could be fed into an Xcdg display somehow.
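To give a rough idea of what "a formula over edges with an existential
operator" amounts to, here is a minimal Python sketch for one of your examples
(verbs with two subjects). This is not CDG syntax and not the Hamburg search
engine; the Edge and Sentence classes, the "SUBJ" label and the convention that
verb tags start with "V" are all assumptions made up for the illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass(frozen=True)
class Edge:
    dep: int    # position of the dependent word (1-based)
    head: int   # position of the head word (0 = artificial root)
    label: str  # dependency label; "SUBJ" below is a hypothetical label name


@dataclass
class Sentence:
    words: List[str]
    pos: List[str]      # one part-of-speech tag per word
    edges: List[Edge]


PairPredicate = Callable[[Sentence, Edge, Edge], bool]


def exists_pair(sent: Sentence, pred: PairPredicate) -> bool:
    """Existential quantification over pairs of distinct edges."""
    return any(pred(sent, e1, e2) for e1, e2 in combinations(sent.edges, 2))


def verb_with_two_subjects(sent: Sentence, e1: Edge, e2: Edge) -> bool:
    """True if both edges are subject edges attached to the same verb."""
    return (
        e1.label == "SUBJ"
        and e2.label == "SUBJ"
        and e1.head == e2.head
        and e1.head > 0  # the artificial root has no POS tag
        and sent.pos[e1.head - 1].startswith("V")  # assumed tag convention
    )


def search(corpus: List[Sentence], pred: PairPredicate) -> List[Sentence]:
    """Return every sentence in which some pair of edges satisfies pred."""
    return [s for s in corpus if exists_pair(s, pred)]
```

The hits returned by search() would then be what gets handed to the display.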
> * Display: The tool should be able to display annotated sentences
> graphically, in particular as the result of a search query.
See the screenshot at
http://nats-www.informatik.uni-hamburg.de/view/CDG/ScreenShots
> * Editing: The tool should enable annotators to edit the annotation in a
> "flexible and efficient manner", preferably by direct manipulation of
> graphically displayed annotation.
You can change labels, drag edges around and select appropriate lexical
entries. Most convenient is that you can right-click on an edge or a label
and have the tool change the parent or the label to what it thinks is the
right one (according to the grammar).
> * Validation: The tool should validate that the edited annotation conforms
> to the formal specifications of the annotation scheme. Minimally, this
> should imply that only valid annotation categories ("tags") are used,
> but it is desirable that also more global and/or structural constraints
> can be expressed and validated. (Examples: Every word must have a
> part-of-speech tag, every phrase must have a head, every dependency
> graph must have a unique root or must be projective.)
Since the annotation tool is tightly integrated with the grammar, all
invariants encoded in the grammar are also checked by the editor (e.g.,
projectivity in the cases where it is required, acyclicity, verb arguments).
Unlike in LFG annotation environments, it is still possible to create
structures disfavored or disallowed by the grammar, which is very useful when
you run into a construct that isn't covered by your grammar.
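For concreteness, the structural invariants mentioned here (unique root,
acyclicity, projectivity) can be checked along the following lines. This is
only a minimal Python sketch, not how Xcdg does it: in Xcdg these checks come
out of the grammar's constraints, whereas the function name, the head-map
representation and the "0 = root" convention below are assumptions of mine.
Like the editor, it reports violations rather than refusing the structure.

```python
from typing import Dict, List

ROOT = 0  # assumed convention: a word whose head is 0 hangs off the root


def find_violations(heads: Dict[int, int],
                    require_projective: bool = True) -> List[str]:
    """Check a dependency analysis given as {word position: head position}.

    Positions are 1-based; every head is assumed to be ROOT or another key.
    Returns a list of human-readable violations (empty if all checks pass).
    """
    violations: List[str] = []

    # Unique root: exactly one word may be attached to the artificial root.
    roots = [w for w, h in heads.items() if h == ROOT]
    if len(roots) != 1:
        violations.append(f"expected exactly one root, found {len(roots)}")

    # Acyclicity: following head links from any word must end at ROOT.
    for w in heads:
        seen = set()
        cur = w
        while cur != ROOT:
            if cur in seen:
                violations.append(f"cycle reachable from word {w}")
                break
            seen.add(cur)
            cur = heads[cur]

    # Projectivity: no two edges may cross when drawn above the sentence.
    if require_projective:
        spans = [(min(w, h), max(w, h)) for w, h in heads.items()]
        for i, (a, b) in enumerate(spans):
            for c, d in spans[i + 1:]:
                if a < c < b < d or c < a < d < b:
                    violations.append(
                        f"crossing edges over spans ({a},{b}) and ({c},{d})"
                    )

    return violations
```

Calling find_violations({1: 2, 2: 0, 3: 2}) on a three-word sentence headed by
its second word returns an empty list; passing require_projective=False skips
the crossing-edge check for constructions where projectivity is not required.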
> * Documentation: The tool should support documentation of the annotation
> process, such as time stamping of edits, information about what parts of
> an annotation has been checked and validated, statistics on editing
> operations, etc.
Currently nonexistent, I think.
> * Standards: The tool should support the use of (well-documented)
> standards for corpus annotation (TEI, (X)CES, LAF, ...) or allow the
> user to define such standards in, e.g., XML.
There is a proprietary (but simple and well-documented) annotation format,
which also has an XML variant. It's still pretty much nonstandard.
> * Interfaces: The tool should interface flexibly with other tools involved
> in the treebank development process, in particular taggers and parsers
> used for automatic annotation.
The whole point of Xcdg is the integration of the parser - you have a good
integration of the parser and any components it uses, including the PP
attacher and the POS tagger; you can view the violated grammar constraints and
thus gain insights into the reasons why the parser would prefer one parse or
the other. You can also parse sentences inside Xcdg, although this is somewhat
pointless.
> * Specificity: The tool should have tailor-made support for treebank
> annotation, possibly at the expense of not supporting linguistic
> annotation of arbitrary complexity.
Xcdg is not useful for anything beyond treebank development. The screenshots
show some screens for editing hierarchies, but as far as I know, this is not
really used by anybody.
In terms of treebank development, Xcdg has been used to annotate a large
German dependency treebank covering multiple genres (I don't have exact
figures at hand, but I think the whole thing has more sentences than Tiger...)
There is an Xcdg manual available on the page
http://nats-www.informatik.uni-hamburg.de/view/CDG/CdgManuals
and there is useful support through a mailing list (I think it was the
cdg at nats.informatik.uni-hamburg.de one).
Best,
Yannick
--
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora