[Corpora-List] Treebank annotation tools

Yannick Versley versley at sfs.uni-tuebingen.de
Wed Aug 29 13:57:41 UTC 2007


Hello Joakim,

I'm not sure if the Xcdg tool created in Wolfgang Menzel's group in Hamburg 
would fit the bill, since it has always been a bit cumbersome to install and 
is tied to a specific formalism for implementing dependency grammars (WCDG).
But it has some unique features that I feel are worth mentioning, especially 
the really nice integration with the grammar.

> * Search: The tool should enable annotators to search for (complex) 
>   patterns involving both the primary linguistic data and various levels 
>   of annotation. (Examples: NPs within NPs, NPs headed by a proper name, 
>   NPs headed by "picture", NPs as subjects, verbs with two subjects.)
As far as I know, Xcdg does not support search directly, but there was a 
separate search engine that took CDG formulae (i.e., specifications over 
edges and lexical entries, with existential operators thrown in) and whose 
results could somehow be fed into an Xcdg display.
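
To make this kind of query concrete, here is a minimal Python sketch of one 
of the example patterns ("verbs with two subjects") run over a plain 
dependency tree. It is not CDG formula syntax and not part of Xcdg; the 
Token fields, the "SUBJ" label and the "V" tag prefix are assumptions made 
up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    index: int   # 1-based position in the sentence
    form: str    # word form
    pos: str     # part-of-speech tag (hypothetical tagset)
    head: int    # index of the governing token, 0 for the root
    label: str   # label of the edge to the head (hypothetical label set)


def verbs_with_two_subjects(sentence, subj_label="SUBJ", verb_prefix="V"):
    """Return the verbal tokens that govern more than one subject edge."""
    subjects = defaultdict(list)
    for tok in sentence:
        if tok.label == subj_label:
            subjects[tok.head].append(tok)
    by_index = {tok.index: tok for tok in sentence}
    return [
        by_index[head]
        for head, deps in sorted(subjects.items())
        if len(deps) > 1
        and head in by_index
        and by_index[head].pos.startswith(verb_prefix)
    ]


# A deliberately broken annotation: two subject edges on one verb.
example = [
    Token(1, "Peter", "NE", 3, "SUBJ"),
    Token(2, "he", "PPER", 3, "SUBJ"),
    Token(3, "sleeps", "VVFIN", 0, "S"),
]
hits = verbs_with_two_subjects(example)
```

A query like this is mostly useful for finding annotation errors, which is 
why the hits would ideally be opened in the editor afterwards.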
> * Display: The tool should be able to display annotated sentences 
>   graphically, in particular as the result of a search query.
See the screenshot at
http://nats-www.informatik.uni-hamburg.de/view/CDG/ScreenShots
> * Editing: The tool should enable annotators to edit the annotation in a 
>   "flexible and efficient manner", preferably by direct manipulation of 
>   graphically displayed annotation.
You can change labels, drag edges around and select appropriate lexical 
entries. The most convenient feature is that you can right-click on an edge 
or a label, and the tool changes the parent or the label to what it thinks 
is the right one (according to the grammar).
> * Validation: The tool should validate that the edited annotation conforms 
>   to the formal specifications of the annotation scheme. Minimally, this 
>   should imply that only valid annotation categories ("tags") are used, 
>   but it is desirable that also more global and/or structural constraints 
>   can be expressed and validated. (Examples: Every word must have a 
>   part-of-speech tag, every phrase must have a head, every dependency 
>   graph must have a unique root or must be projective.)
Since the annotation tool is tightly integrated with the grammar, all 
invariants encoded in the grammar are also checked by the editor (e.g., 
projectivity in the cases where it is required, acyclicity, verb arguments).
It is still possible to create structures disfavored or disallowed by the 
grammar, unlike in LFG annotation environments, which is very useful when 
you run into a construct that isn't covered by your grammar.
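
For readers who want to see what such structural checks amount to, the 
following is a minimal Python sketch of three of the invariants named above 
(unique root, acyclicity, projectivity). It is not how Xcdg implements them; 
in Xcdg they come from the grammar. The encoding is an assumption: heads[i] 
holds the head of token i+1, and 0 marks the root. Like the editor, it only 
reports problems and does not forbid the structure.

```python
def find_roots(heads):
    """Return the 1-based indices of all tokens attached to the root (0)."""
    return [i for i, h in enumerate(heads, start=1) if h == 0]


def has_cycle(heads):
    """Follow the head chain from every token; revisiting a token is a cycle."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node - 1]
    return False


def is_projective(heads):
    """Every token between a head and its dependent must descend from the head.

    Assumes the graph is acyclic; call has_cycle() first.
    """
    def dominated_by(node, ancestor):
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue
        low, high = sorted((dep, head))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


def validate(heads, require_projective=True):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    roots = find_roots(heads)
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    if has_cycle(heads):
        problems.append("graph contains a cycle")
    elif require_projective and not is_projective(heads):
        problems.append("graph is not projective")
    return problems
```

The require_projective switch mirrors the point that projectivity is only 
demanded in the cases where the annotation scheme requires it.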
> * Documentation: The tool should support documentation of the annotation 
>   process, such as time stamping of edits, information about what parts of 
>   an annotation has been checked and validated, statistics on editing 
>   operations, etc. 
Currently nonexistent, I think.
> * Standards: The tool should support the use of (well-documented) 
>   standards for corpus annotation (TEI, (X)CES, LAF, ...) or allow the 
>   user to define such standards in, e.g., XML.
There is a proprietary (but simple and well-documented) annotation format, 
which also has an XML variant. It's still pretty much nonstandard.
> * Interfaces: The tool should interface flexibly with other tools involved 
>   in the treebank development process, in particular taggers and parsers 
>   used for automatic annotation. 
The whole point of Xcdg is its integration with the parser and with any 
components the parser uses, such as the PP attacher and the POS tagger; you 
can view the violated grammar constraints and thus gain insight into why the 
parser prefers one parse over another.
You can also parse sentences inside Xcdg, although this is somewhat pointless.
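
As an illustration of how a list of violated constraints explains a 
preference, here is a small Python sketch. It assumes a weighted-constraint 
scheme in which every violated constraint contributes a penalty factor 
between 0 and 1 and the score of a parse is the product of those factors; 
that scoring rule, the constraint names and the penalties are assumptions 
for the example and are not taken from the Xcdg documentation.

```python
from math import prod


def score(violations):
    """Multiply the penalties of all violated constraints (1.0 if none)."""
    return prod(penalty for _, penalty in violations)


def explain_preference(parses):
    """Pick the best-scoring parse and say what the others violate on top.

    parses maps a parse name to a list of (constraint_name, penalty) pairs.
    Returns the name of the best parse and, for every other parse, the
    sorted constraint names it violates that the best parse does not.
    """
    scores = {name: score(violations) for name, violations in parses.items()}
    best = max(scores, key=scores.get)
    best_constraints = {name for name, _ in parses[best]}
    report = {
        parse: sorted(c for c, _ in violations if c not in best_constraints)
        for parse, violations in parses.items()
        if parse != best
    }
    return best, report


# Hypothetical PP-attachment ambiguity with made-up constraints and penalties.
candidates = {
    "PP attached to verb": [("pp_distance", 0.9)],
    "PP attached to noun": [("pp_distance", 0.9), ("pp_preference", 0.5)],
}
best, report = explain_preference(candidates)
```

Here the verb attachment wins, and the report shows that the noun attachment 
loses only because of the extra pp_preference violation.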
> * Specificity: The tool should have tailor-made support for treebank 
>   annotation, possibly at the expense of not supporting linguistic 
>   annotation of arbitrary complexity.
Xcdg is not useful for anything beyond treebank development. The screenshots 
show some screens for editing hierarchies, but as far as I know, this is not 
really used by anybody.
In terms of treebank development, Xcdg has been used to annotate a large 
German dependency treebank covering multiple genres (I don't have exact 
figures at hand, but I think the whole thing has more sentences than Tiger...)

There is an Xcdg manual available on the page
http://nats-www.informatik.uni-hamburg.de/view/CDG/CdgManuals
and there is useful support through a mailing list (I think it was the
cdg at nats.informatik.uni-hamburg.de one).

Best,
Yannick
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


