[Corpora-List] trying to parse crossing dependencies

Mon Oct 6 08:28:33 UTC 2008

Hi,

>  I always wondered about how do you parse trennbare Verben in German,
> which to me, even if not the same kind of problem, does relate to
> "crossing dependencies" somehow
Parsing separable verbs, but also topicalized objects in German, or relative 
clauses in most languages, is usually difficult and people have come up with 
different solutions which all somehow work but don't really look completely 
satisfying.
* You can just ignore the problem by making everything projective and sort out 
everything after your parser has done the grunt work. This is the common 
approach when people do PCFG parsing on e.g. the Tiger treebank - they just 
reattach the offending nodes to a higher level and pretend nothing has 
happened.
The MALT dependency parser (Nivre&Nilsson 2005) reattaches edges and puts 
different labels on them so a post-processing step can reattach things to 
where they came from. This is more honest than the former since, unlike using 
PARSEVAL metrics on the transformed structures, you can actually compare the 
parse+postprocessing result directly to what you aimed at.
* There is the biting-the-bullet approach of just doing non-projective parsing 
by either solviong a harder problem or doing a coarser approximation.
The WCDG approach (Menzel&Schröder 1998, Foth/Daum/Menzel 2004) simply puts 
nonprojective parsing into a weighted constraint satisfaction problem, where 
you can posit arbitrary constraints on parse trees. This usually means that 
efficiently solving this problem requires a considerable engineering effort.
A method using a more general-purpose constraint server can be found in 
(Riedel&Clarke, 2006)
A slightly less ambitious version of this would be (McDonald&Pereira 2006), 
who first use a projective parsing algorithm to find a reasonable parse and 
then use hill-climbing to re-attach non-projective edges.
And there's also an approach of using an O(n^2) spanning tree algorithm to 
directly get a non-projective dependency tree, as in (McDonald&Pereira 2005).
The obvious problem of this approach is that you cannot represent dependencies 
between edges; so if you want to keep the algorithm not to attach two 
determiners or two subject to one word, you have to do this with approximate, 
indirect means (aka "some fumbling required").
There are people who try to base their formalism on parsing algorithms that 
cover *exactly* the class of mildly-context-sensitive grammars, which are 
parsable in O(n^6) worst case time [which is still dreadful, but sounds much 
nicer than the exponential worst-case in the case of general constraint 
satisfaction]. See e.g. (Lichte&Kallmeyer 2008) for one approach in German.

For handling separable verbs, there are different approaches:
* one would be to get an approximate idea of clause structure (through clause 
chunking) and gather the parts of the separable verbs to put them together 
and look them up. This is quite obviously a 90% solution, but it works well 
enough in cases where you want the verb lemma, but can't be bothered to do 
full parsing.
* In the WCDG approach, you *can* have a constraint that looks at the verb 
prefix and does the right thing. For practical reasons (as far as I know), 
the handling of verb valencies is implemented approximately as a kind 
of "this verb can have a dative object when it occurs without a prefix, and 
an accusative and/or a dative object").
* In HPSG implementations (e.g. Müller 2002 
http://hpsg.fu-berlin.de/~stefan/Pub/paradox.html for an overview) 
usually treat the separable prefix as an argument to the verb, which means 
that there are lots of different lexicon entries for one surface verb form 
(i.e., "hat" has lexicon entries for vor#haben, an#haben, auf#haben, etc.)

Separable prefixes are a different class of problems than the worst 
non-projectivities (the ones which make the mild context sensitivity 
necessary), however, since they can be reasonably treated with a (non-Xbar) 
PCFG grammar.

Best wishes,
Yannick Versley
-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora