[Corpora-List] Metrics for corpus "parseability"

Tue Feb 5 13:26:11 UTC 2008

On Mon, Feb 04, 2008 at 11:27:56PM +0100, Sandra Kuebler wrote:

> There is related work about the ambiguity of grammars induced from
> treebanks. Anna Corazza, Alberto Lavelli, and Giorgio Satta used
> conditional cross entropy for that. This may help to at least
> abstract away from the parser :)

Sandra, thanks for mentioning our work related to the part of the
original request concerning parsing.  The work is described in a paper
submitted to a journal and currently under revision.  We plan to make
the current draft available on our web site soon.  The abstract is below.

best
	alberto

---------------

Measuring Parsing Difficulty Across Treebanks
Anna Corazza, Alberto Lavelli and Giorgio Satta

Abstract
One of the main difficulties in statistical parsing is associated with
the task of choosing the correct parse tree for the input sentence,
among all possible parse trees allowed by the adopted grammar model.
While this difficulty is usually evaluated by means of empirical
performance measures, such as labeled precision and recall, several
theoretical measures have also been proposed in the literature, mostly
based on the notion of cross-entropy of a treebank.  In this article
we show how cross-entropy can be misleading to this end.  We propose
an alternative theoretical measure, called the expected conditional
cross-entropy (ECC), which can be approximated through the inverse and
normalized conditional log-likelihood of a treebank, relative to some
model.
We conjecture that the ECC provides a measure of the informativeness
of a treebank, in such a way that more informative treebanks are
easier to parse under the chosen model.  We test our conjecture by
comparing ECC values against standard performance measures across
several treebanks for English, French, German and Italian, as well as
other treebanks with different degrees of ambiguity and
informativeness, obtained by means of artificial transformations of a
source treebank.  All of our experiments show the effectiveness of the
ECC in characterizing parsing difficulty across different treebanks,
making treebank comparison possible.
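
As a rough illustration of the approximation mentioned in the abstract
(the negated, normalized conditional log-likelihood of a treebank under
some model), here is a minimal sketch.  It is not the authors'
implementation.  It assumes that, for each sentence, the model supplies
the joint probability of the gold tree and the total probability of the
sentence (summed over all its parses).  The function name, the input
layout and the choice of normalizing per sentence or per word are
assumptions made for this example; the abstract does not specify them.

```python
import math


def conditional_cross_entropy(treebank, normalize_by="sentence"):
    """Average of -log2 p(gold tree | sentence) over a treebank.

    `treebank` is an iterable of (joint_prob, sentence_prob, n_words):
      joint_prob    -- model probability of the gold tree with its sentence
      sentence_prob -- model probability of the sentence, summed over parses
      n_words       -- sentence length, used only for per-word normalization
    This input layout is hypothetical, chosen for illustration.
    """
    if normalize_by not in ("sentence", "word"):
        raise ValueError("normalize_by must be 'sentence' or 'word'")

    total_bits = 0.0
    units = 0
    for joint_prob, sentence_prob, n_words in treebank:
        if not 0.0 < joint_prob <= sentence_prob:
            raise ValueError("need 0 < joint_prob <= sentence_prob")
        # p(tree | sentence) = p(tree, sentence) / p(sentence)
        total_bits += -math.log2(joint_prob / sentence_prob)
        units += 1 if normalize_by == "sentence" else n_words

    if units == 0:
        raise ValueError("empty treebank")
    return total_bits / units


# Toy data: the first sentence has two equally likely parses,
# the second sentence is unambiguous under the model.
toy = [(0.25, 0.5, 4), (0.1, 0.1, 2)]
per_sentence = conditional_cross_entropy(toy)
per_word = conditional_cross_entropy(toy, normalize_by="word")
```

Under this sketch, a treebank in which the model gives every sentence
a single parse scores 0, and the value grows as the model spreads
probability over competing parses of the same sentence.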
