[Corpora-List] Metrics for corpus "parseability"

Sun Feb 3 10:40:45 UTC 2008

Hi Sean,

> > Are there standard or widely accepted metrics for describing the
> > well-behavedness of corpora?
>
> The answer is, I think, a resounding 'no'.  There is disappointingly little
> work on systematically comparing corpora, or making objective general
> observations of one corpus in comparison to others.  (Citations proving me
> wrong are most welcome.  I'm aware of Sekine, Roland and Jurafsky,
> Cavaglia, and also work on genre by e.g. Karlgren, Santini, Sharoff, which
> touches on the topic.)
>

On comparing one corpus with others: there is a recent article
(in French) on how the performance of NLP tools varies when they are
applied to corpora of different genres and domains:

Marie-Paule Jacques and Nathalie Aussenac-Gilles (2006). "Variabilité
des performances des outils de TAL et genre textuel. Cas des patrons
lexico-syntaxiques". TAL. Volume 47 – n° 1/2006, pp. 11-32

Cheers, Marina
