[Corpora-List] Metrics for corpus "parseability"

Fri Feb 1 17:04:32 UTC 2008

Good day,

I'm working on a project in which we are attempting to characterize a 
few different corpora according to how "well-behaved" they are. In 
particular, we want to show that some are more amenable to parsing and 
part-of-speech tagging than others. Some of the corpora consist of 
complete, grammatical sentences; others are telegraphic, fragmentary 
text with a large number of abbreviations and misspellings.

One approach I've tried is to tag and parse each of the corpora with the 
Stanford tagger and parser, generate ranked lists of the unique tokens 
and tags, and look for certain errors, warnings, and phrase structures 
in the parser output. For instance, I'm counting how many sentences the 
parser had to retry, how many it failed to find any parse for, how many 
it ran out of memory on, and how many FRAG (sentence fragment) phrases 
appear in the parser output.
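
The FRAG count and the ranked tag list described above could be sketched
roughly as follows. This is only an illustration, not the script used in
the project. It assumes the parser output is available as one Penn
Treebank-style bracketed tree per string, and the tagger output as lines
of whitespace-separated word_TAG tokens. The function names and the "_"
separator are assumptions made for the example.

```python
import re
from collections import Counter

# Matches the label that follows every opening bracket, e.g. "(FRAG" or "(NN".
LABEL_RE = re.compile(r"\(([^\s()]+)")


def label_counts(tree):
    """Count phrase and POS labels in one bracketed parse tree string."""
    return Counter(LABEL_RE.findall(tree))


def frag_summary(trees):
    """Return (FRAG nodes, trees containing a FRAG, trees seen)."""
    total_frags = 0
    trees_with_frag = 0
    n_trees = 0
    for tree in trees:
        n_trees += 1
        frags = label_counts(tree)["FRAG"]
        total_frags += frags
        if frags:
            trees_with_frag += 1
    return total_frags, trees_with_frag, n_trees


def ranked_tags(tagged_lines, sep="_"):
    """Rank POS tags by frequency from lines of word<sep>TAG tokens.

    The separator is an assumption; adjust it to match the tagger output.
    """
    counts = Counter()
    for line in tagged_lines:
        for item in line.split():
            # Split on the last separator so words containing it still work.
            word, found, tag = item.rpartition(sep)
            if found:
                counts[tag] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any of the corpora.
    trees = [
        "(ROOT (FRAG (NP (NN pt)) (ADJP (JJ stable))))",
        "(ROOT (S (NP (PRP He)) (VP (VBD left)) (. .)))",
    ]
    print(frag_summary(trees))
    print(ranked_tags(["pt_NN stable_JJ", "fever_NN"]))
```

Dividing the number of trees that contain a FRAG by the number of trees
seen gives a per-corpus rate that can be compared across corpora of
different sizes. Retries, parse failures, and out-of-memory events would
have to be counted separately from the parser's own diagnostic messages,
which this sketch does not attempt to model.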

Are there standard or widely accepted metrics for describing the 
well-behavedness of corpora?

Many thanks,
Sean Igo

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora