[Corpora-List] Metrics for corpus "parseability"
Sean Igo
sgigo at xmission.com
Fri Feb 1 17:04:32 UTC 2008
Good day,
I'm working on a project in which we are attempting to characterize a
few different corpora according to how "well-behaved" they are. That is,
we want to show that some are more amenable than others to parsing and
part-of-speech tagging in particular. Some of the corpora consist of
complete, grammatical sentences and others are telegraphic, fragmentary
text including a large number of abbreviations and misspellings.
One approach I've tried is to tag and parse each of the corpora with the
Stanford tagger and parser, generating ranked lists of the unique tokens
and tags and looking for certain errors / warnings / phrase structures
in the parser output. For instance, I'm counting how many sentences the
parser had to retry, how many it could not parse at all, how many caused
it to run out of memory, and how many FRAG (sentence fragment) phrases
appear in the parser output.
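
A minimal sketch of that counting step, assuming the tagger output is
whitespace-separated word_TAG items and the parser output is one
bracketed tree per sentence. The function names, the "_" separator
default, and the three log patterns are assumptions made for
illustration; the patterns in particular are placeholders, not the
parser's actual messages, and would need to be replaced with whatever
strings appear in your own logs.

```python
import re
from collections import Counter

# Placeholder patterns (assumed, NOT the parser's real messages):
# replace each one with the wording found in your own parser logs.
EVENT_PATTERNS = {
    "retried": re.compile(r"\bretry\b", re.IGNORECASE),
    "no_parse": re.compile(r"\bno parse\b", re.IGNORECASE),
    "out_of_memory": re.compile(r"\bout of memory\b", re.IGNORECASE),
}

# Matches the opening of a FRAG constituent in a bracketed tree.
FRAG_NODE = re.compile(r"\(FRAG\b")


def rank_tokens_and_tags(tagged_text, sep="_"):
    """Return (tokens, tags) as frequency-ranked lists of (item, count)."""
    tokens, tags = Counter(), Counter()
    for item in tagged_text.split():
        # Split on the LAST separator so tokens containing it stay intact.
        word, found, tag = item.rpartition(sep)
        if not found:
            continue  # skip anything that is not a word/tag pair
        tokens[word.lower()] += 1
        tags[tag] += 1
    return tokens.most_common(), tags.most_common()


def count_parser_events(log_lines, trees):
    """Count log events and FRAG phrases; also report per-sentence rates."""
    counts = {name: 0 for name in EVENT_PATTERNS}
    counts["frag_phrases"] = 0
    counts["sentences_with_frag"] = 0
    for line in log_lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    for tree in trees:
        n_frag = len(FRAG_NODE.findall(tree))
        counts["frag_phrases"] += n_frag
        if n_frag:
            counts["sentences_with_frag"] += 1
    # Dividing by the number of trees makes corpora of different sizes comparable.
    n_sentences = len(trees)
    rates = {k: (v / n_sentences if n_sentences else 0.0) for k, v in counts.items()}
    return counts, rates


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    tokens, tags = rank_tokens_and_tags("pt_NN c/o_IN chest_NN pain_NN ._.")
    print(tags)
    counts, rates = count_parser_events(
        ["sentence 2: RETRY", "sentence 5: OUT OF MEMORY"],
        ["(ROOT (FRAG (NP (NN pt)) (FRAG (NP (NN pain)))))",
         "(ROOT (S (NP (PRP He)) (VP (VBZ sleeps))))"],
    )
    print(counts, rates)
```

Reporting the counts as per-sentence rates rather than raw totals is a
design choice of this sketch, so that corpora of different sizes can be
compared directly.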
Are there standard or widely accepted metrics for describing the
well-behavedness of corpora?
Many thanks,
Sean Igo
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora