[Corpora-List] Metrics for corpus "parseability"

Mon Feb 4 00:46:44 UTC 2008

Sean,
There is also some possibly relevant work on finding sources of difficulty for a particular parser, for example:

Gertjan van Noord. Error Mining for Wide-Coverage Grammar Engineering. In: ACL 2004, Barcelona.

This approach (roughly) finds the n-grams most often associated with a failure to obtain a complete parse, and might be adapted to predict in advance the well-behavedness of a corpus with respect to a particular parser and grammar.
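
As a rough illustration of the n-gram idea, here is a minimal Python sketch. It is not van Noord's implementation: it simply assumes you already have, for each sentence, its tokens and a flag saying whether the parser produced a complete parse, and it ranks n-grams by the fraction of sentences containing them that failed. The names ngrams, failure_rates, max_n and min_count are made up for this example.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def failure_rates(sentences, max_n=2, min_count=2):
    """Rank n-grams by how often they co-occur with a parse failure.

    sentences: iterable of (tokens, parsed_ok) pairs (an assumed input format).
    Returns (ngram, failure_rate, sentence_count) tuples, worst first.
    """
    total = Counter()
    failed = Counter()
    for tokens, parsed_ok in sentences:
        seen = set(ngrams(tokens, max_n))  # count each n-gram once per sentence
        total.update(seen)
        if not parsed_ok:
            failed.update(seen)
    rows = [
        (gram, failed[gram] / total[gram], total[gram])
        for gram in total
        if total[gram] >= min_count  # skip n-grams too rare to say much about
    ]
    # Highest failure rate first; ties broken by frequency, then alphabetically.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows
```

Run over each corpus, the proportion of sentences containing high-failure n-grams could then serve as one crude indicator of how badly a given parser and grammar are likely to cope with that corpus.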

In a similar vein, I've recently experimented with predicting parser error--not just failure to parse--using stable features, that is, ones that will not change as the parser is improved. The experiments did not use noisy data. As might be expected, the most important features were sentence length (in tokens), the normalized (per-token) parse speed, the number of basic chunks identified, and the preference score associated with the best parse (because the parser is preference-based).
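
For concreteness, the four features named above could be collected per sentence roughly as in the Python sketch below. This is only an illustration under assumptions: the function name, argument names, and dictionary keys are hypothetical, and nothing here reflects the actual parser, its output format, or the model used in the experiments.

```python
def stable_features(tokens, parse_seconds, chunk_count, preference_score):
    """Build one feature record for a parsed sentence.

    All argument names are hypothetical; the values are assumed to come
    from whatever logging the parser in question provides.
    """
    n_tokens = len(tokens)
    return {
        "length": n_tokens,  # sentence length in tokens
        # parse time normalized per token; 0.0 for an empty sentence
        "sec_per_token": parse_seconds / n_tokens if n_tokens else 0.0,
        "chunk_count": chunk_count,  # number of basic chunks identified
        "preference_score": preference_score,  # score of the best parse
    }
```

Records like these could be fed to any standard classifier or regressor to estimate the likelihood of a parse error for each sentence.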

Paula
- The number of chunks (ChunkCt) should be
> From: Sean Igo <sgigo at xmission.com>
> To: <CORPORA at UIB.NO>
> Date: 2/1/2008 5:02:02 PM
> Subject: [Corpora-List] Metrics for corpus "parseability"
>
> Good day,
>
> I'm working on a project in which we are attempting to characterize a 
> few different corpora according to how "well-behaved" they are. That is, 
> we want to show that some are more amenable in particular to parsing and 
> part-of-speech tagging than others. Some of the corpora consist of 
> complete, grammatical sentences and others are telegraphic, fragmentary 
> text including a large number of abbreviations and misspellings.
>
> One approach I've tried is to tag and parse each of the corpora with the 
> Stanford tagger and parser, generating ranked lists of the unique tokens 
> and tags and looking for certain errors / warnings / phrase structures 
> in the parser output. For instance, I'm counting how many sentences the 
> parser had to retry, how many it failed to find any parse for, how many 
> it ran out of memory while processing, and how many FRAG (sentence 
> fragment) phrases are found in the parser output.
>
> Are there standard or widely accepted metrics for describing the 
> well-behavedness of corpora?
>
> Many thanks,
> Sean Igo
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora