<div>Sean,</div>

<div> </div>

<div>Very interesting question.  We are approaching a similar task with similar methods: using RASP, if we set the timeout threshold to 1s, how many sentences time out?  We're also planning something similar with the Clark &amp; Curran parser.</div>


<div> </div>

<div>> Are there standard or widely accepted metrics for describing the<br>> well-behavedness of corpora?<br> </div>

<div>The answer is, I think, a resounding 'no'.  There is disappointingly little work on systematically comparing corpora, or on making objective general observations about one corpus in comparison to others.  (Citations proving me wrong are most welcome.  I'm aware of Sekine, Roland and Jurafsky, and Cavaglia, and also of work on genre by e.g. Karlgren, Santini and Sharoff, which touches on the topic.)</div>


<div><br>Adam<br><br>================================================<br>Adam Kilgarriff                                      <a href="http://www.kilgarriff.co.uk">http://www.kilgarriff.co.uk</a>              <br>Lexical Computing Ltd                   <a href="http://www.sketchengine.co.uk">http://www.sketchengine.co.uk</a><br>

Lexicography MasterClass Ltd      <a href="http://www.lexmasterclass.com">http://www.lexmasterclass.com</a><br>Universities of Leeds and Sussex       <a href="mailto:adam@lexmasterclass.com">adam@lexmasterclass.com</a><br>

================================================ </div>

<div><span class="gmail_quote">On 01/02/2008, <b class="gmail_sendername">Sean Igo</b> &lt;<a href="mailto:sgigo@xmission.com">sgigo@xmission.com</a>&gt; wrote:</span>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">Good day,<br><br>I'm working on a project in which we are attempting to characterize a<br>few different corpora according to how "well-behaved" they are. That is,<br>

we want to show that some are more amenable in particular to parsing and<br>part-of-speech tagging than others. Some of the corpora consist of<br>complete, grammatical sentences and others are telegraphic, fragmentary<br>

text including a large number of abbreviations and misspellings.<br><br>One approach I've tried is to tag and parse each of the corpora with the<br>Stanford tagger and parser, generating ranked lists of the unique tokens<br>

and tags and looking for certain errors / warnings / phrase structures<br>in the parser output. For instance, I'm counting how many sentences the<br>parser had to retry, how many it failed to find any parse for, how many<br>

it ran out of memory while processing, and how many FRAG (sentence<br>fragment) phrases are found in the parser output.<br><br>Are there standard or widely accepted metrics for describing the<br>well-behavedness of corpora?<br>

<br>Many thanks,<br>Sean Igo<br><br>_______________________________________________<br>Corpora mailing list<br><a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br><a href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a><br>

</blockquote></div>