[Corpora-List] Metrics for corpus "parseability"

Adam Kilgarriff adam at lexmasterclass.com
Sun Feb 3 06:56:29 UTC 2008


Sean,

Very interesting question.  We are approaching a similar task with similar
methods: using RASP, if we set the timeout threshold to 1s, how many
sentences time out?  We're also planning something similar with the
Clark & Curran parser.
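
As a rough illustration of the timeout measure, here is a minimal sketch.
It assumes the parser can be run as an external command that reads one
sentence on stdin; the command shown (STAND_IN_PARSER) is a hypothetical
stand-in, not the real RASP invocation, and the wall-clock limit is
enforced from outside the parser rather than by any parser setting.

```python
import subprocess
import sys

# Hypothetical stand-in for a parser command: it sleeps for 10 seconds when
# the input contains the word "slow". Replace with a real parser invocation.
STAND_IN_PARSER = [
    sys.executable, "-c",
    "import sys, time; s = sys.stdin.read(); "
    "time.sleep(10 if 'slow' in s else 0)",
]


def timeout_rate(sentences, parser_cmd, timeout_s=1.0):
    """Run parser_cmd once per sentence; return (timed_out, total).

    A sentence counts as timed out when the command has not finished
    within timeout_s seconds of wall-clock time.
    """
    timed_out = 0
    total = 0
    for sentence in sentences:
        total += 1
        try:
            subprocess.run(
                parser_cmd,
                input=sentence,
                text=True,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            timed_out += 1
    return timed_out, total
```

The proportion timed_out / total can then be compared across corpora,
provided the same parser, hardware and threshold are used for each.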

> Are there standard or widely accepted metrics for describing the
> well-behavedness of corpora?

The answer is, I think, a resounding 'no'.  There is disappointingly little
work on systematically comparing corpora, or on making objective general
observations about one corpus in comparison to others.  (Citations proving me
wrong are most welcome.  I'm aware of Sekine, Roland and Jurafsky, and
Cavaglia, and also of work on genre by e.g. Karlgren, Santini and Sharoff,
which touches on the topic.)

Adam

================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================
On 01/02/2008, Sean Igo <sgigo at xmission.com> wrote:
>
> Good day,
>
> I'm working on a project in which we are attempting to characterize a
> few different corpora according to how "well-behaved" they are. That is,
> we want to show that some are more amenable in particular to parsing and
> part-of-speech tagging than others. Some of the corpora consist of
> complete, grammatical sentences and others are telegraphic, fragmentary
> text including a large number of abbreviations and misspellings.
>
> One approach I've tried is to tag and parse each of the corpora with the
> Stanford tagger and parser, generating ranked lists of the unique tokens
> and tags and looking for certain errors / warnings / phrase structures
> in the parser output. For instance, I'm counting how many sentences the
> parser had to retry, how many it failed to find any parse for, how many
> it ran out of memory while processing, and how many FRAG (sentence
> fragment) phrases are found in the parser output.
>
> Are there standard or widely accepted metrics for describing the
> well-behavedness of corpora?
>
> Many thanks,
> Sean Igo
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
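
The FRAG count described in the quoted message can be sketched as below.
This is a minimal illustration, assuming the parser output is available as
one Penn Treebank-style bracketed tree per sentence; the function name and
the returned keys are hypothetical, and retries, failures and out-of-memory
events would need to be read from the parser's own log, which is not
modelled here.

```python
import re

# Matches the opening of a FRAG constituent, e.g. "(FRAG " or "(FRAG-TTL ".
FRAG_LABEL = re.compile(r"\(FRAG\b")


def frag_stats(parses):
    """Count FRAG constituents over an iterable of bracketed parse strings."""
    total = 0
    frag_nodes = 0
    sentences_with_frag = 0
    for tree in parses:
        total += 1
        n = len(FRAG_LABEL.findall(tree))
        frag_nodes += n
        if n:
            sentences_with_frag += 1
    return {
        "sentences": total,
        "frag_nodes": frag_nodes,
        "sentences_with_frag": sentences_with_frag,
    }
```

Reporting sentences_with_frag as a fraction of sentences gives a figure that
is comparable across corpora of different sizes.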
