[Corpora-List] Metrics for corpus "parseability"

Sun Feb 3 14:40:45 UTC 2008

Hi Sean,

You could start with something straightforward such as evaluating
coverage against a lexicon across the various corpora. In LREC2004, we
looked at this when evaluating a semantic tagger over written, spoken,
domain-specific and historical corpora:

Piao, Scott S. L., Paul Rayson, Dawn Archer, Tony McEnery (2004).
Evaluating Lexical Resources for A Semantic Tagger. In Proceedings of
the 4th International Conference on Language Resources and Evaluation
(LREC 2004), May 2004, Lisbon, Portugal, Volume II, pp. 499-502.

http://www.comp.lancs.ac.uk/computing/users/paul/publications/pram_lrec04.pdf
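
As a rough illustration of the lexicon-coverage idea above (a minimal
sketch, not the method used in the LREC paper): the tokenizer is a
deliberately naive regular expression, and the lexicon and example
texts are invented for illustration. It reports token coverage (the
share of running words found in the lexicon) and type coverage (the
share of distinct word forms found).

```python
import re
from collections import Counter


def lexicon_coverage(text, lexicon):
    """Return (token_coverage, type_coverage) of `text` against `lexicon`.

    `lexicon` is any set-like collection of lowercase word forms.
    Tokenization here is a naive assumption: runs of ASCII letters,
    optionally with an internal apostrophe.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    known_tokens = sum(n for word, n in counts.items() if word in lexicon)
    known_types = sum(1 for word in counts if word in lexicon)
    return known_tokens / len(tokens), known_types / len(counts)


# Hypothetical toy lexicon and two invented "corpora".
lexicon = {"the", "patient", "was", "seen", "today", "with", "pain"}
clean = "The patient was seen today with pain"
noisy = "pt seen tdy w/ pain"

print(lexicon_coverage(clean, lexicon))
print(lexicon_coverage(noisy, lexicon))
```

Comparing the two figures across corpora gives a first, parser-independent
indication of how much abbreviation and spelling variation each contains.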

Related to misspellings (actually historical variants) and grammatical
variation across corpora (over time), we've recently compared the
accuracy of CLAWS on modern and historical corpora:

Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007).
Tagging the Bard: Evaluating the accuracy of a modern POS tagger on
Early Modern English corpora. In proceedings of Corpus Linguistics 2007,
July 27-30, University of Birmingham, UK.

http://ucrel.lancs.ac.uk/publications/CL2007/paper/192_Paper.pdf

Regards,

Paul.

Dr. Paul Rayson

Director of UCREL

Computing Department, Infolab21, South Drive, Lancaster University,
Lancaster, LA1 4WA, UK.

Web: http://www.comp.lancs.ac.uk/computing/users/paul/

Tel: +44 1524 510357 Fax: +44 1524 510492

________________________________

From: corpora-bounces at uib.no On Behalf Of Adam Kilgarriff
Sent: 03 February 2008 06:56
To: Sean Igo
Cc: CORPORA at uib.no
Subject: Re: [Corpora-List] Metrics for corpus "parseability"

Sean,

Very interesting question. We are approaching a similar task with
similar methods: using RASP, if we set the timeout threshold to 1s, how
many sentences time out? We're also planning something similar with
the Clark & Curran parser.
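
A timeout-rate measurement of this kind might be sketched as follows.
This is only an illustration under stated assumptions: it does not use
RASP's actual command line, which is not given here. Instead, `command`
is any program that reads one sentence on standard input, and
FAKE_PARSER is a hypothetical stand-in that sleeps on sentences
containing the word "slow".

```python
import subprocess
import sys


def timeout_rate(sentences, command, timeout_s=1.0):
    """Fraction of sentences for which `command` does not finish in time.

    Each sentence is sent on stdin to a fresh process; a process that
    exceeds `timeout_s` seconds is killed and counted as a timeout.
    """
    timed_out = 0
    for sentence in sentences:
        try:
            subprocess.run(command, input=sentence, text=True,
                           capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            timed_out += 1
    return timed_out / len(sentences) if sentences else 0.0


# Hypothetical stand-in for a real parser: sleeps 5s on "slow" input.
FAKE_PARSER = [
    sys.executable, "-c",
    "import sys, time; s = sys.stdin.read(); "
    "time.sleep(5 if 'slow' in s else 0)",
]
```

Calling timeout_rate(["a quick one", "a slow one"], FAKE_PARSER) would
then report the share of sentences that hit the threshold. Note that
starting a new process per sentence adds its own overhead to the
measured time, so a real setup would more likely rely on the parser's
built-in timeout option where one exists.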

> Are there standard or widely accepted metrics for describing the
> well-behavedness of corpora?

The answer is, I think, a resounding 'no'. There is disappointingly
little work on systematically comparing corpora, or making objective
general observations of one corpus in comparison to others. (Citations
proving me wrong are most welcome. I'm aware of Sekine, Roland and
Jurafsky, Cavaglia, also work on genre by e.g. Karlgren, Santini,
Sharoff, which touches on the topic.)

Adam

================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk              
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================ 

On 01/02/2008, Sean Igo <sgigo at xmission.com> wrote: 

Good day,

I'm working on a project in which we are attempting to characterize a
few different corpora according to how "well-behaved" they are. That is,
we want to show that some are more amenable in particular to parsing and
part-of-speech tagging than others. Some of the corpora consist of
complete, grammatical sentences and others are telegraphic, fragmentary
text including a large number of abbreviations and misspellings.

One approach I've tried is to tag and parse each of the corpora with the
Stanford tagger and parser, generating ranked lists of the unique tokens
and tags and looking for certain errors / warnings / phrase structures
in the parser output. For instance, I'm counting how many sentences the
parser had to retry, how many it failed to find any parse for, how many
it ran out of memory while processing, and how many FRAG (sentence
fragment) phrases are found in the parser output.
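
The FRAG and tag counts described above could be computed along these
lines. This is a minimal sketch that assumes the parser output is
available as one Penn Treebank-style bracketed tree per string; it does
not call the Stanford tools themselves, and the two example trees are
invented.

```python
import re
from collections import Counter


def tree_stats(trees):
    """Summarize bracketed parse trees, one tree per string.

    Returns the number of FRAG phrases, the share of sentences that
    contain at least one FRAG, and POS tags ranked by frequency.
    """
    frag_phrases = 0
    frag_sentences = 0
    tags = Counter()
    for tree in trees:
        # Every node label follows an opening bracket.
        labels = re.findall(r"\(([^\s()]+)", tree)
        n_frag = labels.count("FRAG")
        frag_phrases += n_frag
        frag_sentences += 1 if n_frag else 0
        # A preterminal looks like "(TAG word)".
        tags.update(re.findall(r"\(([^\s()]+) [^\s()]+\)", tree))
    return {
        "frag_phrases": frag_phrases,
        "frag_sentence_rate": frag_sentences / len(trees) if trees else 0.0,
        "ranked_tags": tags.most_common(),
    }


# Invented example trees, for illustration only.
trees = [
    "(ROOT (S (NP (DT The) (NN patient)) (VP (VBD slept)) (. .)))",
    "(ROOT (FRAG (NP (NN pt)) (ADJP (JJ stable))))",
]
stats = tree_stats(trees)
print(stats["frag_phrases"], stats["frag_sentence_rate"])
print(stats["ranked_tags"][:3])
```

Retries, parse failures and out-of-memory events would have to be
counted separately from the parser's log messages, whose exact wording
is not assumed here.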

Are there standard or widely accepted metrics for describing the
well-behavedness of corpora?

Many thanks,
Sean Igo

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
