[Corpora-List] Metrics for corpus "parseability"

Tue Feb 5 16:27:39 UTC 2008

Miles,

I fail to see precisely what the 3-SAT paper you reference has to do with the 
issue under discussion.  It seems to concern how the difficulty of solving (or 
the existence of a solution to) certain mathematical equations changes as 
certain statistics of the form of the equation change.  Maybe you intend to 
draw an analogy with language-form statistics such as reading age stats?

For example, there are various well-known "reading age" statistics that count 
the average length of sentences, the length of various words used (or 
correlated stats such as number of syllables).  One is the Flesch readability 
score, another the Bormuth Grade Level (others evident from a quick perusal of 
Google).  They comprise somewhat arbitrary-looking formulae such as 0.134*ASL 
+ 5.2 * ASW - 2.134 (example made up, but ASL is Average Sentence Length, and 
ASW is Average Syllables per Word).  Other statistics include proportion of 
passive verb forms, proportion of "easy" words, and various other forms of 
linguo-statistical JuJu.  Most of them have the property that if you randomly 
rearrange the words in each sentence the statistic is invariant.  Is this the 
sort of statistic you mean?  I think these statistics are useful as a rule of 
thumb if you can *assume* the input is well formed and generated by a human 
being who is not trying to fool the system.
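
As a rough illustration of the invariance point, here is a minimal Python 
sketch.  It uses the commonly published Flesch Reading Ease coefficients 
(206.835 - 1.015*ASL - 84.6*ASW) rather than the made-up formula above.  The 
syllable counter (vowel groups) and the sentence splitter are crude 
assumptions of this sketch, not part of any standard definition, and the 
function names are hypothetical.

```python
import random
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word):
    """Crude estimate (an assumption): one syllable per vowel group, minimum 1."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def split_sentences(text):
    """Naive splitter: break on . ! ? and keep runs of letters/apostrophes."""
    sentences = []
    for chunk in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z']+", chunk)
        if words:
            sentences.append(words)
    return sentences


def flesch_reading_ease(sentences):
    """Flesch Reading Ease from a list of tokenised sentences."""
    n_words = sum(len(s) for s in sentences)
    n_syllables = sum(count_syllables(w) for s in sentences for w in s)
    asl = n_words / len(sentences)   # Average Sentence Length (words)
    asw = n_syllables / n_words      # Average Syllables per Word
    return 206.835 - 1.015 * asl - 84.6 * asw


def shuffle_within_sentences(sentences, seed=0):
    """Randomly reorder the words inside each sentence."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        words = list(sentence)
        rng.shuffle(words)
        shuffled.append(words)
    return shuffled


text = "The cat sat on the mat. Parsing long sentences is considerably harder."
original = split_sentences(text)
scrambled = shuffle_within_sentences(original)

# Both scores depend only on word and syllable counts, so they are identical.
print(flesch_reading_ease(original), flesch_reading_ease(scrambled))
```

The score only ever sees counts, so scrambled word salad gets exactly the 
same score as the original text.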

I think that the consensus position on reading age stats is that the following 
are probably true on average:

(1) Shorter sentences are easier to process for both humans and computers.

(2) Sentences of the same length containing shorter words are easier to 
process.

(3) Sentences of the same length containing more closed class words are likely 
to be easier to process. (NB - high correlation w/ (2))

(4) Sentences of the same length exhibiting certain syntactic phenomena such 
as the passive form (maybe evidenced or approximated by the presence of 
certain parts of speech) are likely to be harder to process.

(5) Sentences of the same length containing more common words are likely to be 
easier to process (NB correlation w/ (2) and (3)).

I think that (2), (3) and (5) are correlated and their relative contributions 
need to be teased apart.
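
A first step towards teasing these apart would be to compute the features 
per sentence, so that sentences of the same length can be compared.  The 
sketch below is only illustrative: the closed class word list is a small 
hand-picked sample (an assumption, not a standard inventory), and 
sentence_features is a hypothetical helper.  Word frequency, as in (5), would 
need an external frequency list and is left out.

```python
# Small illustrative sample of English closed class words (an assumption).
CLOSED_CLASS = frozenset(
    "a an the of in on at to by for with from and or but if that which "
    "is are was were be been it he she they we you i this these those not".split()
)


def sentence_features(words):
    """Return length, mean word length and closed class ratio for one sentence."""
    if not words:
        raise ValueError("sentence must contain at least one word")
    n = len(words)
    n_closed = sum(1 for w in words if w.lower() in CLOSED_CLASS)
    return {
        "length": n,                                          # relates to (1)
        "mean_word_length": sum(len(w) for w in words) / n,   # relates to (2)
        "closed_class_ratio": n_closed / n,                   # relates to (3)
    }


print(sentence_features(["The", "cat", "sat", "on", "the", "mat"]))
```

Grouping sentences by "length" and then comparing the other two features 
within each group is one simple way to hold (1) fixed.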

Now all of these reading age statistics are at best "rule of thumb" estimates, 
but they are used by some publishers, and hence are likely to have some 
empirical basis (although I have not seen the science).  What relation they 
may have to parsing algorithms is unclear, and there are clearly cases where 
they can be fooled by the mischievous.  However, it might be interesting to 
investigate such statistics to see to what extent they correlate with 
algorithmic measures of grammatical coverage and/or accuracy of a given 
parser, for example.  Such statistics are certainly something to control for 
in any attempt to devise a better and more well-founded statistic for 
"parseability".
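
One simple way to run such an investigation, sketched here under stated 
assumptions: take one readability score per sentence (or per document) and 
one parser score for the same item (say, labelled bracket accuracy from 
whatever parser and evaluation you use), and compute the Pearson correlation 
between the two lists.  The function name and the choice of Pearson's r are 
assumptions of this sketch; a rank correlation might suit skewed scores 
better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Calling pearson(readability_scores, parser_scores) on real paired 
measurements gives a value between -1 and 1.  A value near zero would 
suggest the reading age statistic says little about how that particular 
parser fares on that text.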

- Steve.

On Tuesday 05 February 2008 08:28, Miles Osborne wrote:
> Actually, I think you have misunderstood what I said:  this truly is about
> the data and not about "algorithms".  What I said was that you need to be
> able to understand about the hardness of the sentences themselves, without
> reference to the parser etc.  Read that sample paper and you will know what
> I mean.
>
> Miles
>
> On 05/02/2008, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> > On 04/02/2008, Miles Osborne <miles at inf.ed.ac.uk> wrote:
> > > I must confess, the idea that a corpus can be described in terms of
> > > "parseability" sounds a little ill-founded to me.  The choice of
> > > particular parsing algorithm may dictate which examples are hard to
> > > process, as will the underlying grammar etc etc.
> >
> > I couldn't disagree more.  It's the equivalent of saying that it's
> > ill-founded to evaluate parsers because they will always perform
> > differently on different corpora. It just goes to show that you're
> > interested in algorithms not data.  The field is way imbalanced by people
> > who think more about algorithms than the corpora they apply them to.
> >
> > Adam
> >
> >
> > --
> >
> > > ================================================
> > > Adam Kilgarriff
> > > http://www.kilgarriff.co.uk
> > > Lexical Computing Ltd                   http://www.sketchengine.co.uk
> > > Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> > > Universities of Leeds and Sussex       adam at lexmasterclass.com
> > > ================================================

-- 
Steven Finch
Daxtra Technologies
Tel: +44 (0)131 653 1250
Email: s.finch at daxtra.com
