[Corpora-List] Metrics for corpus "parseability"
Steve Finch
s.finch at daxtra.com
Tue Feb 5 16:27:39 UTC 2008
Miles,
I fail to see precisely what the 3-SAT paper you reference, which seems to
refer to the difficulty of solving (or of there being a solution to) certain
mathematical equations as certain statistics of the form of the equation
change, has to do with the issue under discussion. Maybe you intend to draw
an analogy with language-form statistics such as reading age stats?
For example, there are various well-known "reading age" statistics that count
the average length of sentences, the length of various words used (or
correlated stats such as number of syllables). One is the Flesch readability
score, another the Bormuth Grade Level (others are evident from a quick search
on Google). They comprise somewhat arbitrary-looking formulae such as 0.134*ASL
+ 5.2*ASW - 2.134 (example made up, but ASL is Average Sentence Length, and
ASW is Average Syllables per Word). Other statistics include proportion of
passive verb forms, proportion of "easy" words, and various other forms of
linguo-statistical JuJu. Most of them have the property that if you randomly
rearrange the words in each sentence the statistic is invariant. Is this the
sort of statistic you mean? I think these statistics are useful as a rule of
thumb if you can *assume* the input is well formed and generated by a human
being who is not trying to fool the system.
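As a rough sketch of the kind of statistic described above, the following
Python computes a*ASL + b*ASW + c and shuffles the words within each
sentence to check that the value does not change. The default coefficients
are the made-up ones from the example formula, not those of any published
score, and the sentence splitter and vowel-group syllable counter are crude
assumptions chosen only to keep the example self-contained.

```python
import random
import re


def count_syllables(word):
    """Crude syllable estimate: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def split_sentences(text):
    """Naive splitter: break on '.', '!' or '?', then tokenize on whitespace."""
    return [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]


def reading_stat(sentences, a=0.134, b=5.2, c=-2.134):
    """Compute a*ASL + b*ASW + c over a list of tokenized sentences.

    The default coefficients are the made-up ones from the text above.
    """
    words = [w for sentence in sentences for w in sentence]
    if not words:
        raise ValueError("need at least one word")
    asl = len(words) / len(sentences)  # Average Sentence Length, in words
    asw = sum(count_syllables(w) for w in words) / len(words)  # Avg Syllables per Word
    return a * asl + b * asw + c


def shuffle_within_sentences(sentences, seed=0):
    """Return a copy with the words of each sentence randomly reordered."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        words = list(sentence)
        rng.shuffle(words)
        shuffled.append(words)
    return shuffled
```

Because the statistic depends only on the number of sentences, the number of
words and the total syllable count, reading_stat(sentences) and
reading_stat(shuffle_within_sentences(sentences)) return the same value for
any input.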
I think that the consensus position on reading age stats is that the following
are probably true on average:
(1) Shorter sentences are easier to process for both humans and computers.
(2) Sentences of the same length containing shorter words are easier to
process.
(3) Sentences of the same length containing more closed class words are likely
to be easier to process. (NB - high correlation w/ (2))
(4) Sentences of the same length exhibiting certain syntactic phenomena such
as the passive form (maybe evidenced or approximated by the presence of
certain parts of speech) are likely to be harder to process.
(5) Sentences of the same length containing more common words are likely to be
easier to process (NB correlation w/ (2) and (3)).
I think that (2), (3) and (5) are correlated and their relative contributions
need to be teased apart.
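To tease such contributions apart one would first need the individual
quantities per sentence. A minimal sketch, assuming sentences are already
tokenized: CLOSED_CLASS below is a tiny illustrative word list, not a real
inventory of English closed-class words, and common_words is a hypothetical
set the caller would supply from a frequency list of their choosing.

```python
# Tiny illustrative list only; a real study would use a proper inventory.
CLOSED_CLASS = frozenset({
    "the", "a", "an", "of", "in", "on", "at", "to", "and", "or", "but",
    "is", "was", "it", "he", "she", "they", "that", "this", "with", "by", "for",
})


def sentence_features(tokens, common_words=frozenset()):
    """Per-sentence quantities corresponding to points (1), (2), (3) and (5)."""
    if not tokens:
        raise ValueError("empty sentence")
    lowered = [t.lower() for t in tokens]
    n = len(lowered)
    return {
        "length": n,                                                       # (1)
        "mean_word_length": sum(len(t) for t in lowered) / n,              # (2)
        "closed_class_ratio": sum(t in CLOSED_CLASS for t in lowered) / n,  # (3)
        "common_word_ratio": sum(t in common_words for t in lowered) / n,   # (5)
    }
```

With these in hand, sentences can be grouped by length so that the remaining
three quantities are compared only among sentences of the same length, which
is how points (2), (3) and (5) are phrased.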
Now all of these reading age statistics are at best "rule of thumb" estimates,
but they are used by some publishers, and hence are likely to have some
empirical basis (although I have not seen the science). What relation they
may have to parsing algorithms is unclear, and there are clearly cases where
they can be fooled by the mischievous. However, it might be interesting to
investigate such statistics to see to what extent they correlate with
algorithmic measures of grammatical coverage and/or accuracy of a given
parser, for example. Such statistics are certainly something to control for
in any attempt to devise a better-founded statistic for
"parseability".
- Steve.
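
The correlation check suggested above could be sketched along these lines.
This is only an illustration in plain Python: it assumes one already has two
equal-length lists, one holding a reading-age statistic per sentence (or per
document) and one holding some parser accuracy or coverage score for the same
items. Where those scores come from is left open, and rank correlation
(Spearman) is an assumed choice, made because the formulae are not obviously
linear in parser accuracy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

A call would look like spearman(stat_per_item, accuracy_per_item), where both
argument names are hypothetical. A value near zero would suggest the
statistic says little about that parser's behaviour on that corpus.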
On Tuesday 05 February 2008 08:28, Miles Osborne wrote:
> Actually, I think you have misunderstood what I said: this truly is about
> the data and not about "algorithms". What I said was that you need to be
> able to understand about the hardness of the sentences themselves, without
> reference to the parser etc. Read that sample paper and you will know what
> I mean.
>
> Miles
>
> On 05/02/2008, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> > On 04/02/2008, Miles Osborne <miles at inf.ed.ac.uk> wrote:
> > > I must confess, the idea that a corpus can be described in terms of
> > > "parseability" sounds a little ill-founded to me. The choice of
> > > particular parsing algorithm may dictate which examples are hard to
> > > process, as will the underlying grammar etc etc.
> >
> > I couldn't disagree more. It's the equivalent of saying that it's
> > ill-founded to evaluate parsers because they will always perform
> > differently on different corpora. It just goes to show that you're
> > interested in algorithms not data. The field is way imbalanced by people
> > who think more about algorithms than the corpora they apply them to.
> >
> > Adam
> >
> >
> > --
> >
> > > ================================================
> > > Adam Kilgarriff
> > > http://www.kilgarriff.co.uk
> > > Lexical Computing Ltd http://www.sketchengine.co.uk
> > > Lexicography MasterClass Ltd http://www.lexmasterclass.com
> > > Universities of Leeds and Sussex adam at lexmasterclass.com
> > > ================================================
--
Steven Finch
Daxtra Technologies
Tel: +44 (0)131 653 1250
Email: s.finch at daxtra.com