[Corpora-List] agent and patient probabilities

Wed Jan 24 14:43:05 UTC 2007

Jim,

> For some experiments, we need agent-verb-patient triples where the 
> "goodness" of the agents and patients to the verb vary in strength. 
> Typical ways to develop materials for such studies is by having human
>  subjects rate how "good" various items are as agents and patients
> for particular verbs (e.g., "how likely is a dog to walk?", "how
> likely is a dog to be walked?"). While this works well, it's of
> course very labor (and subject) intensive. So I'm hoping to automate
> this.

Philip Resnik's work is definitely an excellent place to look.

Beyond that, my work on modelling human language processing might also
be of interest to you. One large part of my PhD work (the thesis was
submitted recently) was to build a model that predicts human judgements
about the plausibility of verb-argument-relation triples.

Key differences from Resnik's work are a generative formulation (i.e.,
plausible roles and arguments can be straightforwardly generated given a
verb) and the use of thematic roles to define the relation between verb
and argument. We tested the model against literature norming data (e.g.,
McRae et al. 1998, Trueswell et al. 1994) and against norms we elicited
ourselves for verb-argument-role triples extracted from corpora.

Details can be found in
U. Pado, M. Crocker and F. Keller, Modelling Semantic Role Plausibility
in Human Sentence Processing. EACL, Trento, 2006.
and
U. Pado, F. Keller and M. Crocker, Combining Syntax and Thematic Fit in
a Probabilistic Model of Sentence Processing. CogSci, Vancouver, 2006.

In the thesis, I also compare our model to Philip Resnik's and two other
selectional preference models on the sets of norming data I mentioned. I
replicate Resnik's original successful evaluation, but our model tends
to do slightly better at predicting plausibility judgements across the
different data sets.

If you'd like more information or have any questions, please let me know :)

> I know about the Penn Treebank; are there better and/or less
> expensive options for US English, or is this just the way to go?

It might be worthwhile to use a role-annotated corpus to make sure you
really catch the verb-argument relations you're after.
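
To make that concrete, here is a minimal sketch of the simplest thing you
could do once you have (verb, role, argument) triples from such a corpus:
estimate P(role, argument | verb) by relative frequency. The toy triples and
the function name are made up for illustration, and this is plain counting
without any smoothing or generalisation, not the model described in the
papers above.

```python
from collections import Counter

# Hypothetical toy triples (verb, role, argument head). Real triples would
# be extracted from a role-annotated corpus; these are invented.
triples = [
    ("walk", "agent", "dog"),
    ("walk", "agent", "man"),
    ("walk", "agent", "man"),
    ("walk", "patient", "dog"),
    ("walk", "patient", "dog"),
    ("walk", "patient", "dog"),
]


def role_arg_given_verb(triples, verb, role, arg):
    """Relative-frequency estimate of P(role, arg | verb).

    Returns 0.0 for a verb that never occurs in the triples.
    """
    counts = Counter(triples)
    verb_total = sum(c for (v, _, _), c in counts.items() if v == verb)
    if verb_total == 0:
        return 0.0
    return counts[(verb, role, arg)] / verb_total


# "How likely is a dog to walk?" vs. "How likely is a dog to be walked?"
p_agent = role_arg_given_verb(triples, "walk", "agent", "dog")
p_patient = role_arg_given_verb(triples, "walk", "patient", "dog")
print(f"dog as agent of walk:   {p_agent:.3f}")
print(f"dog as patient of walk: {p_patient:.3f}")
```

In practice raw counts like these are very sparse, so unseen but perfectly
plausible triples get probability zero; some form of smoothing or
class-based generalisation is needed before the estimates are usable as
plausibility scores.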

PropBank (role annotations on parts of the Penn Treebank) is the
largest role-annotated corpus available, and it's American English, but
you may want to have a look at the FrameNet corpus as well. It's a
subset of the British National Corpus, and therefore much more balanced
in vocabulary.  For example, I find that its vocabulary is closer to
"typical" psycholinguistic items than that of PropBank with its bias
towards financial language.

The FrameNet home page is at http://framenet.icsi.berkeley.edu/, and if
I understand correctly, the corpus is free for research purposes.

Regards,

Ulrike

-- 
  Ulrike Pado

  Computational Linguistics
  Saarland University
  D-66041 Saarbrücken

  www.coli.uni-sb.de/~ulrike