analysis: unhappiness

Fri Sep 10 13:03:20 UTC 2010

Dear Dan, Dick:

I would like to clarify some points that Dan Everett makes, in  
response to Dick Hudson.

Ev Fedorenko and I have written a couple of papers recently (Gibson &  
Fedorenko, 2010, in press; see references and links below) on what we  
think are weak methodological standards in syntax and semantics  
research over the past many years.  The issue that we address is the  
prevalent method in syntax and semantics research, which involves  
obtaining a judgment of the acceptability of a sentence / meaning  
pair, typically by just the author of the paper, sometimes with  
feedback from colleagues.  As we address in our papers, this  
methodology does not allow proper testing of scientific hypotheses  
because of (a) the small number of experimental participants  
(typically one); (b) the small number of experimental stimuli  
(typically one); (c) cognitive biases on the part of the researcher  
and participants; and (d) the effect of the preceding context (e.g.,  
other constructions the researcher may have been recently  
considering).  (As Dan said, see Schutze, 1996; Cowart, 1997; and  
several others cited in Gibson & Fedorenko, in press, for similar  
points, although their conclusions are not as strong as ours.)

Three issues need to be separated here: (1) the use of intuitive  
judgments as a dependent measure in a language experiment; (2)  
potential cognitive biases on the part of experimental subjects and  
experimenters in language experiments; and (3) the need for obtaining  
quantitative evidence, whatever the dependent measure might be.  The  
paper that Ev and I wrote addresses the last two issues, but does not  
go into depth on the first issue (the use of intuitions as a dependent  
measure in language experiments).  Regarding this issue, we don't  
think that there is anything wrong with gathering intuitive judgments  
as a dependent measure, as long as the task is clear to the  
experimental participants.

In the longer paper (Gibson & Fedorenko, in press) we respond to some  
arguments that have been given in support of continuing to use the  
traditional non-quantitative method in syntax / semantics research.   
One recent defense of the traditional method comes from Phillips  
(2008), who argues that no harm has come from the non-quantitative  
approach in syntax research thus far.  Phillips argues that there are  
no cases in the literature where an incorrect intuitive judgment has  
become the basis for a widely accepted generalization or an important  
theoretical claim.  He therefore concludes that there is no reason to  
adopt more rigorous data collection standards.  We challenge Phillips’  
conclusion by presenting three cases from the literature where a  
faulty intuition has led to incorrect generalizations and mistaken  
theorizing, plausibly due to cognitive biases on the part of the  
researchers.

A second argument that is sometimes presented for the continued use of  
the traditional non-quantitative method is that it would be too  
inefficient to evaluate every syntactic / semantic hypothesis or  
phenomenon quantitatively.  For example, Culicover & Jackendoff (2010)  
make this argument explicitly in their response to Gibson & Fedorenko  
(2010): “It would cripple linguistic investigation if it were required  
that all judgments of ambiguity and grammaticality be subject to  
statistically rigorous experiments on naive subjects, especially when  
investigating languages whose speakers are hard to access” (Culicover  
& Jackendoff, 2010, p. 234).  (Dick Hudson makes a similar point  
earlier in the discussion here.)  Whereas we agree that in  
circumstances where gathering data is difficult, some evidence is  
better than no evidence, we do not agree that research would be slowed  
with respect to languages where experimental participants are easy to  
access, such as English.  On the contrary, we think that the field’s  
progress is probably slowed by not doing quantitative research.

Suppose that a typical syntax / semantics paper that lacks  
quantitative evidence includes judgments for 50 or more sentence /  
meaning pairs, corresponding to 50 or more empirical claims.  Even if  
most of the judgments from such a paper are correct or are on the  
right track, the problem is in knowing which judgments are correct.   
For example, suppose that 90% of the judgments from an arbitrary paper  
are correct (which is probably a high estimate).  (Colin Phillips and  
some of his former students / postdocs have commented to us that, in  
their experience, quantitative acceptability judgment studies almost  
always validate the claim(s) in the literature.  This is not our  
experience, however.  Most experiments that we have run which attempt  
to test some syntactic / semantic hypothesis in the literature end up  
providing us with a pattern of data that had not been known before the  
experiment (e.g., Breen et al., in press; Fedorenko & Gibson, in  
press; Patel et al., 2009; Scontras & Gibson, submitted).)  Under the  
90% assumption, 45 of the 50 empirical claims in such a paper are  
correct.  But which 45?  There are 2,118,760 ways to choose 45 items  
from 50.  That’s over two million different theories.  By  
quantitatively evaluating the  
empirical claims, we reduce the uncertainty a great deal.  To make  
progress, it is better to have theoretical claims supported by solid  
quantitative evidence, so that even if the interpretation of the data  
changes over time as new evidence becomes available – as is often the  
case in any field of science – the empirical pattern can be used as a  
basis for further theorizing.
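
As a quick check of the count above, here is a minimal Python sketch.  
The function name is only illustrative, and it assumes Python 3.8 or  
later (for math.comb).  It counts the ways of picking which 45 of 50  
claims are the correct ones:

```python
import math


def ways_to_choose_correct(total_claims: int, correct_claims: int) -> int:
    """Number of distinct subsets of `correct_claims` claims out of `total_claims`."""
    return math.comb(total_claims, correct_claims)


# Figures taken from the text: 50 empirical claims, 90% (45) assumed correct.
n_claims = 50
n_correct = 45

print(ways_to_choose_correct(n_claims, n_correct))  # 2118760
```

Choosing the 45 correct claims is equivalent to choosing the 5  
incorrect ones, so the same number results from 50 choose 5.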

Furthermore, it is no longer expensive to run behavioral experiments,  
at least in English and other widely spoken languages.  There now  
exists a marketplace interface – Amazon.com’s Mechanical Turk – which  
can be used for collecting behavioral data over the internet quickly  
and inexpensively.  The cost of using an interface like this is  
minimal, and the time that it takes for the results to be returned is  
short.  For example, currently on Mechanical Turk, a survey of  
approximately 50 items will be answered by 50 or more participants  
within a couple of hours, at a cost of approximately $1 per  
participant.  Thus a survey can be completed within a day, at a cost  
of roughly $50.  (The hard work of designing the experiment and  
constructing controlled materials remains, of course.)
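
The arithmetic behind that estimate can be sketched as follows.  This  
is only a back-of-the-envelope calculation: the function name is  
hypothetical, the figures are the approximate ones quoted above, and  
any platform fees are ignored.

```python
def survey_cost(participants: int, pay_per_participant: float) -> float:
    """Total participant payment in dollars (platform fees not included)."""
    return participants * pay_per_participant


# Approximate figures from the text: 50 participants at about $1 each.
print(survey_cost(50, 1.00))  # 50.0
```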

Sorry to be so verbose.  But I think that these methodological points  
are very important.

Best wishes,

Ted Gibson

Gibson, E. & Fedorenko, E. (In press). The need for quantitative  
methods in syntax and semantics research. Language and Cognitive  
Processes.
http://tedlab.mit.edu/tedlab_website/researchpapers/Gibson & Fedorenko InPress LCP.pdf

Gibson, E. & Fedorenko, E. (2010). Weak quantitative standards in  
linguistics research. Trends in Cognitive Sciences, 14, 233-234.
http://tedlab.mit.edu/tedlab_website/researchpapers/Gibson & Fedorenko 2010 TICS.pdf

> Dick,
>
> You raise an important issue here about methodology. I believe that  
> intuitions are a fine way to generate hypotheses and even to test  
> them - to a degree. But while it might not have been feasible for  
> Huddleston, Pullum, and the other contributors to the Cambridge  
> Grammar to conduct experiments on every point of the grammar,  
> experiments could have only made the grammar better. The use of  
> intuitions, corpora, and standard psycholinguistic experimentation  
> (indeed, Standard Social Science Methodology)  is vital for taking  
> the field forward and for providing the best support for different  
> analyses. Ted Gibson and Ev Fedorenko have written a very useful new  
> paper on this, showing serious shortcomings with intuitions as the  
> sole source of evidence, in their paper: "The need for quantitative  
> methods in syntax and semantics research".
>
> Carson Schutze and Wayne Cowart, among others, have also written  
> convincingly on this.
>
> It is one reason that a team from Stanford, MIT (Brain and Cognitive  
> Science), and researchers from Brazil are beginning a third round of  
> experimental work among the Pirahas, since my own work on the syntax  
> was, like almost every other field researcher's, based on native  
> speaker intuitions and corpora.
>
> The discussion of methodologies reminds me of the initial reactions  
> to Greenberg's work on classifying the languages of the Americas.  
> His methods were strongly (and justifiably) criticized. However, I  
> always thought that his methods were a great way of generating  
> hypotheses, so long as they were ultimately put to the test of  
> standard historical linguistics methods. And the same seems true for  
> use of native-speaker intuitions.
>
> -- Dan

>> We linguists can add a further layer of explanation to the  
>> judgements, but some judgements do seem to be more reliable than  
>> others. And if we have to wait for psycholinguistic evidence for  
>> every detailed analysis we make, our whole discipline will  
>> immediately grind to a halt. Like it or not, native speaker  
>> judgements are what put us linguists ahead of the rest in handling  
>> fine detail. Imagine writing the Cambridge Grammar of the English  
>> Language (or the OED) without using native speaker judgements.
>>
>> Best wishes,  Dick Hudson