[Corpora-List] Chomsky and computational linguistics

Thu Jul 5 14:23:55 UTC 2007

On 7/4/07, Steve Finch <s.finch at daxtra.com> wrote:

    If your goal is to produce a theory of the structure of language
    in terms of the sort of theories of syntax that Chomsky pioneered
    and in the paradigm which he introduced ... I have seen very
    little evidence that the sort of study that goes on in corpus
    linguistics has very much insightful to add to that enterprise.

There are interesting arguments to make regarding the underlying
premise ("If your goal is..."), but there are also uninteresting,
interminable arguments and I'm not going to be the one to start us
down that path.

Instead, Steve, let me tackle your (entirely valid) skepticism head on
by observing that there's an emerging community trying to use the
tools of computational linguistics with corpora to enable linguistic
theoreticians to be more empirical in their approach without
abandoning their paradigm.  Some examples include Meurers
(http://www.ling.ohio-state.edu/~dm/papers/meurers-03.html,
http://ling.osu.edu/~dm/papers/meurers-mueller-07.pdf), Davies
(http://corpus.byu.edu/), Kilgarriff
(http://www.bultreebank.org/SProLaC/paper06.pdf), and me
(http://lse.umiacs.umd.edu/).

Work of this kind offers value to theoretical linguists because the
standard paradigm of inventing and judging examples can easily miss
relevant facts, and lead to false generalizations.  Whether or not you
believe strongly in the paradigm of inventing examples and eliciting
grammaticality judgments (see "interminable arguments", above), corpus
data can broaden your view of what phenomena you should be testing.  I
provide one illustration below.

It's worth noting that there is also real momentum building behind
"experimental syntax", a paradigm in which judgments about
grammaticality etc. are obtained using more formal, psychologically
based methods (C. Schuetze, Cowart, Bard, Sorace, Keller, Sprouse,
...).  I would argue that these empirical judgment-gathering methods
are only half of that story; the other half is improving the ways that
theoretical linguists choose sentences about which to elicit
judgments.

  Philip

----

Here's an illustration of how looking at corpus data can help
theoreticians do better linguistics, even within the traditional
paradigm, summarizing from Resnik, Elkiss, Lau, and Taylor, "The Web
in Theoretical Linguistics Research: Two Case Studies Using the
Linguist's Search Engine", Proc. Berkeley Linguistics Society (BLS),
2005, http://www.umiacs.umd.edu/~resnik/pubs/bls2005.pdf.

Resnik et al. discuss the case of a generalization by McCawley that
was accepted without challenge in the literature for 17 years,
possibly because nobody was equipped to look for a wider range of
examples, or more likely because it never occurred to anyone to do so.
To summarize briefly here, McCawley ("The comparative conditional
constructions in English, German and Chinese", Proc. BLS, 1988)
observed that comparative correlatives can occur with optional
deletion of a main copular verb in each clause, as in (1):

  (1)a  The better an advisor is, the more successful the student is
  (1)b  The better an advisor, the more successful the student

but he argued that this was only licit when the subject of the clause
is generic, rather than specific, as evidenced by (2):

  (2)a  The more obnoxious Fred is, the less attention you should pay to him
  (2)b  *The more obnoxious Fred, the less attention you should pay to him

This has theoretical import because the generic/specific distinction
is semantic, not syntactic (e.g. see Culicover and Jackendoff,
Linguistic Inquiry 30, 1999).  Trouble is, if you actually go look at
comparative correlative constructions in use, you find that McCawley's
generalization can't be taken at face value.  Heather Taylor, a
syntactician, searched for naturally occurring examples of comparative
correlatives on the Web using the Linguist's Search Engine, and she
found three interesting things.  First, while instances of copula
deletion did commonly have generic subjects, it was striking that
*every single instance* deleted the main copular verb in *both*
clauses.  This suggests that the unacceptability of (2b) may stem from
the fact that the copula cannot be deleted or retained in parallel
across its two clauses (the second clause of (2b) has no copula to
delete), rather than from the specific/generic distinction.  Second,
naturally occurring data
suggested that cases like (2b) improve when the specific subject has
greater phonological weight, e.g. (3):

  (3)  The more obnoxious Fred's younger brother, the less attention
       you should pay to him

And third, the Web data included examples in which the second clause
is introduced by "then", supporting a theoretical analysis of this
construction more in line with conditionals than with correlatives,
e.g. (4):

  (4)  The more pizza Romeo eats, then the fatter he gets

Looking at corpus data did not turn Taylor into a corpus linguist, but
it did give her new insights that informed her thinking and her
choices as she developed her work within the traditional theoretical
paradigm.  Whether or not you buy the details of Taylor's particular
theoretical argument (see the paper and her publications for details),
the point here is that looking at a large corpus -- made possible by
linguistically more sophisticated searching -- can turn up things that
theoretical linguists simply won't come up with when they invent
examples.
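
To make the idea of searching for a construction concrete, here is a
minimal sketch in Python.  It is NOT the Linguist's Search Engine or
its query language; it is a crude surface-string approximation, and
the names COMPARATIVE, PATTERN and find_comparative_correlatives are
hypothetical, invented for this illustration.  It assumes the corpus
has already been split into one sentence per string, and it simply
looks for "The <comparative> ..., (then) the <comparative> ...",
flagging the "then" variant seen in (4).

```python
import re

# Hypothetical surface pattern, assumed for illustration only.
# A "comparative" here is more/less/fewer or any word ending in -er,
# so words like "other" or "teacher" will produce false positives.
COMPARATIVE = r"(?:more|less|fewer|\w+er)"
PATTERN = re.compile(
    rf"^the\s+{COMPARATIVE}\b[^,]*,\s*(?P<then>then\s+)?the\s+{COMPARATIVE}\b",
    re.IGNORECASE,
)


def find_comparative_correlatives(sentences):
    """Return (sentence, has_then) pairs for sentences matching PATTERN."""
    hits = []
    for sentence in sentences:
        match = PATTERN.search(sentence.strip())
        if match:
            # has_then is True when the second clause starts with "then"
            hits.append((sentence, match.group("then") is not None))
    return hits


examples = [
    "The better an advisor, the more successful the student",
    "The more pizza Romeo eats, then the fatter he gets",
    "The student is successful.",
]
for sentence, has_then in find_comparative_correlatives(examples):
    print(has_then, sentence)
```

A string pattern like this will over- and under-generate badly, which
is exactly why searching on linguistic structure rather than on
strings matters; the sketch only shows what it means to go looking for
a construction instead of inventing instances of it.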

But here's the catch.  Searching corpora is more work, and the
standard methods are very entrenched, so we have a classic technology
adoption problem.  If we want to change the way theoretical
linguistics is done, we need to make it easier to do things
differently, just as WebExp has done for empirical elicitation from
human subjects (http://www.webexp.info/).  And we need more early
adopters like Heather Taylor to get good results, in order to convince
other theoreticians that it's worth doing.

  Philip Resnik, Associate Professor
  Department of Linguistics and Institute for Advanced Computer Studies
  1401 Marie Mount Hall            UMIACS phone: (301) 405-6760       
  University of Maryland           Linguistics phone: (301) 405-8903
  College Park, MD 20742 USA       Fax: (301) 314-2644 / (301) 405-7104
  http://umiacs.umd.edu/~resnik    E-mail: resnik at umiacs.umd.edu