Corpora: Evidence and intuition

Ivo Sanchez isanchez at conversay.com
Thu Nov 1 18:50:14 UTC 2001


Hi David and Patrick,
I just want to contribute my two cents from a functional discourse
perspective:
 In my experience, the most interesting linguistic phenomena I have
encountered in languages for which I have native or pseudo-native
intuitions, such as English or Spanish, I have always found in corpora (and
usually while I was looking for something else). I have in mind things like
the discourse function of verbs, the paths of grammaticalization of a
complex preposition, degrees of co-lexicalization of constructions etc. I
cannot imagine getting at this type of phenomena using introspection. And so
often, having hypothesized about something where no corpus was available, I
find later that I was wrong, i.e. my intuitions were wrong.
 If we use intuitions without checking them against real data, how can we
ever be sure that those are right? I believe that the experience of
disagreeing with the "?'s" and the "*'s" of examples in formal linguistic
studies is quite common.
I also want to remind you of 'linguistic ideology': how I imagine my native
language to be, based on many ideas that I learned in school, is not
necessarily how my native language is.
In general I still have a hard time understanding how a science can
hypothesize about its object of study without looking at that object
(naturally occurring data).
 
   Ivo Sánchez
Language Development. Syntax-Lexicon
Phone 425.636.0706 x221
http://www.conversay.com

-----Original Message-----
From: David Wible [mailto:dwible at mail.tku.edu.tw]
Sent: Wednesday, October 31, 2001 8:59 PM
To: Patrick Hanks; corpora at hd.uib.no
Cc: CPA
Subject: Re: Corpora: Evidence and intuition


Patrick mentions theoretical linguists using the doubts about the
representativeness of corpora to cast doubt on results or conclusions drawn
by linguists using corpora.  In my experience, the criticisms are more
likely of almost the opposite sort: that is, those inclined to criticize
corpus research suggest that some of the most interesting phenomena in a
speaker's knowledge are not the things that s/he often hears examples of
(things that would presumably then occur frequently in corpora), but the
reverse: strong intuitions about uses s/he has perhaps never encountered.
Isn't it these sorts of data and not those that are amply represented in
'representative' corpora that make us ask: how did they ever come to know
that?

David Wible


----- Original Message -----
From: "Patrick Hanks" <patrick at lingomotors.com>
To: <corpora at hd.uib.no>
Cc: "CPA" <CPA at lingomotors.com>
Sent: Thursday, November 01, 2001 7:32 AM
Subject: Corpora: Evidence and intuition



A late contribution to the discussion sparked by Sebastian Hoffmann:

I recently asked a few colleagues who are not corpus linguists to make up
a couple of natural sentences using the word "total" as a verb. The answers
typically fell into two classes:

1. [[Driver]] total [[Vehicle]]
   e.g. Carina totaled the car.

2. [[Person]] total [[Number]]
   e.g. John totaled the column of figures.

In the British and American corpora that we are currently using (in
particular BNC, Reuters, and 4 years of AP), sense 1 accounts for less than
1% of uses of the verb and sense 2 is even rarer - perfectly plausible, but
next to non-existent.

Over 98% of corpus uses of this verb fall into the following pattern:

3. [[Entity (often plural)]] total [[Number | Amount]]
   e.g. Sales totaled 6 million.
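
The percentages above are shares of hand-classified corpus uses. As a minimal sketch of that tally, assuming each concordance line for the verb has already been labelled with a pattern by an analyst: the function name, the label names, and the toy data below are all hypothetical and are not the BNC, Reuters, or AP figures.

```python
from collections import Counter


def pattern_shares(labels):
    """Return each pattern's share of all labelled uses, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: 100.0 * n / total for pattern, n in counts.items()}


# Invented toy labels for illustration only; one label per concordance line.
toy_labels = (
    ["entity_totals_amount"] * 8
    + ["driver_totals_vehicle"]
    + ["person_totals_number"]
)

# Print patterns from most to least frequent.
shares = pattern_shares(toy_labels)
for pattern, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{pattern}: {share:.1f}%")
```

Comparing such shares with the patterns that informants produce first is one way to make the gap between elicited and attested usage explicit.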

Why did this *very* common pattern of use not spring immediately to the
minds of ordinary native speakers of British or American English?
Hypotheses include:

a) Introspection as a technique favors human subject roles.
b) 3 is really a copula, "not a real verb".
c) There is an inverse relationship between cognitive salience and
    social salience.

Re 3, see (Hanks 1990), where I argued that people register the odd or
unusual and fail to register what we do regularly or continuously.  (Think
of someone putting his/her hand on your arm.  Now think of someone having
had his/her hand on your arm all afternoon.)

Whatever the reason, the phenomenon is a familiar one in lexical analysis,
first noticed by Cobuilders working on the Cobuild 7.3 million word corpus
in about 1983. Of course, 'total' is a fairly dramatic example, but other
less dramatic cases abound, e.g. the "delexical verbs" (known in America as
"light verbs").  Ask people to make up examples for common uses of "take"
and very few of them will think of [[Duration]]:

4. How long will it take?

5. It only took a few minutes.

Interestingly, the phenomenon is occasionally denied by some theoretical
linguists and other intelligent people, corpus evidence to the contrary
notwithstanding. The opening shot is usually "Your corpus is not
representative" (?!).  Why do they do this?  Surely it cannot be as simple
as wishing to preserve introspection as a research technique?


Patrick


