[Corpora-List] Chomsky and computational linguistics

Michael Maxwell maxwell at umiacs.umd.edu
Wed Jul 11 16:35:52 UTC 2007


This will just be a quick reply; I can't do more now, since I'm leaving for
a workshop tomorrow, so it will probably be my only reply (I hear that sigh
of relief :-)).

Ramesh Krishnamurthy wrote:
>>>>Work of this kind offers value to theoretical linguists because the
>>>>standard paradigm of inventing and judging examples can easily miss
>>>>relevant facts, and lead to false generalizations.
>
> ...Why continue to invent examples when large corpora provide
> many examples of most events/features? Is judging not subject to
> idiolectal bias?

There are many reasons invented examples might be better than the ones you
find in corpora, e.g. they can be more concise, or more to the point.  And
I don't feel like there's necessarily anything wrong with invented
examples, since I'm just as good a corpus creator as the next person. 
(One problem, which of course does sometimes happen, is when the invented
example or its interpretation is borderline acceptable, and I convince
myself that it's good because it supports my theory.  And of course nearly
any example sounds bad if you repeat it to yourself long enough.  Or good,
depending on your mood :-).)

I wrote:
>>There is a danger in corpus linguistics of conflating dialects...

To which Ramesh Krishnamurthy replied:
> Why is 'conflating dialects' necessarily a danger? Perhaps we want to
> find the common patterns across dialects?

Or perhaps we don't; it all depends on the purpose of the example.  Some
dialects have unusual constructions that throw light on proposed
generalizations.  An example is the dialects of English that have
subjectless for-to constructions, e.g. "I want for to go."  There have
been several papers on this construction in this dialect, and IIRC, all
the work was done on the basis of constructed examples, because there is
very little corpus data on it.

Another example, which I would like to investigate some day, is the use of
non-reflexive indirect object (usually benefactive) pronouns in certain
English dialects, like "I'm gonna get me a new truck."  While that
doubtless shows up in some corpora somewhere (since there is so much
English text on-line these days), if I did rely on corpora to study this,
a general corpus search would probably make it look as if this were in free
variation with the reflexive ("I'm gonna get myself a new truck").  In
fact, I believe it is not in variation at all (or only in register
variation), since in certain dialects *only* the non-reflexive form is
found.  (I could be wrong, though--I'm not a native speaker of this
dialect, only an admirer.)
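
To make the pooling point concrete, here is a minimal sketch in Python.
The counts are invented purely for illustration (they are not from any
real corpus), and the dialect labels and function names are hypothetical.
It only shows how forms that are categorical within each dialect can look
like free variation once the dialects are merged:

```python
from collections import Counter

# Invented toy counts, NOT real corpus data: (dialect, form) -> occurrences
# of "get me a ..." vs. "get myself a ..." in two hypothetical dialects.
toy_counts = {
    ("dialect_A", "me"): 40,
    ("dialect_A", "myself"): 0,
    ("dialect_B", "me"): 0,
    ("dialect_B", "myself"): 60,
}

def pooled(counts):
    """Sum counts per form, ignoring which dialect each token came from."""
    total = Counter()
    for (_dialect, form), n in counts.items():
        total[form] += n
    return dict(total)

def by_dialect(counts):
    """Keep the counts separate for each dialect."""
    out = {}
    for (dialect, form), n in counts.items():
        out.setdefault(dialect, {})[form] = n
    return out

# Pooled, both forms occur, which looks like variation.
print(pooled(toy_counts))
# Split by dialect, each dialect uses only one form.
print(by_dialect(toy_counts))
```

Of course, this presupposes that each text in the corpus is labelled for
dialect, which is exactly what a general web-collected corpus usually lacks.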

> Isn't there a danger of prioritizing idiolects in "the standard
> paradigm of inventing and judging examples"?

It's a danger or a benefit, depending on what you're trying to do.  This
is not about describing usage, much less "proper" usage; this is about
making theoretical points.  If for example your theory says that
non-reflexive indirect object pronouns cannot be coreferential with the
subject, then the non-reflexive benefactives may be a problem, regardless
of whether they are considered a non-standard usage.

> Is it impossible for a corpus compiler/user to be able to de-select
> non-native-speaker texts if these were
> regarded as undesirable? Such texts are very useful for other types
> of language research.

Right on both counts.  The problem is being able to reliably distinguish
the non-native speaker texts, for example if you collect your corpus from
the web.  Whereas if I make up my own examples, I know that they're all
from a native speaker of (a particular dialect of) English.

I wrote:
>>And crucial constructions may simply be so rare that you just can't
>>find them in the available corpus.

To which Ramesh Krishnamurthy replied:
> How can extremely rare events be evaluated as 'crucial'?
> Crucial for what or to whom?

Crucial for theoretical linguistics.  If they are rare but reproducible
(in the sense that a number of native speakers can agree on their
acceptability), then they potentially show something about how we acquire
language.  The argument is that if they're rare, then we're unlikely to
have learned them (unlike the case with irregular verbs, say, which are
largely learned by rote--and this argument has to be justified more than I
can do here).  So if we all (or some significant population) agree that
they're acceptable, but we couldn't have learned them, then they must
somehow be a result of what we have learned, and that may throw
interesting light on how and what we learn when we learn language.  (The
paradigm case of this, of course, is that of parasitic gaps.)

   Mike Maxwell
   CASL


