15.2343, Disc: Sum: Linguist 15.2332: Survey Results, Hibino

Fri Aug 20 02:20:09 UTC 2004

LINGUIST List:  Vol-15-2343. Thu Aug 19 2004. ISSN: 1068-4875.

Subject: 15.2343, Disc: Sum: Linguist 15.2332: Survey Results, Hibino

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Fox <fox at linguistlist.org>
==========================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
=================================Directory=================================

1)
Date:  Thu, 19 Aug 2004 09:52:06 -0400
From:  Mike Maxwell <maxwell at ldc.upenn.edu>
Subject:  Disc: Re: 15.2332, Sum: 'Who' & 'What' in Subject-verb Concord

-------------------------------- Message 1 -------------------------------

Date:  Thu, 19 Aug 2004 09:52:06 -0400
From:  Mike Maxwell <maxwell at ldc.upenn.edu>
Subject:  Disc: Re: 15.2332, Sum: 'Who' & 'What' in Subject-verb Concord

In some recent issues of Linguist List (most recently Linguist
15.2332), Hideo Hibino posted the results of a survey of agreement
done through LL.  I couldn't help but notice two of the responses:

> I (AmE)(No judgements given) Try using a large database of spoken
> and written English and find out how language is really used.

and

> ...(4) sounds less awful than the others. Go to some electronic
> corpora.  That is more reliable than people's judgements.

This is a common dispute, and there's a lot of water under this
particular bridge.  Nevertheless, I feel compelled to comment.

What you get when you look at corpora of _written_ language is, by
definition, how the written language is _used_.  Whether corpora of
_spoken_ English represent how the spoken language is used depends on
how they were transcribed; it is not uncommon to leave hesitations
untranscribed, for example.  It also depends on why the transcript was
made, how much time was invested, who did the transcription, etc.

But asking how language is _used_ is akin to asking how cars are used.
If you go to the junkyard, you'll find some of the ways cars are used.
That may not help you, though, if you want to know how cars work.  Or
you could look at how cars get into accidents--again, that may be a nice
way to find out about airbags, and maybe about how drunks drive, but it
may not be the most enlightening way to find out how cars work.

Similarly, if you want to know how language works, looking at corpora is
one way.  The problem is, it's a mixed bag.  You'll get dialect mixtures
that you can't always sort out, whereas a survey of the sort Hideo did
can give you that dialect information (and in Hideo's survey, it indeed
revealed an interesting pattern).  (Whether you can sort it out in a
corpus of course depends on how the corpus was collected.)

Written, and even more so spoken, corpora will also give you mistakes.
Sometimes that's exactly what you want: there are collections of speech
errors, for example, that presumably show something of the way the brain
processes language.  And of course there's the question of just what a
'mistake' is: is it just that the user would have, if given time, come
up with a better wording?  Or is it an attempt to conform to what they
remember their 5th grade English teacher taught them?  Or on the other
hand, is it a genuine error, which happened because the author's finger
slipped, or someone came into the office in the middle of a sentence, or
a speaker choked on their lunch, or they were distracted by music they
were listening to, or a later editor changed something they didn't like,
or...  Many of these genuine errors would be corrected if the speaker
were given a chance, and this is precisely what a survey (or other sorts
of introspective evidence) allows.

In sum, I would claim that there is room for corpus evidence, but there
is also plenty of room for introspection and surveys.  Asking which one
is more 'reliable' is like asking whether beef or oranges are 'better
food': it depends.

Mike Maxwell
Linguistic Data Consortium
maxwell at ldc.upenn.edu

---------------------------------------------------------------------------
LINGUIST List: Vol-15-2343