[Lingtyp] spectrograms in linguistic description and for language comparison

Sun Dec 11 03:00:05 UTC 2022

I think Randy is wrong (sorry if this comes across as blunt) and so I
am writing, on a Saturday night no less, to voice a different view.

Working inductively from a corpus is great, but no corpus is ever
going to be large enough to fully represent a given language's
grammatical possibilities. If we limit ourselves to working
inductively from corpora then many basic questions about the languages
we research will go unanswered. From a corpus of natural data we
simply cannot know whether a given pattern is missing because the
corpus is finite (i.e., it's just a statistical accident that the
pattern isn't attested) or whether there's a genuine reason why the
pattern is not showing up (i.e., its non-attestation is principled).
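
To put rough numbers on the "statistical accident" scenario, here is a
minimal sketch in Python. It assumes, purely for illustration, that a
pattern occurs independently in each clause at a fixed rate. The rate,
the corpus size, and the function names are all hypothetical and are
not drawn from any Tuparí data.

```python
import math


def prob_unattested(rate_per_clause: float, corpus_clauses: int) -> float:
    """Chance that a grammatical pattern never shows up in the corpus.

    Assumption (illustrative only): each clause independently contains
    the pattern with probability `rate_per_clause`.
    """
    return (1.0 - rate_per_clause) ** corpus_clauses


def clauses_needed(rate_per_clause: float, confidence: float = 0.95) -> int:
    """Smallest corpus size at which the pattern is attested at least
    once with the given probability, under the same assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate_per_clause))


if __name__ == "__main__":
    rate = 1e-5      # hypothetical: one token per 100,000 clauses
    corpus = 100_000  # hypothetical corpus size, in clauses
    print(f"P(no tokens in corpus) = {prob_unattested(rate, corpus):.3f}")
    print(f"Clauses needed for 95% chance of one token: {clauses_needed(rate)}")
```

Under these assumptions a perfectly grammatical but rare pattern has a
better than one-in-three chance of being absent from the corpus, so
absence alone cannot distinguish an accidental gap from a principled
one.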

When I am writing up my research on Tuparí I always prioritize
non-elicited data (texts, in-person conversation, WhatsApp chats). But
interpreting and analyzing the non-elicited data requires making
reference to acceptability judgments. The prefix (e)tareman- is a
negative polarity item, and it always co-occurs with (and inside the
scope of) a negator morpheme. But the only way I can make this point
is by showing that speakers invariably reject tokens of (e)tareman-
without a licensing negator. Those rejected examples are by definition
not going to be present in any corpus of naturalistic speech, but they
tell me something crucial about what the structure of Tuparí does and
does not allow. If I limit myself to inductively working from a
corpus, fundamental facts about the prefix (e)tareman- and about
negation in Tuparí more broadly will be missed.

A lot of recent scholarship has made major strides towards improving
the methodology of collecting and interpreting acceptability
judgments. The formal semanticists who work on understudied languages
(here I am thinking of Judith Tonhauser, Lisa Matthewson, Ryan
Bochnak, Amy Rose Deal, Scott AnderBois) are extremely careful about
teasing apart utterances that are rejected because of some
morphosyntactic ill-formedness (i.e., ungrammaticality) versus ones
that are rejected because of semantic or pragmatic oddity. The
important point is that such teasing apart can be done, and the
descriptions and analyses that result from this work are richer than
what would result from a methodology that relies on corpus examination
alone or on elicitation alone.

One more example from Tuparí: this language has an obligatory
witnessed/non-witnessed evidential distinction, but the deictic
orientation of the distinction (to the speaker or to the addressee) is
determined via clause type. There is a nuanced set of interactions
between the evidential morphology and the clause-typing morphology,
and it would have been impossible for me to figure out the basics of
those interactions without relying primarily on conversational data
and discourse context. But I still needed to get some acceptability
judgments to ensure that the picture I'd arrived at wasn't overly
biased by the limitations of my corpus. Finding speakers who were
willing to work with me on those judgments wasn't always easy; a fair
amount of metalinguistic awareness was needed. But it was worth it!
The generalizations that I was able to publish were much more solid
than if I had worked exclusively from corpus data. And the methodology
I learned from the Tonhauser/Matthewson/etc. crowd was fundamental to
this work.

The call to work inductively from corpora would have the practical
effect of making certain topics totally inaccessible for research
(control vs. raising structures, pied-piping, islands, gaps in
inflectional paradigms, etc.) even though large-scale acceptability
tasks have shown that these phenomena are "real," i.e., they're not
just in the minds of linguists who are using introspection. Randy's
point that "no other science allows the scientist to make up his or
her own data, and so this is something linguists should give up" is a
straw man argument now that many experimentalist syntacticians use
large-scale acceptability judgments on platforms like Mechanical Turk
to get at speakers' judgments. I think we do a disservice to our
students and to junior scholars if we tell them that the only real
stuff to be studied will be in the corpora that we assemble. Even the
best corpora are finite, whereas L1 speakers' knowledge of their
language is infinitely productive.

— Adam