The Internet Corpora

Detmar Meurers dm at sfs.nphil.uni-tuebingen.de
Wed Apr 4 16:59:10 UTC 2001


> Hopefully the increased availability of corpora will eventually lead
> to a shift in argument paradigms away from the +-grammatical to
> (statistically measured) typicality or strangeness.
>
> Personally I'd rather trust the web as a corpus than the linguistic
> intuitions of 20 linguists. My guess is that the picture will also
> be much clearer when using the web ... ;)
>
> So, Tibor, if you really want to know, run your options through a
> search engine (make sure it doesn't normalize/tokenize/stem), count
> the examples and see if any of them are so much in the minority that
> we can throw them out as irrelevant without feeling too bad about it.
>
> Cheers - Uli

It is this kind of argumentation that has disqualified the use of
corpora in theoretical linguistic circles for quite some time now - and
there were good reasons for that. Clearly there is a difference between
a speech or typing error and a rare occurrence of a construction. And
not finding an occurrence of a construction in a corpus says very
little about whether that construction is in principle possible
- and as theoretical linguists we are interested in that general
question and not in counting beans in accidental language use.

But the availability of electronic corpora, annotation and search
tools does provide fascinating opportunities for theoretical
linguistics totally unrelated to bean counting: It can give access to
theoretically relevant linguistic data under a wide range of known and
unknown morphological, syntactic and semantic parameters, including
information on the increasingly relevant notion of context.

In addition to the enormous benefit of having a richer data set to
build theories on (and validate old ones), this also helps empirically
ground theoretical linguistics in a second way: In order to find
relevant examples in a corpus, one is forced to translate the
linguistic terminology used to describe constructions of theoretical
interest into linguistic properties which can actually be observed and
thus be part of the annotation of a corpus. For some discussion of
this topic and concrete examples, check out the slides at
http://www.ling.ohio-state.edu/~dm/slides/osu-corpus-talk.pdf or .ps
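
To make the translation step a bit more tangible, here is a minimal
sketch, not taken from the slides. It assumes a toy corpus annotated
with part-of-speech tags; the tag names, the corpus and the function
find_tag_sequence are all hypothetical and only serve as illustration.
A descriptive notion such as "verb directly followed by an adverb" has
to be restated as a sequence of observable tags before it can be
searched for.

```python
# Hypothetical toy corpus: each sentence is a list of (token, POS tag) pairs.
# The tag names are illustrative assumptions, not a real tagset.
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")],
    [("a", "DET"), ("cat", "NOUN"), ("slept", "VERB"), ("quietly", "ADV")],
]


def find_tag_sequence(corpus, pattern):
    """Return (sentence index, start position) for every place where the
    sequence of POS tags in `pattern` occurs contiguously in a sentence."""
    hits = []
    n = len(pattern)
    for s_idx, sentence in enumerate(corpus):
        tags = [tag for _, tag in sentence]
        for start in range(len(tags) - n + 1):
            if tags[start:start + n] == list(pattern):
                hits.append((s_idx, start))
    return hits


# "Verb directly followed by an adverb", restated as an observable tag sequence.
matches = find_tag_sequence(CORPUS, ("VERB", "ADV"))
```

Note that an empty result for some pattern would only mean that the
annotated sample contains no match, not that the construction is
impossible.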

Kind regards,
Detmar

--
Detmar Meurers                              Fax: Int + 614 292-8833
The Ohio State University                   Tel: Int + 641 292-0461
Department of Linguistics            E-Mail: dm at ling.ohio-state.edu
1712 Neil Avenue, Oxley Hall    Homepage:
Columbus OH 43210-1298, USA     http://www.ling.ohio-state.edu/~dm/



More information about the HPSG-L mailing list