The Internet Corpora

Wed Apr 4 22:05:31 UTC 2001

There's a group that specializes in digital collections, especially of the
internet, called The Internet Archive (see www.archive.org). It solicits
research projects that want to use the "library"; I don't think any
linguists have taken advantage of it yet.  Having a stable collection all in
one place has several advantages. Most search engines cover only 15-40% of
the current net, whereas this collection is more comprehensive and goes back
several years.  Because it is all in one place, you can also search it
directly instead of relying on a commercial search engine's results.

________________
John Mark Agosta       Software Engineer
     408 982-2000      Edify, division of S1 Corp.
mob) 650 465-4707      2840 San Tomas Expy.
did) 408 486-1711      Santa Clara, CA 95051
jagosta at edify.com

-----Original Message-----
From: Detmar Meurers [mailto:dm at sfs.nphil.uni-tuebingen.de]
Sent: Wednesday, April 04, 2001 9:59 AM
To: hpsg-l at lists.Stanford.EDU
Subject: Re: The Internet Corpora

> Hopefully the increased availability of corpora will eventually lead
> to a shift in argument paradigms away from the +-grammatical to
> (statistically measured) typicality or strangeness.
>
> Personally I'd rather trust the web as a corpus than the linguistic
> intuitions of 20 linguists. My guess is that the picture will also
> be much clearer when using the web ... ;)
>
> So, Tibor, if you really want to know, run your options through a
> search engine (make sure it doesn't normalize/tokenize/stem), count
> the examples and see if any of them are so much in the minority that
> we can throw them out as irrelevant without feeling too bad about it.
>
> Cheers - Uli

It is this kind of argumentation that has disqualified the use of corpora
in theoretical linguistic circles for quite some time now - and there
were good reasons for that. Clearly there is a difference between a
speech or typing error and a rare occurrence of a construction. And
not finding an occurrence of a construction in a corpus has very
little to say about whether that construction is in principle possible
- and as theoretical linguists we are interested in that general
question and not in counting beans in accidental language use.

But the availability of electronic corpora, annotation and search
tools does provide fascinating opportunities for theoretical
linguistics totally unrelated to bean counting: It can give access to
theoretically relevant linguistic data under a wide range of known and
unknown morphological, syntactic and semantic parameters, including
information on the increasingly relevant notion of context.

In addition to the enormous benefit of having a richer data set to
build theories on (and validate old ones), this also helps empirically
ground theoretical linguistics in a second way: In order to find
relevant examples in a corpus, one is forced to translate the
linguistic terminology used to describe constructions of theoretical
interest to linguistic properties which can actually be observed and
thus be part of the annotation of a corpus. For some discussion of
this topic and concrete examples, check out the slides at
http://www.ling.ohio-state.edu/~dm/slides/osu-corpus-talk.pdf or .ps
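
To make the point about observable properties concrete, here is a minimal
sketch, not taken from the slides. It assumes a corpus that is already
annotated with part-of-speech tags and searches it for a sequence of tags
rather than for a theoretical label such as "passive". The tag names, the
tiny sample corpus and the function name find_tag_sequence are all
hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag)

# Tiny hand-made sample corpus. The tag set is an assumption made for
# this illustration and does not follow any particular annotation scheme.
CORPUS: List[List[Token]] = [
    [("the", "DET"), ("dog", "NOUN"), ("was", "AUX"),
     ("seen", "VPART"), ("by", "PREP"), ("Kim", "PROPN")],
    [("Kim", "PROPN"), ("saw", "VERB"), ("the", "DET"), ("dog", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("was", "AUX"), ("read", "VPART")],
]


def find_tag_sequence(
    corpus: List[List[Token]], pattern: Sequence[str]
) -> List[Tuple[int, List[str]]]:
    """Return (sentence index, matched words) for every place where the
    tags of consecutive tokens equal the given tag pattern."""
    wanted = list(pattern)
    hits: List[Tuple[int, List[str]]] = []
    for sent_idx, sentence in enumerate(corpus):
        tags = [tag for _, tag in sentence]
        # Slide a window of the pattern's length over the sentence.
        for start in range(len(tags) - len(wanted) + 1):
            if tags[start:start + len(wanted)] == wanted:
                words = [w for w, _ in sentence[start:start + len(wanted)]]
                hits.append((sent_idx, words))
    return hits


if __name__ == "__main__":
    # "Auxiliary followed by participle" is an observable stand-in for
    # one kind of passive; it is a rough proxy, not a definition.
    for sent_idx, words in find_tag_sequence(CORPUS, ["AUX", "VPART"]):
        print(sent_idx, " ".join(words))
```

The query returns candidate examples for inspection, not a count to argue
from: whether a hit really instantiates the construction of interest still
has to be decided by the linguist.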

Lieben Gruss,
Detmar

--
Detmar Meurers                              Fax: Int + 614 292-8833
The Ohio State University                   Tel: Int + 641 292-0461
Department of Linguistics            E-Mail: dm at ling.ohio-state.edu
1712 Neil Avenue, Oxley Hall    Homepage:
Columbus OH 43210-1298, USA     http://www.ling.ohio-state.edu/~dm/