The Internet Corpora
Ulrich Germann
germann at ISI.EDU
Wed Apr 4 01:32:33 UTC 2001
James, the point is to use corpora correctly.
You cannot apply the rule "once seen, so it 'can be said'". You have
to take a statistical approach and, as John Sinclair once put
it in the title of a paper, "first throw away your evidence"
(in: G. Leitner (Ed.), The English Reference Grammar: Language and
Linguistics, Writers and Readers. Tuebingen 1986).
Let's put the first of your examples, 'linguist/lingiust', to the test
with google.com:
linguist: ca. 191,000 hits
lingiust: ca. 9 hits
No typo here, these are the figures: nine for lingiust,
one hundred and ninety-one thousand for linguist.
Translation/trasnlation: four million one hundred and thirty thousand
versus one hundred and ninety-one.
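To make the comparison concrete, here is a minimal sketch in Python that turns the hit counts quoted above into the misspelling's share of all hits for each pair. The figures are copied by hand from this message; the function name variant_share is only illustrative, and no search engine is queried.

```python
# Hit counts as quoted in this message (google.com, April 2001).
# Each entry maps (standard form, variant) to (standard hits, variant hits).
counts = {
    ("linguist", "lingiust"): (191_000, 9),
    ("translation", "trasnlation"): (4_130_000, 191),
}


def variant_share(standard_hits, variant_hits):
    """Return the variant's share of all hits for the pair (0.0 to 1.0)."""
    return variant_hits / (standard_hits + variant_hits)


shares = {pair: variant_share(*hits) for pair, hits in counts.items()}

for (standard, variant), share in shares.items():
    # Report the share per million hits so the tiny fractions stay readable.
    print(f"{variant}: about {share * 1_000_000:.0f} per million "
          f"hits for {standard}/{variant}")
```

With these figures, both misspellings account for a little under five in every 100,000 hits for their pair.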
The "if he love(s) me" case is very tricky indeed, as broken hearts tend
to affect people's typing and spelling skills on emotional
support bulletin boards and soul advice columns (237/1440). But it
also tells us something: when talking about linguists, writers seem
to be more focussed and concentrated than when discussing their love
life.
Hopefully the increased availability of corpora will eventually lead
to a shift in argument paradigms away from the +/-grammatical to
(statistically measured) typicality or strangeness.
Personally I'd rather trust the web as a corpus than the linguistic
intuitions of 20 linguists. My guess is that the picture will also
be much clearer when using the web ... ;)
So, Tibor, if you really want to know, run your options through a
search engine (make sure it doesn't normalize/tokenize/stem), count
the examples and see if any of them are so much in the minority that
we can throw them out as irrelevant without feeling too bad about it.
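The counting step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the hit counts are typed in by hand after running each option through a search engine, the function name split_by_share is made up for this example, and the 1% cut-off is an arbitrary choice, not a recommended value.

```python
def split_by_share(hit_counts, min_share=0.01):
    """Split competing variants into kept and dropped by share of total hits.

    hit_counts maps each variant to its hit count, collected by hand.
    min_share is an assumed cut-off (1% here); choose your own.
    """
    total = sum(hit_counts.values())
    if total == 0:
        raise ValueError("no hits for any variant")
    kept, dropped = {}, {}
    for variant, hits in hit_counts.items():
        share = hits / total
        # Variants below the cut-off are the ones "in the minority".
        (kept if share >= min_share else dropped)[variant] = share
    return kept, dropped


# Example with the figures quoted earlier in this message.
kept, dropped = split_by_share({"linguist": 191_000, "lingiust": 9})
print("keep:", sorted(kept))
print("drop:", sorted(dropped))
```

Where the minority form is not vanishingly rare, the cut-off matters a great deal, so the result should be read as a measure of typicality rather than a verdict on grammaticality.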
Cheers - Uli
"James A. Crippen" wrote:
>
> Could the World Wide Web and all the text in various languages available
> on it be considered legitimate forms of corpora? Granted there are a
> large number of spelling mistakes (try searching for 'trasnlation' or
> 'lingiust'), and a large number of obvious grammatical errors (search for
> 'if he love me' for example), but the extant text available online
> certainly exceeds the size of any corpus for a major language by orders of
> magnitude.
>
> In the future, will we start seeing explicit WWW search results in
> papers? This could easily become a major point of argument...
>
> 'james
>
> --
> James A. Crippen <james at unlambda.com> ,-./-. Anchorage, Alaska,
> Lambda Unlimited: Recursion 'R' Us | |/ | USA, 61.2069 N, 149.766 W,
> Y = \f.(\x.f(xx)) (\x.f(xx)) | |\ | Earth, Sol System,
> Y(F) = F(Y(F)) \_,-_/ Milky Way.
--
======================================================================
Ulrich Germann Tel. (310) 448-8430
USC Information Sciences Institute Fax. (310) 822 0751
4676 Admiralty Way, Suite 1001 email:germann at isi.edu
Marina del Rey, CA 90292 http://www.isi.edu/~germann
More information about the HPSG-L mailing list