The Internet Corpora

Luis Casillas casillas at stanford.edu
Thu Apr 5 00:39:58 UTC 2001


On Tue, Apr 03, 2001 at 03:05:01PM -0800, James A. Crippen wrote:

> Could the World Wide Web and all the text in various languages available
> on it be considered legitimate forms of corpora?  Granted there are a
> large number of spelling mistakes (try searching for 'trasnlation' or
> 'lingiust'), and a large number of obvious grammatical errors (search for
> 'if he love me' for example), but the extant text available online
> certainly exceeds the size of any corpus for a major language by orders of
> magnitude.
>
> In the future, will we start seeing explicit WWW search results in
> papers?  This could easily become a major point of argument...

I think I've seen such things already at informal talks, but I don't
quite remember a particular instance.  I do know somebody who in his
talks frequently cites examples from email lists he's subscribed to.

In any case, I have done some minor experimentation with using Google
searches to explore some quite marginal possibilities of derivational
morphology, simply because the amount of text on the net is so large
that it's the only hope of getting the forms in question to show up.

The most useful thing I've gotten out of it is some suggestive
frequency results on a hypothesis of mine about a difficulty in
forming diminutives for a certain class of last names in Spanish (yes,
this is *very* arcane stuff).  The results were far from final, and
although they told me that my hypothesis is not crazy, I still need to
work with actual speakers.

Common problems:

  * Frequently, the options to limit search results to a language
    don't work perfectly.  I've had "English only" searches that
    return results mostly in French, and endless numbers of Italian
    results for Spanish searches; sometimes these can make a
    particular search useless.

  * Many texts appear at several URLs.  This inflates the number of
    matches for the form in question.  But what to make of the
    inflated numbers is something that must be determined on a
    case-by-case basis.  A limiting case I've run into is finding
    the form I want in the transcription of a chorus of a popular
    song, with popular as in "millions of people have sung this line".
    So, looked at one way, you have a very small net amount of text,
    but looked at another, you have millions of people with the form
    somewhere in their heads.

  * The last example points to the bigger problem: there's a lot of
    unpredictable case-by-case reasoning in interpreting a large
    proportion of the search results.  More than in conventional
    corpora, IMHO, given the uncontrolled nature of the net.
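
To make the duplicate-URL problem concrete, here is a rough sketch of
how one might report both the raw number of matching pages and the
number of distinct texts among them.  Everything in it is an
assumption for illustration: the function names, the (url, text) input
format, and the choice to treat two pages as "the same text" when they
are identical after whitespace and case normalization.  It does not
fetch anything from a search engine, and it would not catch
near-duplicates such as a lyric quoted inside different surrounding
pages.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase, so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_matches(pages, form):
    """Count pages containing `form` as a whole word.

    pages: iterable of (url, text) pairs (hypothetical input format).
    Returns (raw_hits, distinct_hits): raw_hits counts every matching URL,
    distinct_hits counts matching pages once per distinct normalized text.
    """
    pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
    raw_hits = 0
    seen_digests = set()
    for _url, text in pages:
        if not pattern.search(text):
            continue
        raw_hits += 1
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        seen_digests.add(digest)
    return raw_hits, len(seen_digests)


# Toy data, invented for the example: the second page mirrors the first.
pages = [
    ("http://example.org/a", "Ask a lingiust about it."),
    ("http://example.org/mirror/a", "Ask a  lingiust about it.\n"),
    ("http://example.net/b", "My friend is a lingiust."),
    ("http://example.net/c", "No match here."),
]
raw, distinct = count_matches(pages, "lingiust")
print(raw, distinct)
```

Reporting both numbers side by side leaves the case-by-case judgment
to the reader: the song-chorus case would show a large raw count and a
small distinct count, and which of the two matters depends on the
question being asked.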

One thing I haven't tried is to use web searches for syntactic
questions, though.

--
Luis Casillas
Department of Linguistics
Stanford University



More information about the HPSG-L mailing list