[Corpora-List] Do we still need language corpora?

Angus B. Grieve-Smith grvsmth at panix.com
Fri Feb 4 04:38:50 UTC 2011


On 2/3/2011 11:43 AM, Martin Wynne wrote:
> Do we still need language corpora? Is the selection and preparation of 
> the carefully crafted corpus a waste of time and money these days when 
> large amounts of language data are freely available on the web?
> [...]
> Discussion of the proposition is also welcome on this list!
>

     Okay, I'll bite.  There are at least two possible interpretations 
of your question, and I'm not clear which one you meant.

1. Do we still need to package and distribute corpora (often for money, 
often with restricted web interfaces) when just about anyone can select 
and prepare their own corpus by downloading a bunch of documents from 
the Web?

     I would say only if we need some kind of compensation to motivate 
people to select and prepare corpora.  If there are other motivations 
that will give us quality corpora, then no.

2. Do we still need to select and prepare corpora, or can we just run a 
bunch of Google searches?

     I would say it depends whether you want to show existence or 
prevalence.

     If you just want to be able to say, "Someone once wrote, 'I used to 
could'," then you can Google it and you've got about 1,890 results.

     If you want to be able to say, "People born and raised in West 
Virginia use 'I used to could' more often than 'I used to be able to'," 
or that they use it more often than they once did, you need at least one 
representative sample of the language produced by the people of West 
Virginia.

     The ability to run a Google search on billions (trillions?) of 
documents doesn't give you a reliable sense of the distribution of 
constructions in the language, just as the ability to get thousands of 
responses to a survey posted on a website doesn't give you a reliable 
sense of the distribution of opinions in the population.
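
     To make the existence/prevalence distinction concrete, here is a 
minimal Python sketch.  It assumes you already have a sample of texts 
from the population you care about; the three sentences below are 
invented for illustration, and the function names are my own, not from 
any corpus toolkit.  A raw hit count only tells you a construction 
exists; a proportion needs a known sample to serve as the denominator.

```python
import re

def count_phrase(texts, phrase):
    """Count case-insensitive whole-word occurrences of phrase in texts."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(t)) for t in texts)

def share_of_variant(texts, variant, alternative):
    """Proportion of `variant` among all uses of either construction.

    Returns None when neither construction occurs in the sample.
    """
    a = count_phrase(texts, variant)
    b = count_phrase(texts, alternative)
    total = a + b
    return a / total if total else None

# Hypothetical toy sample; a real study would need a representative one.
sample = [
    "I used to could run a mile without stopping.",
    "I used to be able to see the river from here.",
    "We used to could get there in an hour.",
]

# Existence: at least one hit is enough.
exists = count_phrase(sample, "used to could") > 0

# Prevalence: only meaningful relative to the sample it was drawn from.
share = share_of_variant(sample, "used to could", "used to be able to")
```

     The number that comes out of share_of_variant is only as good as 
the sample going in, which is the whole point: a pile of search hits 
has no defined population behind it.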

     I hope that helps.

-- 
				-Angus B. Grieve-Smith
				Saint John's University
				grvsmth at panix.com


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora