[Corpora-List] Do we still need language corpora?
Angus B. Grieve-Smith
grvsmth at panix.com
Fri Feb 4 04:38:50 UTC 2011
On 2/3/2011 11:43 AM, Martin Wynne wrote:
> Do we still need language corpora? Is the selection and preparation of
> the carefully crafted corpus a waste of time and money these days when
> large amounts of language data are freely available on the web?
> [...]
> Discussion of the proposition is also welcome on this list!
>
Okay, I'll bite. There are at least two possible interpretations
of your question, and I'm not clear which one you meant.
1. Do we still need to package and distribute corpora (often for money,
often with restricted web interfaces) when just about anyone can select
and prepare their own corpus by downloading a bunch of documents from
the Web?
I would say only if we need some kind of compensation to motivate
people to select and prepare corpora. If there are other motivations
that will give us quality corpora, then no.
2. Do we still need to select and prepare corpora, or can we just run a
bunch of Google searches?
I would say it depends on whether you want to show existence or
prevalence.
If you just want to be able to say, "Someone once wrote, 'I used to
could'," then you can Google it and you've got about 1,890 results.
If you want to be able to say, "People born and raised in West
Virginia use 'I used to could' more often than 'I used to be able to',"
or more often than they used to, you need at least one representative
sample of the language produced by the people of West Virginia.
The ability to run a Google search on billions (trillions?) of
documents doesn't give you a reliable sense of the distribution of
constructions in the language, just as the ability to get thousands of
responses to a survey posted on a website doesn't give you a reliable
sense of the distribution of opinions in the population.
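To make the existence/prevalence distinction concrete, here is a minimal
sketch in Python. Everything in it is an assumption made for illustration:
the five sample sentences are invented, not drawn from any real corpus,
and the function names are hypothetical. The point is only that a single
hit settles existence, while a share is meaningful only relative to a
sample whose makeup you know.

```python
import re
from collections import Counter

# The two competing constructions from the example above.
VARIANTS = {
    "used to could": re.compile(r"\bused to could\b", re.IGNORECASE),
    "used to be able to": re.compile(r"\bused to be able to\b", re.IGNORECASE),
}


def exists(texts, phrase):
    """Existence: is there at least one attestation anywhere in the texts?"""
    pattern = VARIANTS[phrase]
    return any(pattern.search(text) for text in texts)


def variant_counts(texts):
    """Count every occurrence of each variant across the texts."""
    counts = Counter()
    for text in texts:
        for name, pattern in VARIANTS.items():
            counts[name] += len(pattern.findall(text))
    return counts


def share(counts, phrase):
    """Prevalence: this variant's share of all variant tokens in the sample."""
    total = sum(counts.values())
    return counts[phrase] / total if total else None


# Invented sentences standing in for a (hypothetical) representative sample.
sample = [
    "I used to could run a mile without stopping.",
    "She used to be able to see the river from here.",
    "We used to could, but not anymore.",
    "He used to be able to fix anything.",
    "They used to be able to walk to school.",
]

counts = variant_counts(sample)
print(exists(sample, "used to could"))   # existence claim
print(share(counts, "used to could"))    # prevalence claim, valid only for this sample
```

The share says something about West Virginia speech only if the texts
were sampled to represent it. Run over an arbitrary pile of web pages,
the same arithmetic still produces a number, but not one you can
generalize from.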
I hope that helps.
--
-Angus B. Grieve-Smith
Saint John's University
grvsmth at panix.com
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora