[Corpora-List] Query on the use of Google for corpus research

Alexander Schutz goalscoringsuperstarhero at gmail.com
Tue May 31 22:38:24 UTC 2005


I see your point in everything you are saying, if you really 
(and desperately) want to compile this billion-word corpus 
from the web.
But then again, why not simply go to UPenn and purchase a 
license for English Gigaword, plus some additional corpora of tens
of millions of words, from the LDC? It's all nicely marked up and you
don't have to mess with all those crawling and postprocessing problems 
at all, not to mention storage.

Cheers,
Alex

On 5/31/05, Marco Baroni <baroni at sslmit.unibo.it> wrote:
> In my experience, adding and changing samples indefinitely until I have
> about 1 billion words of web-data with the characteristics I need turns
> out to be a pretty difficult thing to do... if you can suggest a procedure
> to do this in an easy way, I (and, I suspect, "most corpus linguists")
> would be very grateful.
>


