[Corpora-List] Query on the use of Google for corpus research

Marco Baroni baroni at sslmit.unibo.it
Tue May 31 22:56:13 UTC 2005


> But then again, why not go simply to UPenn and purchase some
> license for English Gigaword plus some additional tens of millions
> words corpora from LDC?

For example, because I'm also interested in 1 billion words of Italian,
German and Japanese?  Or because I think that the web can give us a more
varied picture of a language than a newswire corpus? But more generally,
because I think that, with all the linguistic data available out there on
the web (probably orders of magnitude more data than the whole LDC and
ELDA catalogues put together), it is a good idea to develop, gather and
share tools and procedures to get that data into "corpus format"...

Of course, this does not mean that prefab corpora do not have their
uses as well.

Regards,

Marco


