The Internet Corpora

James A. Crippen james at UnLambda.COM
Tue Apr 3 23:05:01 UTC 2001


Could the World Wide Web, and all the text in various languages available
on it, be considered a legitimate corpus?  Granted, there are a large
number of spelling mistakes (try searching for 'trasnlation' or
'lingiust') and a large number of obvious grammatical errors (search for
'if he love me', for example), but the text available online certainly
exceeds the size of any existing corpus for a major language by orders of
magnitude.

In the future, will we start seeing explicit WWW search results cited as
evidence in papers?  This could easily become a major point of argument...

'james

--
James A. Crippen <james at unlambda.com> ,-./-.  Anchorage, Alaska,
Lambda Unlimited: Recursion 'R' Us   |  |/  | USA, 61.2069 N, 149.766 W,
Y = \f.(\x.f(xx)) (\x.f(xx))         |  |\  | Earth, Sol System,
Y(F) = F(Y(F))                        \_,-_/  Milky Way.


