The Internet Corpora

Wed Apr 4 08:14:26 UTC 2001

>I imagine that using the web as a corpus would be about the
>same as using any other collection of texts as a corpus, in
>that one has to be sure to keep in mind the source of the
>data and various possible "impurities", such as inclusion of
>diverse language varieties (this could be good or bad), degree
>of editing (again good or bad, considering both typos and
>lack of spontaneousness), and lack of representativeness (what
>genres are really included on the web, and in what proportions?).
## But a lot of WWW material is produced in English by non-native speakers,
so it's a *very* mixed corpus compared with others.


Richard (= Dick) Hudson

Phonetics and Linguistics, University College London,
Gower Street, London WC1E 6BT.
+44(0)20 7679 3152; fax +44(0)20 7383 4108;
http://www.phon.ucl.ac.uk/home/dick/home.htm