The Internet Corpora
Dick Hudson
dick at linguistics.ucl.ac.uk
Wed Apr 4 08:14:26 UTC 2001
>I imagine that using the web as a corpus would be about the
>same as using any other collection of texts as a corpus, in
>that one has to be sure to keep in mind the source of the
>data and various possible "impurities", such as inclusion of
>diverse language varieties (this could be good or bad), degree
>of editing (again good or bad, considering both typos and
>lack of spontaneousness), and lack of representativeness (what
>genres are really included on the web, and in what proportions?).
## But a lot of WWW material is produced in English by non-native speakers,
so it's a *very* mixed corpus compared with others.
Richard (= Dick) Hudson
Phonetics and Linguistics, University College London,
Gower Street, London WC1E 6BT.
+44(0)20 7679 3152; fax +44(0)20 7383 4108;
http://www.phon.ucl.ac.uk/home/dick/home.htm
More information about the HPSG-L mailing list