[Corpora-List] Deviations in language models on the web

Miles Osborne miles at inf.ed.ac.uk
Tue Nov 16 11:44:00 UTC 2010


Building LMs from Web data is actually a very tricky business.  In my
experience it is next to impossible to get anything useful out of it
(and by "useful" I mean improvements for machine translation).  The
largest Web LM I have built is a five-gram model over 144 billion
tokens (taken from the CLUEWEB release), after de-duplicating the data
and removing spam from it.
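
For anyone who wants to play with this at small scale, a toy Python
sketch of those two steps (de-duplication, then n-gram counting) is
below.  To be clear, the function names are mine and the exact-match
hashing is a simplification: at 144 billion tokens you need distributed
counting and near-duplicate detection (e.g. shingling/MinHash), not an
in-memory Counter.

import hashlib
from collections import Counter

def dedup_lines(lines):
    # Drop exact duplicates by hashing normalised lines.  Web corpora
    # are full of boilerplate repeated across pages; real pipelines
    # also need *near*-duplicate detection, which this sketch skips.
    seen = set()
    for line in lines:
        h = hashlib.md5(line.strip().lower().encode("utf-8")).digest()
        if h not in seen:
            seen.add(h)
            yield line

def count_ngrams(lines, n=5):
    # Accumulate n-gram counts in memory.  At web scale this step is
    # distributed (e.g. MapReduce) and the counts are heavily pruned.
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = [
    "the cat sat on the mat today",
    "the cat sat on the mat today",   # duplicate page text: dropped
    "the dog sat on the mat today",
]
for gram, c in count_ngrams(dedup_lines(corpus)).most_common():
    print(" ".join(gram), c)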

Now, the question of why Web data (e.g. the Google Ngram release, my
CLUEWEB data) doesn't seem to help on these tasks is interesting.  I
would guess it is a mixture of three things: the raw frequencies in
Web data are not like those of any real edited text (which stands in
contradiction to early claims of "the Web as corpus"); even 144
billion tokens may not be enough volume to compensate for the massive
amount of garbage; and Web data may simply be too out-of-domain for
typical translation tasks.
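
A cheap way to make the first point concrete is to compare smoothed
frequency distributions from the two sources.  The toy unigram KL
sketch below is the idea in miniature only; the two "corpora" are
made-up sentences, and a real diagnostic would measure held-out
perplexity with full n-gram models rather than unigrams.

import math
from collections import Counter

def unigram_dist(text, vocab):
    # Add-one smoothed unigram distribution over a shared vocabulary,
    # so the KL divergence below stays finite.
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl(p, q):
    # KL(p || q) in nats: how badly a model fit to q describes data
    # drawn from p.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

edited = "the system produced the translation of the sentence"
web = "click here buy now free download click here best price"
vocab = set(edited.split()) | set(web.split())
p = unigram_dist(edited, vocab)
q = unigram_dist(web, vocab)
print("KL(edited || web) = %.3f nats" % kl(p, q))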

Good luck

Miles
-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



