[Corpora-List] Deviations in language models on the web

Serge Sharoff s.sharoff at leeds.ac.uk
Tue Nov 16 14:12:03 UTC 2010


This is precisely an argument in favour of doing WaC carefully, without
being blinded by sheer numbers.  The Google n-gram release is known to
suffer from cleaning problems; just remember the "trolls ambushed the
dwarfs" example, discussed on a couple of NLP blogs recently.  Getting
rid of link spammers, removing boilerplate and doing some kind of
domain/genre classification all contribute to producing a reasonable
corpus, definitely better than what the BNC can provide.  As an example,
for the kind of research I was doing, ukWaC with its 2 billion words was
more useful than the 1TB Google corpus, but I heard that Google is
planning an updated release, so this is something worth watching.
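
To give a concrete (if very crude) idea of the kind of cleaning I have
in mind, here is a minimal Python sketch: boilerplate stripping,
paragraph-level de-duplication and a link-density heuristic against
link spam.  The heuristics and thresholds are illustrative assumptions
only, not the actual ukWaC pipeline:

    import hashlib
    import re

    TAG_RE = re.compile(r'<[^>]+>')
    LINK_RE = re.compile(r'href=', re.IGNORECASE)

    def clean_page(html, seen_hashes, max_link_density=0.05, min_words=10):
        """Return cleaned paragraphs from one page, or [] if it looks spammy."""
        n_links = len(LINK_RE.findall(html))
        text = TAG_RE.sub(' ', html)
        words = text.split()
        # Crude link-spam heuristic: too many anchors per running word.
        if words and n_links / len(words) > max_link_density:
            return []
        paragraphs = []
        for para in re.split(r'\n\s*\n', text):
            tokens = para.split()
            if len(tokens) < min_words:
                # Crude boilerplate filter: drop short navigational fragments.
                continue
            digest = hashlib.md5(' '.join(tokens).lower().encode()).hexdigest()
            if digest in seen_hashes:
                # Paragraph-level de-duplication (share seen_hashes across pages).
                continue
            seen_hashes.add(digest)
            paragraphs.append(' '.join(tokens))
        return paragraphs

In practice you would use proper boilerplate removal and near-duplicate
detection (shingling) rather than exact hashes, plus genre/domain
classification on top, but the shape of the pipeline is the same.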

Serge

On Tue, 2010-11-16 at 11:44 +0000, Miles Osborne wrote:
> Building LMs from Web data is actually a very tricky business.  In my
> experience it is next-to-impossible to get anything useful out of it
> (and by "useful", I mean improvements for machine translation).  The
> largest Web LM I have built is a five-gram over 144 billion tokens
> (taken from the CLUEWEB release).  This is after de-duplicating the
> data and removing spam from it.
> 
> Now, the question of why Web data (e.g. the Google n-gram release, my
> CLUEWEB data) doesn't seem to work for such tasks is interesting.  I
> would guess it is a mixture of three things: the raw frequencies not
> looking like those of any real edited text (which contradicts early
> claims of "The Web as Corpus"); the possibility that even 144 billion
> tokens is not enough volume to compensate for the massive amount of
> garbage; and the idea that Web data is simply too out-of-domain for
> typical translation tasks.
> 
> Good luck
> 
> Miles
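
PS: the core of the five-gram counting Miles describes is trivial; the
hard parts are the de-duplication, the spam removal and the sheer scale
(at 144 billion tokens you would count in a distributed fashion, not in
memory).  A toy sketch, with exact line-level de-duplication standing
in for the real thing:

    import sys
    from collections import Counter

    def count_fivegrams(lines):
        counts = Counter()
        seen = set()
        for line in lines:
            if line in seen:
                # Exact duplicate lines only; real pipelines use
                # near-duplicate detection over whole documents.
                continue
            seen.add(line)
            tokens = line.split()
            for i in range(len(tokens) - 4):
                counts[tuple(tokens[i:i + 5])] += 1
        return counts

    if __name__ == '__main__':
        # Read tokenised text from stdin, print the 20 most frequent 5-grams.
        for ngram, c in count_fivegrams(sys.stdin).most_common(20):
            print(c, ' '.join(ngram))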


