[Corpora-List] Deviations in language models on the web

Serge Sharoff s.sharoff at leeds.ac.uk
Tue Nov 16 09:12:02 UTC 2010


Dear all,

In doing webcrawls for linguistic purposes, I recently came across an
approach to link spamming, or search-engine optimisation (SEO), that
involves taking sentences from a wide range of texts (mostly
out-of-copyright fiction), shuffling the sentences randomly, injecting
the name of a product (or other keywords), and generating thousands of
webpages.
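For concreteness, a minimal Python sketch of what such a generator
might look like (the input file and the product name are hypothetical,
chosen only to illustrate the process described above):

    import random

    def make_spam_page(sentences, keyword, n_sentences=20):
        """Build one page: shuffled fiction sentences plus an injected keyword."""
        page = random.sample(sentences, min(n_sentences, len(sentences)))
        # Inject the keyword at a random position in a random sentence.
        i = random.randrange(len(page))
        words = page[i].split()
        words.insert(random.randrange(len(words) + 1), keyword)
        page[i] = " ".join(words)
        return " ".join(page)

    # Hypothetical input: one fiction sentence per line.
    with open("fiction_sentences.txt", encoding="utf-8") as f:
        sentences = f.read().splitlines()
    pages = [make_spam_page(sentences, "AcmePhone X100") for _ in range(1000)]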

The intent is probably to fool search engines into treating these pages
as product reviews or descriptions, but the implication for linguistics
is that we get polluted language models, in which mobile phones
collocate with horse-drawn carriages.

SEO-enhanced pages I came across in the past contained random word
lists with injected keywords.  It was possible to deal with such cases
by n-gram filtering, i.e. by discarding pages whose word sequences
hardly ever occur in a reference corpus.  However, this simple trick no
longer works, as the new pages' sentences are for the most part
entirely grammatical.
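By way of illustration, the kind of n-gram filter that used to suffice
might look like the following sketch (the reference trigram set and the
threshold of 0.2 are assumptions, not tested values):

    def ngrams(tokens, n=3):
        """All contiguous n-grams of a token list, as tuples."""
        return zip(*(tokens[i:] for i in range(n)))

    def looks_like_word_salad(text, reference_trigrams, threshold=0.2):
        """Flag a page if too few of its trigrams are attested in a
        trusted reference corpus."""
        tokens = text.lower().split()
        trigrams = list(ngrams(tokens))
        if not trigrams:
            return True
        hits = sum(1 for g in trigrams if g in reference_trigrams)
        return hits / len(trigrams) < threshold

Such a filter catches word salad, whose trigram hit rate against any
reference corpus is near zero; but pages stitched together from genuine
fiction sentences sail through, since almost every trigram is attested
somewhere.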

Does anyone have experience with this phenomenon, or suggestions on how
to deal with it?

Best,
Serge

