[Corpora-List] Deviations in language models on the web

Justin Washtell lec3jrw at leeds.ac.uk
Tue Nov 16 12:34:59 UTC 2010


Hi Serge,

I can think of one or two half-hearted angles of attack, but nothing off the top of my head which couldn't readily be outfoxed by the very next wave of link-spammers. Indeed, any half-decent language models we do develop are ripe for exploitation directly by the spammers. Given that very fundamental trait of language, its generative capacity, I am inclined to think that the spammers have the upper hand in this one. It's a bit like a war between viruses and anti-virus software, except in a world where a "legitimate" program is largely defined by the fact that it self-replicates and self-obfuscates. My initial suspicion is therefore that this is a genuinely hard - borderline impossible - problem. Mind you, that's exactly what makes it interesting... so I shall give it some more thought :-)

Justin Washtell
University of Leeds

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Serge Sharoff [s.sharoff at leeds.ac.uk]
Sent: 16 November 2010 09:12
To: corpora at uib.no
Subject: [Corpora-List] Deviations in language models on the web

Dear all,

In doing webcrawls for linguistic purposes, I recently came across an
approach to link spamming or search engine optimisation (SEO) that
involves taking sentences from a large range of texts (mostly
out-of-copyright fiction), mixing the sentences randomly, injecting the
name of a product (or other keywords) and creating thousands of webpages.

The intent is probably to fool search engines into thinking these are
product reviews or descriptions, but the implication for linguistics is
that we get polluted language models, in which mobile phones collocate
with horse-drawn carriages.

SEO-enhanced pages I came across in the past contained random word lists
with keywords injected.  It was possible to deal with such cases by
n-gram filtering.  However, this simple trick doesn't work any longer,
as the sentences are to a very large extent entirely grammatical.
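
As a rough illustration of the kind of n-gram filtering meant here (a
minimal sketch under stated assumptions, not the filter actually used in
the crawls): score each page by the share of its word bigrams that also
occur in a trusted reference corpus. The names known_bigram_ratio,
reference_text and the sample strings are hypothetical, and a real filter
would use a much larger reference corpus and a tuned threshold.

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def known_bigram_ratio(text, reference_bigrams):
    """Fraction of the text's bigrams that also occur in the reference set."""
    pairs = bigrams(text.lower().split())
    if not pairs:
        return 0.0
    seen = sum(1 for pair in pairs if pair in reference_bigrams)
    return seen / len(pairs)


# Toy reference corpus (assumption: stands in for a large clean corpus).
reference_text = (
    "the horse drew the carriage along the road . "
    "she opened the door of the carriage ."
)
reference_bigrams = set(bigrams(reference_text.lower().split()))

# Old-style spam: a random word list with a keyword injected.
word_list = "carriage phone the road mobile door horse she"

# New-style spam: whole grammatical sentences in shuffled order.
shuffled_sentences = (
    "she opened the door of the carriage . "
    "the horse drew the carriage along the road ."
)

word_list_score = known_bigram_ratio(word_list, reference_bigrams)
shuffled_score = known_bigram_ratio(shuffled_sentences, reference_bigrams)
```

In this toy setup the word list scores low and would be filtered out,
while the shuffled sentences score almost as high as clean text, because
only the bigrams spanning sentence boundaries are disturbed.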

Does anyone have experience with this phenomenon, or suggestions on how
to deal with it?

Best,
Serge


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
