[Corpora-List] Deviations in language models on the web

Tue Nov 16 14:54:00 UTC 2010

I believe language models need to take structure beyond the sentence
into account. It would then be fairly obvious that you are looking at a
list of sentences rather than a text, just as we can already
distinguish between a list of words and a proper sentence.

The problem, then, is how to push language models up one level...
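
One very rough way to picture "one level up" is to score how much each
sentence shares with the next. The sketch below is only an illustration
under stated assumptions, not a method proposed in this thread: it uses
plain word overlap (Jaccard) between adjacent sentences as a stand-in for
discourse structure, and the names STOPWORDS, content_words, jaccard and
cohesion are all hypothetical. Randomly mixed sentences from unrelated
sources would be expected to score lower than a real text, although a
spammer aware of the measure could presumably game it.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be larger.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "it", "was", "that"}
)


def content_words(sentence):
    """Return the set of lower-cased word tokens, minus stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in tokens if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cohesion(sentences):
    """Mean word overlap between each sentence and the one after it.

    Assumption: adjacent sentences of a genuine text tend to share
    vocabulary, while shuffled sentences from unrelated sources do not.
    """
    bags = [content_words(s) for s in sentences]
    pairs = list(zip(bags, bags[1:]))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A page whose cohesion score falls well below that of known-good texts of
similar length could be flagged for closer inspection; any threshold would
have to be estimated from data.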

Oliver

On 16 November 2010 12:34, Justin Washtell <lec3jrw at leeds.ac.uk> wrote:
> Hi Serge,
>
> I can think of one or two half-hearted angles of attack, but nothing off the top of my head which couldn't readily be out-foxed by the very next wave of link-spammers. Indeed, any half-decent language models we do develop are ripe for exploitation directly by the spammers. Given that very fundamental trait of language, its generative capacity, I am inclined to think that the spammers have the upper hand in this one. It's a bit like a war between viruses and anti-virus software, except in a world where a "legitimate" program is largely defined by the fact that it self-replicates and self-obfuscates. My initial suspicion is therefore that this is a genuinely hard - borderline impossible - problem. Mind you, that's exactly what makes it interesting... so I shall give it some more thought :-)
>
> Justin Washtell
> University of Leeds
>
> ________________________________________
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Serge Sharoff [s.sharoff at leeds.ac.uk]
> Sent: 16 November 2010 09:12
> To: corpora at uib.no
> Subject: [Corpora-List] Deviations in language models on the web
>
> Dear all,
>
> in doing webcrawls for linguistic purposes, I recently came across an
> approach to link spamming or search engine optimisation (SEO) that involves taking
> sentences from a large range of texts (mostly out-of-copyright fiction),
> mixing the sentences randomly, injecting the name of a product (or other
> keywords) and creating thousands of webpages.
>
> The intent is probably to fool search engines into thinking these are
> product reviews or descriptions, but the implication for linguistics is
> that we get polluted language models, in which mobile phones collocate
> with horse drawn carriages.
>
> SEO-enhanced pages I came across in the past contained random word lists
> with keywords injected.  It was possible to deal with such cases by
> n-gram filtering.  However, this simple trick doesn't work any longer,
> as the sentences are to a very large extent entirely grammatical.
>
> Any experience from others, or suggestions on how to deal with this
> phenomenon, would be welcome.
>
> Best,
> Serge
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Dr Oliver Mason
Technical Director of the Centre for Corpus Research
Head of Postgraduate Studies (Doctoral Research)
School of English, Drama, and ACS
The University of Birmingham
Birmingham B15 2TT