[Corpora-List] Deviations in language models on the web

Keith Trnka trnka at cis.udel.edu
Wed Nov 17 01:17:49 UTC 2010


In the context of the original problem, if we had a way to rate documents that accounted for higher-level structure, the spammer would simply pick the sequence of sentences that maximizes that rating.  Once the model is good enough to account for things like discourse structure, cohesion, etc., the sequence of sentences the spammer selects either becomes A) useful to a human (in which case it's not exactly spam) or B) very close to a copy of a real document (in which case duplicate-detection methods should catch it).
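
For case B, even a crude shingle-overlap check goes a long way as a duplicate detector.  A minimal sketch in Python (the function names and the 0.5 threshold are mine, purely illustrative):

    import re

    def shingles(text, n=5):
        # Set of word n-grams ("shingles"); the standard near-duplicate trick.
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        # Jaccard similarity between two shingle sets.
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def looks_like_copy(candidate, source, threshold=0.5):
        # A page stitched together from copied sentences shares many
        # shingles with its sources; flag anything above the threshold.
        return jaccard(shingles(candidate), shingles(source)) >= threshold

Real systems use hashing tricks like MinHash to make this scale to the web, but the underlying idea is the same.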

With respect to the original problem, task-specific features may be easier to apply: blacklists for hosts and username/host pairs, and characteristics of the links themselves (are they all grouped together in the HTML?  is the link/word ratio unlike natural text?).  There may be too many hostnames for a straight blacklist, so you might have to generalize hosts into categories.
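
To give a sense of what I mean by link characteristics, here is a rough Python sketch of the link/word ratio using only the standard library (the class name and any thresholds are made up; real features would need tuning on actual crawl data):

    from html.parser import HTMLParser
    import re

    class LinkStats(HTMLParser):
        # Counts <a> tags and collects visible text as the page is parsed.
        def __init__(self):
            super().__init__()
            self.links = 0
            self.text_chunks = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += 1

        def handle_data(self, data):
            self.text_chunks.append(data)

    def link_word_ratio(html_source):
        parser = LinkStats()
        parser.feed(html_source)
        words = re.findall(r"\w+", " ".join(parser.text_chunks))
        return parser.links / max(len(words), 1)

Natural prose rarely has more than a few links per hundred words, so a ratio far above that is suspicious; checking whether the links are all grouped together could be done the same way by recording their positions in the text.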

A hybrid approach might try to determine how many of the links are cohesive with (i.e., related to) the text you're checking, and then learn a threshold to tell spam from legitimate pages.
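
One crude way to make "cohesive" operational, assuming the anchor texts have already been extracted: bag-of-words cosine between each anchor text and the page body, with a learned cutoff.  The similarity measure and the 0.1 threshold below are placeholders, not a recommendation:

    from collections import Counter
    import math
    import re

    def bow(text):
        # Bag-of-words counts; a stand-in for any better similarity model.
        return Counter(re.findall(r"\w+", text.lower()))

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def cohesive_link_count(anchor_texts, body_text, sim_threshold=0.1):
        # How many links look topically related to the page they sit on?
        body = bow(body_text)
        return sum(1 for t in anchor_texts
                   if cosine(bow(t), body) >= sim_threshold)

The count (or the fraction of cohesive links) would then be the feature a classifier learns a threshold over.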

keith

On Nov 16, 2010, at 6:20 PM, Oliver Mason wrote:

> I would think that if a comprehensive model of text structure can at
> some point be created, then it would not be possible (or would at least
> be very hard) to generate 'random' texts, as they'd be incoherent. Generating
> a 'proper' text would require a lot of input (or content) that would
> have to be used for language generation. Though my usage of 'language
> model on text level' here is probably different from what some other
> people in this thread understand by it! And I think we're still years
> away from such a model.
> 
> Best,
> Oliver
> 
> On 16 November 2010 21:55, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>> Oliver,
>> there are easier ways than taking language models beyond the sentence.  (And
>> even if we did, the clever spammer could use those language models in
>> 'generation' mode to keep ahead of us.)  One is to piggyback on the large
>> amounts of work that Google and Bing do to stay ahead of the spammers, e.g.
>> by using BootCaT.  They are putting lots of effort into not serving spam as
>> top search hits, so if we use pages that they propose, we avoid most spam.
>> 
>> Adam
>> On 16 November 2010 14:54, Oliver Mason <O.Mason at bham.ac.uk> wrote:
>>> 
>>> I believe language models need to take structure beyond the sentence
>>> into account. Then it would be fairly obvious that you're looking at a
>>> list of sentences rather than a text, just as we can already
>>> distinguish between a list of words and a proper sentence.
>>> 
>>> The problem, then, is how to push language models up one level...
>>> 
>>> Oliver
>>> 
>>> On 16 November 2010 12:34, Justin Washtell <lec3jrw at leeds.ac.uk> wrote:
>>>> Hi Serge,
>>>> 
>>>> I can think of one or two half-hearted angles of attack, but nothing off
>>>> the top of my head which couldn't readily be out-foxed by the very next wave
>>>> of link-spammers. Indeed, any half-decent language models we do develop, are
>>>> ripe for exploitation directly by the spammers. Given that very fundamental
>>>> trait of language: its generative capacity, I am inclined to think that the
>>>> spammers have the upper hand in this one. It's a bit like a war between
>>>> viruses and anti-virus software, except in a world where a "legitimate"
>>>> program is largely defined by the fact that it self-replicates and
>>>> self-obfuscates. My initial suspicion is therefore that this is a genuinely
>>>> hard - borderline impossible - problem. Mind you, that's exactly what makes
>>>> it interesting... so I shall give it some more thought :-)
>>>> 
>>>> Justin Washtell
>>>> University of Leeds
>>>> 
>>>> ________________________________________
>>>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Serge
>>>> Sharoff [s.sharoff at leeds.ac.uk]
>>>> Sent: 16 November 2010 09:12
>>>> To: corpora at uib.no
>>>> Subject: [Corpora-List] Deviations in language models on the web
>>>> 
>>>> Dear all,
>>>> 
>>>> in doing webcrawls for linguistic purposes, I recently came across an
>>>> approach to link spamming or SEO that involves taking
>>>> sentences from a large range of texts (mostly out-of-copyright fiction),
>>>> mixing the sentences randomly, injecting the name of a product (or other
>>>> keywords) and creating thousands of webpages.
>>>> 
>>>> The intent is probably to fool search engines into thinking these are
>>>> product reviews or descriptions, but the implication for linguistics is
>>>> that we get polluted language models, in which mobile phones collocate
>>>> with horse drawn carriages.
>>>> 
>>>> SEO-enhanced pages I came across in the past contained random word lists
>>>> with keywords injected.  It was possible to deal with such cases by
>>>> n-gram filtering.  However, this simple trick doesn't work any longer,
>>>> as the sentences are, for the most part, entirely grammatical.
>>>> 
>>>> Any experience from others, or suggestions on how to deal with this
>>>> phenomenon?
>>>> 
>>>> Best,
>>>> Serge
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Dr Oliver Mason
>>> Technical Director of the Centre for Corpus Research
>>> Head of Postgraduate Studies (Doctoral Research)
>>> School of English, Drama, and ACS
>>> The University of Birmingham
>>> Birmingham B15 2TT
>>> 
>> 
>> 
>> 
>> --
>> ================================================
>> Adam Kilgarriff
>> http://www.kilgarriff.co.uk
>> Lexical Computing Ltd                   http://www.sketchengine.co.uk
>> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
>> Universities of Leeds and Sussex       adam at lexmasterclass.com
>> ================================================
>> 
> 
> 
> 
> -- 
> Dr Oliver Mason
> Technical Director of the Centre for Corpus Research
> Head of Postgraduate Studies (Doctoral Research)
> School of English, Drama, and ACS
> The University of Birmingham
> Birmingham B15 2TT
> 


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


