[Corpora-List] Auto-generation and how to spot it

Yorick Wilks yorick at dcs.shef.ac.uk
Mon Nov 13 12:30:58 UTC 2006


I've had more than one student do a project on spam generated from 
parts of coherent text, i.e. the job was to detect incoherence (usually 
a topic/vocabulary shift across sentence boundaries that is larger than 
in control text). Such methods usually work well, at the 95% level, and 
could easily be put into filters if anyone wanted to.
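
As a rough illustration of the idea (not the students' actual code), a vocabulary-shift check across sentence boundaries might be sketched as below. Everything in it is an assumption made for the example: the function names, the tiny stopword list, the crude sentence splitter, the use of cosine distance between bag-of-words vectors, and the `margin` parameter. A real filter would calibrate the margin on a proper control corpus.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "it", "that", "this", "you", "was", "on", "for"}

def sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def bag(sentence):
    """Lower-cased content-word counts for one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def mean_shift(text):
    """Average vocabulary shift (1 - cosine) over adjacent sentence pairs."""
    bags = [bag(s) for s in sentences(text)]
    if len(bags) < 2:
        return 0.0
    shifts = [1.0 - cosine(x, y) for x, y in zip(bags, bags[1:])]
    return sum(shifts) / len(shifts)

def looks_generated(text, control_text, margin=0.1):
    """Flag text whose mean shift exceeds the control's by more than margin."""
    return mean_shift(text) > mean_shift(control_text) + margin
```

Calling `looks_generated(message, control_text)` with a sample of ordinary prose as `control_text` returns True when the message jumps between vocabularies noticeably more than the control does. Very short messages give few sentence pairs to average over, so the estimate is noisy for them.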
Yorick Wilks


On 13 Nov 2006, at 12:06, Lou Burnard wrote:

> "My eyes tell me that there are fabulous talents in every decade, 
> including this one. You have to remember where these young guys were 
> picked. You know things  are different when there's a press seat 
> assigned to someone representing lebronjames. Like many sports, you 
> are going to have writers who are too close  to the teams they cover 
> and writers who aren't."
>
>
> This is the start of a spam which I (and presumably several thousand 
> other people) just received. My suspicion is that the text has been 
> automatically generated from a reasonably large corpus of authentic 
> email material (in this case, presumably, from some collection of 
> sports writing). The interesting question for this list is: how do I 
> know it's artificially generated? I'm guessing that the lack of 
> coherence has something to do with it, but what are the factors which 
> indicate that? And how much text would you need to scan before 
> determining that there was no natural coherence amongst its 
> components?
>
> It's a question that several spam filter makers would probably pay 
> good money for an answer to.
>
>


