[Corpora-List] Auto-generation and how to spot it

Diana Maynard d.maynard at dcs.shef.ac.uk
Mon Nov 13 12:31:58 UTC 2006


In general I've noticed that the subject header bears no relation at 
all to the email content, which could be a useful indicator. Of course, 
genuine emails often suffer from this problem too, when people reply 
to messages and gradually change tack without changing the subject 
header. In that case, though, you generally get some pasting of the 
message to which they're replying (I've never yet seen that in a spam 
mail; I assume this is because the content of the spam is pasted from 
a web corpus rather than an email corpus).
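
As a rough illustration only, the two cues above (a subject unrelated 
to the body, and the absence of any quoted reply text) could be combined 
along the following lines. The function names, the tiny stopword list 
and the 0.2 threshold are all assumptions made for this sketch, not part 
of any existing filter, and simple word overlap is only a crude stand-in 
for "relatedness".

```python
import re

# Small illustrative stopword list (an assumption; a real filter would
# use a much fuller one, including mail prefixes such as "re" and "fwd").
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for",
             "is", "it", "re", "fwd", "how"}

def content_words(text):
    """Return the set of lowercase alphabetic tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}

def subject_overlap(subject, body):
    """Fraction of the subject's content words that also occur in the body."""
    subj = content_words(subject)
    if not subj:
        return 0.0
    return len(subj & content_words(body)) / len(subj)

def has_quoted_reply(body):
    """True if any line starts with '>', the usual marker of pasted reply text."""
    return any(line.lstrip().startswith(">") for line in body.splitlines())

def looks_suspicious(subject, body, min_overlap=0.2):
    """Flag a message only if the subject is unrelated to the body
    AND the body contains no quoted reply text (hypothetical rule)."""
    return (subject_overlap(subject, body) < min_overlap
            and not has_quoted_reply(body))
```

A genuine reply whose subject has drifted would still pass, because the 
quoted lines are present; a message with an unrelated subject and no 
quoting at all would be flagged.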
Diana


Lou Burnard wrote:
> "My eyes tell me that there are fabulous talents in every decade, 
> including this one. You have to remember where these young guys were 
> picked. You know things  are different when there's a press seat 
> assigned to someone representing lebronjames. Like many sports, you 
> are going to have writers who are too close  to the teams they cover 
> and writers who aren't."
>
>
> This is the start of a spam which I (and presumably several thousand 
> other people) just received. My suspicion is that the text has been 
> automatically generated from a reasonably large corpus of authentic 
> email material (in this case, presumably, from some collection of 
> sports writing). The interesting question for this list is: how do I 
> know it's artificially generated? I'm guessing that the lack of 
> coherence has something to do with it, but what are the factors which 
> indicate that? And how much text would you need to scan before 
> determining that there was no natural coherence amongst its components?
>
> It's a question that several spam filter makers would probably pay 
> good money for an answer to.
>
>


