[Corpora-List] Auto-generation and how to spot it

Mon Nov 13 12:51:59 UTC 2006

I was quite shocked yesterday to receive, within seconds of having
viewed a video on YouTube, a spam email with a subject header obviously
derived from the video I had just watched. No doubt the answer is to
install some sort of firewall (this was at home on my more or less open
broadband connection).

Sorry it's a bit off topic, but if, like me, you sometimes view your
email before it gets to the spam filter, it is quite shocking just how
much there is, and quite reassuring how much of it gets junked.

Harold Somers  

> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Diana Maynard
> Sent: 13 November 2006 12:32
> To: Lou Burnard
> Cc: corpora at lists.uib.no
> Subject: Re: [Corpora-List] Auto-generation and how to spot it
> 
> In general I've noticed that the subject header bears no 
> correlation at all to the email content, which could be a 
> useful indicator. Although of course, genuine emails often 
> suffer from this problem when people reply to messages and 
> gradually change tack without changing the subject header. In 
> this case though, you generally get some pasting of the 
> message to which they're replying (I've never yet seen that 
> on a spam mail - I assumed because the content of the spam is 
> pasted from a web corpus rather than an email corpus).
> Diana
> 
> 
> Lou Burnard wrote:
> > "My eyes tell me that there are fabulous talents in every decade,
> > including this one. You have to remember where these young guys were
> > picked. You know things  are different when there's a press seat
> > assigned to someone representing lebronjames. Like many sports, you
> > are going to have writers who are too close  to the teams they cover
> > and writers who aren't."
> >
> >
> > This is the start of a spam which I (and presumably several thousand
> > other people) just received. My suspicion is that the text has been
> > automatically generated from a reasonably large corpus of authentic
> > email material (in this case, presumably, from some collection of
> > sports writing). The interesting question for this list is: how do I
> > know it's artificially generated? I'm guessing that the lack of
> > coherence has something to do with it, but what are the factors which
> > indicate that? And how much text would you need to scan before
> > determining that there was no natural coherence amongst its components?
> >
> > It's a question that several spam filter makers would probably pay 
> > good money for an answer to.
> >
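
The two indicators Diana mentions in the quoted message (a subject header unrelated to the body, and the absence of pasted reply text) could be sketched roughly as below. This is only an illustration, not anything proposed on the list: the function names, the tiny stopword list and the 0.2 threshold are all assumptions made up for the example, and a real filter would need far more than word overlap.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "it", "is", "re", "how", "for"}

def content_words(text):
    """Return the set of lowercase alphabetic tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def subject_body_overlap(subject, body):
    """Fraction of the subject's content words that also occur in the body."""
    subj = content_words(subject)
    if not subj:
        return 0.0
    return len(subj & content_words(body)) / len(subj)

def has_quoted_reply(body):
    """True if any line starts with '>', the usual marker of pasted reply text."""
    return any(line.lstrip().startswith(">") for line in body.splitlines())

def looks_suspicious(subject, body, min_overlap=0.2):
    """Flag mail whose subject shares few words with the body and quotes nothing.

    min_overlap is an arbitrary illustrative threshold, not a tuned value.
    """
    return (subject_body_overlap(subject, body) < min_overlap
            and not has_quoted_reply(body))
```

A genuine reply that has drifted away from its subject line would still pass here as long as it quotes the message it answers, which is the distinction drawn in the quoted message. It says nothing about the harder question of coherence within the body itself.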