[Corpora-List] Auto-generation and how to spot it
Diana Maynard
d.maynard at dcs.shef.ac.uk
Mon Nov 13 12:31:58 UTC 2006
In general I've noticed that in spam of this kind the subject header
bears no relation at all to the email content, which could be a useful
indicator. Of course, genuine emails often suffer from the same problem
when people reply to messages and gradually change tack without changing
the subject header. In that case, though, you generally get some pasting
of the message to which they're replying. I've never yet seen that in a
spam mail; I assume this is because the content of the spam is pasted
from a web corpus rather than an email corpus.
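
The two cues above (a subject unrelated to the body, and no pasted
reply text) could be combined roughly as follows. This is only a minimal
sketch under stated assumptions, not a tested filter: the stopword list,
the 0.2 threshold, the use of a leading ">" as the quoting marker, and
all function names are hypothetical choices made for illustration.

```python
import re

# Hypothetical stopword list for illustration; a real filter would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "is", "it", "re", "fwd", "how"}

def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def subject_body_overlap(subject, body):
    """Fraction of subject content words that also occur in the body (0.0 to 1.0)."""
    subj = content_words(subject)
    if not subj:
        return 0.0
    return len(subj & content_words(body)) / len(subj)

def has_quoted_reply(body):
    """True if any line starts with '>', assumed here to mark pasted reply text."""
    return any(line.lstrip().startswith(">") for line in body.splitlines())

def looks_suspicious(subject, body, min_overlap=0.2):
    """Flag mail whose subject is unrelated to the body and which quotes nothing."""
    unrelated = subject_body_overlap(subject, body) < min_overlap
    return unrelated and not has_quoted_reply(body)
```

A genuine reply whose subject has drifted would still pass here because
of its quoted lines. Exact word matching is crude (it treats "generation"
and "generated" as unrelated), so stemming would be an obvious refinement.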
Diana
Lou Burnard wrote:
> "My eyes tell me that there are fabulous talents in every decade,
> including this one. You have to remember where these young guys were
> picked. You know things are different when there's a press seat
> assigned to someone representing lebronjames. Like many sports, you
> are going to have writers who are too close to the teams they cover
> and writers who aren't."
>
>
> This is the start of a spam which I (and presumably several thousand
> other people) just received. My suspicion is that the text has been
> automatically generated from a reasonably large corpus of authentic
> email material (in this case, presumably, from some collection of
> sports writing). The interesting question for this list is: how do I
> know it's artificially generated? I'm guessing that the lack of
> coherence has something to do with it, but what are the factors which
> indicate that? And how much text would you need to scan before
> determining that there was no natural coherence amongst its components?
>
> It's a question that several spam filter makers would probably pay
> good money for an answer to.
>
>