[Corpora-List] Auto-generation and how to spot it
Yorick Wilks
yorick at dcs.shef.ac.uk
Mon Nov 13 12:30:58 UTC 2006
I've had more than one student do a project on spam generated from
parts of coherent text, i.e. the job was to detect incoherence (usually
by finding a topic/vocabulary shift across sentence boundaries that is
larger than in control text). Such methods usually work well, at the 95%
level, and could easily be put in filters if anyone wanted to.
Yorick Wilks
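
A minimal sketch of the kind of check described above, not the students'
actual method: it measures how much content vocabulary adjacent sentences
share and flags text whose average overlap is well below that of control
text. The function names, the tiny stopword list, the Jaccard measure and
the `margin` parameter are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; a real filter would use a fuller one).
STOPWORDS = frozenset(
    "a an and are as at be but by for from had has have i in is it of on or "
    "that the there this to was were who with you your".split()
)

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def content_words(sentence):
    """Set of lowercased alphabetic tokens with stopwords removed."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a, b):
    """Vocabulary overlap between two word sets, in the range 0..1."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_adjacent_overlap(text):
    """Average overlap between each sentence and the next.

    Returns None when the text has fewer than two sentences,
    since there is no sentence boundary to measure across.
    """
    sents = [content_words(s) for s in split_sentences(text)]
    if len(sents) < 2:
        return None
    scores = [jaccard(x, y) for x, y in zip(sents, sents[1:])]
    return sum(scores) / len(scores)

def looks_incoherent(text, control_mean, margin=0.5):
    """Flag text whose overlap falls below margin * control_mean.

    control_mean is the average overlap measured on known-coherent text;
    margin is a hypothetical tuning knob, not a recommended value.
    """
    score = mean_adjacent_overlap(text)
    if score is None:
        return False  # too short to judge
    return score < margin * control_mean
```

In use, `control_mean` would be estimated from a sample of genuine text in
the same genre, and the threshold tuned on held-out spam. Raw word overlap
is a crude stand-in for topic shift; it says nothing about how much text
is needed before the judgement becomes reliable.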
On 13 Nov 2006, at 12:06, Lou Burnard wrote:
> "My eyes tell me that there are fabulous talents in every decade,
> including this one. You have to remember where these young guys were
> picked. You know things are different when there's a press seat
> assigned to someone representing lebronjames. Like many sports, you
> are going to have writers who are too close to the teams they cover
> and writers who aren't."
>
>
> This is the start of a spam which I (and presumably several thousand
> other people) just received. My suspicion is that the text has been
> automatically generated from a reasonably large corpus of authentic
> email material (in this case, presumably, from some collection of
> sports writing). The interesting question for this list is: how do I
> know it's artificially generated? I'm guessing that the lack of
> coherence has something to do with it, but what are the factors which
> indicate that? And how much text would you need to scan before
> determining that there was no natural coherence amongst its
> components?
>
> It's a question that several spam filter makers would probably pay
> good money for an answer to.
>
>