[Corpora-List] Auto-generation and how to spot it
Lou Burnard
lou.burnard at computing-services.oxford.ac.uk
Mon Nov 13 12:06:52 UTC 2006
"My eyes tell me that there are fabulous talents in every decade,
including this one. You have to remember where these young guys were
picked. You know things are different when there's a press seat
assigned to someone representing lebronjames. Like many sports, you are
going to have writers who are too close to the teams they cover and
writers who aren't."
This is the start of a spam message which I (and presumably several thousand
other people) just received. My suspicion is that the text has been
automatically generated from a reasonably large corpus of authentic
email material (in this case, presumably, from some collection of sports
writing). The interesting question for this list is: how do I know it's
artificially generated? I'm guessing that the lack of coherence has
something to do with it, but what are the factors which indicate that?
And how much text would you need to scan before determining that there
was no natural coherence amongst its components?
It's a question that several spam filter makers would probably pay good
money for an answer to.
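
One crude way to make "lack of coherence" measurable, offered purely as a sketch and not as a tested detector: score how many content words each sentence shares with the sentence that follows it. Everything in the block below is an assumption made for illustration: the naive sentence splitter, the tiny hand-picked stopword list, the choice of Jaccard overlap as the score, and all of the function names. Text stitched together from unrelated sources would be expected to score low, but so would plenty of short authentic texts.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "and", "or", "is", "are", "was",
    "were", "be", "that", "this", "it", "you", "they", "who", "there",
    "when", "have", "my", "me", "for", "on", "with", "at", "like",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(sentence):
    """Lower-cased word types in the sentence, minus stopwords."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def adjacent_overlap(text):
    """One Jaccard score per pair of neighbouring sentences."""
    sents = [content_words(s) for s in split_sentences(text)]
    return [jaccard(x, y) for x, y in zip(sents, sents[1:])]


def mean_overlap(text):
    """Average neighbour overlap; 0.0 if there are fewer than two sentences."""
    scores = adjacent_overlap(text)
    return sum(scores) / len(scores) if scores else 0.0
```

Calling mean_overlap() on a message gives a single number between 0 and 1. Any cut-off for "no natural coherence" would have to be calibrated against authentic text of the same genre and length, which is really the "how much text" question again: with only a handful of sentences there are too few neighbouring pairs for the average to mean much.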