[Corpora-List] Auto-generation and how to spot it

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Mon Nov 13 12:06:52 UTC 2006


"My eyes tell me that there are fabulous talents in every decade, 
including this one. You have to remember where these young guys were 
picked. You know things  are different when there's a press seat 
assigned to someone representing lebronjames. Like many sports, you are 
going to have writers who are too close  to the teams they cover and 
writers who aren't."


This is the start of a spam message which I (and presumably several thousand 
other people) just received. My suspicion is that the text has been 
automatically generated from a reasonably large corpus of authentic 
email material (in this case, presumably, from some collection of sports 
writing). The interesting question for this list is: how do I know it's 
artificially generated? I'm guessing that the lack of coherence has 
something to do with it, but what are the factors which indicate that? 
And how much text would you need to scan before determining that there 
was no natural coherence amongst its components?
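
One crude way to make "lack of coherence" measurable is to count how many content words each sentence shares with the sentence before it. The sketch below is only an illustration of that idea, not a description of how any spam filter works: the stopword list, the naive sentence splitter, and the function names are all assumptions made for the example, and low overlap alone does not prove a text is generated, since authentic prose can also change topic abruptly.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list would be far larger).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that",
    "this", "it", "you", "who", "for", "on", "with", "there", "they",
}

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def content_words(sentence):
    """Set of lowercased alphabetic tokens with stopwords removed."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def adjacent_overlap(text):
    """Jaccard overlap of content words for each pair of adjacent sentences."""
    sets = [content_words(s) for s in split_sentences(text)]
    return [jaccard(x, y) for x, y in zip(sets, sets[1:])]

def mean_overlap(text):
    """Average adjacent-sentence overlap; 0.0 when there are fewer than two sentences."""
    scores = adjacent_overlap(text)
    return sum(scores) / len(scores) if scores else 0.0
```

Calling mean_overlap() on a candidate message and on a sample of authentic text from the same genre would give two numbers to compare. How large a sample is needed before such a comparison becomes reliable is exactly the open question.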

It's a question that several spam filter makers would probably pay good 
money for an answer to.


