[Corpora-List] Auto-generation and how to spot it

Sravana Reddy sravana.reddy at gmail.com
Mon Nov 13 23:04:46 UTC 2006


Believe it or not, that spam was _not_ artificially generated, at least not
at the sentence level. All the individual sentences are from
http://kfba.net/Forums/. The only randomness there is in the selection and
order of the sentences.

That aside, your question is very interesting. I would guess that an
artificially generated text has greater entropy than a human-generated
sample. So, perhaps you could train a reasonable-order Markov model on some
specialized corpus (sports discussion, in this case), and measure the
redundancy of the test sample against that.
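
To make the idea concrete, here is a rough sketch of what I have in mind. The
model order (bigram), the add-one smoothing, the whitespace tokenization and
the function names train_bigram and cross_entropy are all my own assumptions
for illustration, not a tested recipe. It scores a sample by its average
number of bits per token under a model trained on the in-domain corpus; a
higher score means the sample looks less predictable to the model.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect bigram statistics from a list of training tokens.

    Returns (context counts, bigram counts, vocabulary).
    """
    contexts = Counter(tokens[:-1])                  # how often each word precedes another
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # counts of adjacent word pairs
    vocab = set(tokens)
    return contexts, bigrams, vocab


def cross_entropy(model, tokens):
    """Average bits per token of `tokens` under the bigram model.

    Uses add-one (Laplace) smoothing, with one extra vocabulary slot
    reserved for unseen words. Returns NaN if there are no bigrams.
    """
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # assumption: a single slot stands in for all unseen words
    total_bits = 0.0
    n = 0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n if n else float("nan")


if __name__ == "__main__":
    # Toy stand-in for a sports-discussion corpus (hypothetical data).
    corpus = "the team won the game and the team lost the game".split()
    model = train_bigram(corpus)
    for sample in ("the team won the game", "game the won team the"):
        print(sample, "->", cross_entropy(model, sample.split()))
```

Note that a bigram model only sees adjacent words, so a text stitched
together from authentic sentences, like this spam, would probably differ from
human text mostly at the sentence boundaries. A higher-order model or a
per-boundary score might be needed to pick that up.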

Sravana

On 11/13/06, Lou Burnard <lou.burnard at computing-services.oxford.ac.uk>
wrote:
>
> "My eyes tell me that there are fabulous talents in every decade,
> including this one. You have to remember where these young guys were
> picked. You know things are different when there's a press seat
> assigned to someone representing lebronjames. Like many sports, you are
> going to have writers who are too close to the teams they cover and
> writers who aren't."
>
>
> This is the start of a spam which I (and presumably several thousand
> other people) just received. My suspicion is that the text has been
> automatically generated from a reasonably large corpus of authentic
> email material (in this case, presumably, from some collection of sports
> writing). The interesting question for this list is: how do I know it's
> artificially generated? I'm guessing that the lack of coherence has
> something to do with it, but what are the factors which indicate that?
> And how much text would you need to scan before determining that there
> was no natural coherence amongst its components?
>
> It's a question that several spam filter makers would probably pay good
> money for an answer to.
>
>
>