Corpora: Re: Corpus Junk mail

Mon Mar 11 20:09:10 UTC 2002

Dear Cormac,

> I'm planning to write a program that uses statistical methods to identify
> junk e-mail. Does anyone know of a corpus of junk mail that I could use?

The Ling-Spam corpus consists of legitimate messages posted to the Linguist
List together with spam messages. It has been used as a benchmark for spam
filtering.

There is a link to it from the publications page of Ion Androutsopoulos:
http://www.iit.demokritos.gr/~ionandr/publications.htm

best,

Gosse Bouma

From the README file:

This directory contains the Ling-Spam corpus, as described in the
paper "An Evaluation of Naive Bayesian Anti-Spam Filtering" by
I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras,
and C.D. Spyropoulos; Proceedings of Workshop on Machine Learning
in the New Information Age, 11th European Conference on Machine
Learning, Barcelona, Spain, 2000.

There are four subdirectories, corresponding to four versions of
the corpus:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1,
..., part10). These correspond to the 10 partitions of the corpus
that were used in the 10-fold experiments. In each repetition, one
part was reserved for testing and the other 9 were used for training.
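
(Not part of the README.) The 10-fold scheme described above could be
reproduced roughly as follows. This is only a sketch: the function name
ten_fold_splits is my own, and it assumes that one version directory (for
example "bare") has been unpacked locally with its part1 ... part10
subdirectories intact.

```python
from pathlib import Path


def ten_fold_splits(version_dir):
    """Yield (train_dirs, test_dir) pairs, one per partition.

    version_dir is assumed to be one of the four corpus versions and to
    contain subdirectories named part1 ... part10.
    """
    root = Path(version_dir)
    prefix = "part"
    # Sort numerically so that part10 comes after part9, not after part1.
    parts = sorted(
        (
            p
            for p in root.iterdir()
            if p.is_dir()
            and p.name.startswith(prefix)
            and p.name[len(prefix):].isdigit()
        ),
        key=lambda p: int(p.name[len(prefix):]),
    )
    for held_out in parts:
        # One part is reserved for testing, the remaining ones for training.
        train = [p for p in parts if p != held_out]
        yield train, held_out
```

Each repetition then trains on the nine directories in the first element
and tests on the single directory in the second.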

Each one of the 10 subdirectories contains both spam and legitimate
messages, one message in each file. Files whose names have the form
spmsg*.txt are spam messages. All other files are legitimate messages.
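
(Not part of the README.) Given the naming convention above, the messages in
one part directory could be labelled along these lines. Again a sketch under
assumptions: the names is_spam and load_part and the labels "spam" and
"legit" are my own, and the latin-1 encoding is a guess chosen because it
never fails to decode, not something the README states.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def is_spam(filename):
    """True if the file name has the form spmsg*.txt."""
    return fnmatchcase(Path(filename).name, "spmsg*.txt")


def load_part(part_dir):
    """Return a list of (text, label) pairs for one part directory."""
    examples = []
    for path in sorted(Path(part_dir).iterdir()):
        if not path.is_file():
            continue
        # Encoding is an assumption; latin-1 maps every byte to a character.
        text = path.read_text(encoding="latin-1")
        # All files that are not spmsg*.txt count as legitimate messages.
        label = "spam" if is_spam(path) else "legit"
        examples.append((text, label))
    return examples
```

The resulting (text, label) pairs can be fed to whatever statistical
classifier is being tested.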

--
Gosse Bouma, Alfa-informatica, RUG, Postbus 716, 9700 AS Groningen
gosse at let.rug.nl      tel. +31-50-3635937      fax  +31-50-3636855