Corpora: Corpus Junk mail

Jose Maria Gomez Hidalgo jmgomez at dinar.esi.uem.es
Tue Mar 12 09:30:55 UTC 2002


At 16:18 11/03/2002 +0100, you wrote:

>Hi,
>
>I'm planning to write a program that uses statistical methods to identify
>junk e-mail. Does anyone know of a corpus of junk mail that I could use ?
>
>Thanks,

A number of collections of spam and legitimate messages can be accessed 
from my page on Machine Learning for spam detection, at 
http://www.esi.uem.es/~jmgomez/spam/index.html, including:

* Ling-spam (http://www.iit.demokritos.gr/~ionandr/lingspam_public.tar.gz) 
and PU1 (http://www.iit.demokritos.gr/~ionandr/pu1_encoded.tar.gz) by 
Androutsopoulos and colleagues

* Spambase (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase) 
hosted at the UCI Machine Learning Repository 
(http://www.ics.uci.edu/~mlearn/MLSummary.html) and built by George Forman 
and colleagues

These are relatively standard collections used for evaluating spam 
detection approaches, as you can see in my bibliography 
(http://liinwww.ira.uka.de/bibliography/Ai/MLSpamBibliography.html).
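As an illustration of how one of these collections might be read, here is a minimal Python sketch for Spambase. It assumes the data file in that directory is comma-separated, with 57 numeric attributes followed by a class label (1 for spam, 0 for legitimate) on each row. The function name `read_spambase` is made up for this example and is not part of any distribution.

```python
import csv

# Assumption: each Spambase row holds 57 numeric attributes plus a 0/1 class label.
N_FEATURES = 57


def read_spambase(lines):
    """Split comma-separated Spambase rows into feature vectors and labels.

    `lines` is any iterable of text lines, for example an open file.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
        features.append([float(value) for value in row[:-1]])
        labels.append(int(row[-1]))  # assumed convention: 1 = spam, 0 = legitimate
    return features, labels
```

With a local copy of the data file, the function can be called on the open file object; the resulting lists can then be passed to whatever learning method is being evaluated.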

Kolcz and colleagues comment in their paper (available at 
http://www-ai.ijs.si/DunjaMladenic/TextDM01/papers/Kolcz_TM.pdf) that they 
plan to make their spam collection public on their company website 
(http://personalogy.net/). This collection may be very interesting.

Alternatively, you can build a spam vs. legitimate collection using widely 
known spam repositories. The problem is legitimate email, which is not 
usually public. Like Androutsopoulos, you may use messages from a public 
list, but the most sensible approach is to use some publicly donated 
personal email, in order to reflect personal email usage.
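A minimal Python sketch of assembling such a collection follows. It assumes the spam and the legitimate messages have already been saved as one raw message per file in two separate directories. The directory layout and the names `load_messages` and `build_collection` are hypothetical, chosen only for this example.

```python
import email
from email import policy
from pathlib import Path


def load_messages(directory, label):
    """Parse every file in `directory` as an email message and tag it with `label`."""
    examples = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            msg = email.message_from_binary_file(handle, policy=policy.default)
        # Prefer the plain-text part; messages without one get an empty text.
        body = msg.get_body(preferencelist=("plain",))
        text = body.get_content() if body is not None else ""
        examples.append(
            {"label": label, "subject": str(msg["subject"] or ""), "text": text}
        )
    return examples


def build_collection(spam_dir, legit_dir):
    """Combine two directories of raw messages into one labelled list."""
    return load_messages(spam_dir, "spam") + load_messages(legit_dir, "legitimate")
```

Real spam often has malformed headers or unknown character sets, so a production version would probably need error handling around the parsing step.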

Hope this helps


>Cormac
>
>
>----------------------------
>Cormac O'Brien
>Department of Linguistics
>University of Gothenburg
>Box 200
>S-405 30 Gothenburg
>Sweden
>
>0046 (0)31 773 5234



_______________________________________________________________________________

Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid - CEES
28670 - Villaviciosa de Odon - MADRID
(+34) 912115670
jmgomez at dinar.esi.uem.es
http://www.esi.uem.es/~jmgomez/
_______________________________________________________________________________

Spanish law guarantees privacy in electronic communications. This 
electronic transmission is strictly confidential and intended solely for 
the addressee. If you are not the intended addressee, you are kindly 
requested not to disclose nor to copy this transmission and to notify us as 
soon as possible.


