Corpora: Corpus Junk mail
Jose Maria Gomez Hidalgo
jmgomez at dinar.esi.uem.es
Tue Mar 12 09:30:55 UTC 2002
At 16:18 11/03/2002 +0100, you wrote:
>Hi,
>
>I'm planning to write a program that uses statistical methods to identify
>junk e-mail. Does anyone know of a corpus of junk mail that I could use ?
>
>Thanks,
A number of collections of spam and legitimate messages can be accessed
from my page on Machine Learning for spam detection, at
http://www.esi.uem.es/~jmgomez/spam/index.html, including:
* Ling-spam (http://www.iit.demokritos.gr/~ionandr/lingspam_public.tar.gz)
and PU1 (http://www.iit.demokritos.gr/~ionandr/pu1_encoded.tar.gz) by
Androutsopoulos and colleagues
* Spambase (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase)
hosted at the UCI Machine Learning Repository
(http://www.ics.uci.edu/~mlearn/MLSummary.html) and built by George Forman
and colleagues
These are relatively standard collections used for evaluating spam
detection approaches, as you can see in my bibliography
(http://liinwww.ira.uka.de/bibliography/Ai/MLSpamBibliography.html).
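As a rough illustration of how one of these collections might be read into a program, here is a minimal Python sketch for Spambase. It assumes the usual layout of the UCI data file: comma-separated rows of numeric attributes with the class label last (1 for spam, 0 for legitimate). The function name parse_spambase and the local filename spambase.data are placeholders of mine, not part of the collection.

```python
import csv


def parse_spambase(lines):
    """Split comma-separated rows into feature vectors and class labels.

    Assumes every non-empty row holds numeric attributes followed by a
    final 0/1 class label (1 = spam), as in the UCI Spambase data file.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(label))
    return features, labels


if __name__ == "__main__":
    # Hypothetical local copy of the data file; adjust the path as needed.
    with open("spambase.data", newline="") as handle:
        X, y = parse_spambase(handle)
    print(len(X), "messages,", sum(y), "labelled as spam")
```

The function accepts any iterable of text lines, so it can be tried on a few hand-written rows before downloading the full file.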
Kolcz and colleagues comment in their paper (available at
http://www-ai.ijs.si/DunjaMladenic/TextDM01/papers/Kolcz_TM.pdf) that they
plan to make their spam collection public at their corporation website
(http://personalogy.net/). This collection may be very interesting.
Alternatively, you can build a spam vs. legitimate collection using widely
known spam repositories. The problem is legitimate email, which is not
usually public. Like Androutsopoulos, you may use messages from a public
list, but the most sensible approach is to use some publicly donated personal
email, in order to reflect personal email usage.
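To make the idea concrete, here is a minimal Python sketch of assembling such a collection. It assumes you have already saved the spam messages and the legitimate messages as one raw message file each in two separate directories. The directory names, the function names load_messages and build_collection, and the 80/20 train/test split are all illustrative choices of mine, not something prescribed by any of the collections above.

```python
import email
import random
from email import policy
from pathlib import Path


def load_messages(directory, label):
    """Parse every file in `directory` as an email message and tag it with `label`."""
    items = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            msg = email.message_from_binary_file(handle, policy=policy.default)
        body = msg.get_body(preferencelist=("plain",))
        try:
            text = body.get_content() if body is not None else ""
        except LookupError:
            text = ""  # unknown charset; real spam often declares bogus ones
        items.append({"subject": str(msg.get("Subject", "")),
                      "text": text,
                      "label": label})
    return items


def build_collection(spam_dir, legit_dir, test_fraction=0.2, seed=0):
    """Merge both classes, shuffle reproducibly, and split into train/test lists."""
    messages = load_messages(spam_dir, "spam") + load_messages(legit_dir, "legit")
    random.Random(seed).shuffle(messages)
    n_test = int(len(messages) * test_fraction)
    return messages[n_test:], messages[:n_test]


if __name__ == "__main__":
    # Hypothetical directory names; point these at your own message files.
    train, test = build_collection("spam_messages", "legit_messages")
    print(len(train), "training messages,", len(test), "test messages")
```

Only the plain-text part of each message is kept here; HTML-only messages end up with an empty text field, so a real experiment would probably want to handle those as well.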
Hope this helps
>Cormac
>
>
>----------------------------
>Cormac O'Brien
>Department of Linguistics
>University of Gothenburg
>Box 200
>S-405 30 Gothenburg
>Sweden
>
>0046 (0)31 773 5234
_______________________________________________________________________________
Jose Maria Gomez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid - CEES
28670 - Villaviciosa de Odon - MADRID
(+34) 912115670
jmgomez at dinar.esi.uem.es
http://www.esi.uem.es/~jmgomez/
_______________________________________________________________________________
Spanish law guarantees privacy in electronic communications. This
electronic transmission is strictly confidential and intended solely for
the addressee. If you are not the intended addressee, you are kindly
requested not to disclose nor to copy this transmission and to notify us as
soon as possible.