[Ads-l] ham (antonym of spam) 2003
Chris Waigl
chris at LASCRIBE.NET
Fri Jun 29 07:18:05 UTC 2018
One key aspect of the context in which "ham" arose as a counterpart to
"spam" is Bayesian spam filtering. This was the first technique with an
effectiveness that made a real difference in the fight against the spam
flood, and more importantly, could adapt automatically to new types of
spam. This was a great improvement over the older methods that relied on
the manual or sometimes automatic extraction of "spammy" keywords. Today's
machine-learning-based anti-spam methods still belong to the same wider
methodological family. What matters here is that the method requires a
corpus of labeled samples of both spam email messages and those that are
not spam in order to train a statistical classifier. Labeling means that
during training phases, the classifier ingests a message and the associated
label. For spam messages, the actual label would typically be "spam". For
messages that aren't spam, I remember the label "ham" being used early
on.
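To make the idea of labeled training samples concrete, here is a minimal
sketch in Python. The toy messages and the function name train are invented
for illustration and are not taken from any actual filter; the sketch only
shows the "ingest a message plus its label" step, not a full classifier.

```python
from collections import Counter

# Hypothetical toy corpus: each training item is a (message, label) pair.
training = [
    ("cheap pills from Nigeria", "spam"),
    ("urgent transfer Nigeria bank", "spam"),
    ("lunch on Friday?", "ham"),
    ("minutes from the Friday meeting", "ham"),
]

def train(samples):
    """Count, per label, how many messages contain each token."""
    msg_counts = Counter()
    token_counts = {"spam": Counter(), "ham": Counter()}
    for text, label in samples:
        msg_counts[label] += 1
        # set() so a token is counted once per message, not once per occurrence
        token_counts[label].update(set(text.lower().split()))
    return msg_counts, token_counts

msg_counts, token_counts = train(training)
print(msg_counts["spam"], msg_counts["ham"])
print(token_counts["spam"]["nigeria"], token_counts["ham"]["nigeria"])
```

A statistical filter would then turn these per-label counts into per-word
probabilities.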
There was a flurry of activity around Bayesian spam filters in academic
papers (and apparently in published software) around 1998, but during a
quick read through the key articles (e.g., Sahami et al.
http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf ) I didn't
find any mention of ham. The labels are called "legitimate" or "nonspam"
or "nospam". The first mention of ham that I can quickly find is in this
blog post
http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html by
Gary Robinson, from 2002, which explains a particular type of Bayesian spam
filtering method (there was a jump in the quality of filtering that is
dated to around 2002). Here are the sentences that illustrate how he uses
the terms, and how Tim Peters, whom he quotes, uses them. Note that both
spam and ham are count nouns for Robinson:
===
Let us start in an imaginary world where there are exactly as many spams as
nonspams in the training corpus. Then I'll bring it back to the real world
at the end. For now, there are the same number of spams and hams.
...
So, suppose we have a corpus with the same number of spams as hams, and we
want to calculate an expectation for the word "Nigeria." We start looking
through emails, one by one, incrementing n for each email that contains
"Nigeria," and incrementing y if that email is a spam.
...
This is all well and good, except for one thing: we don't have the same
number of spams and hams in our training data.
The question is, do we want to give more weight to evidence in favour of
spamminess or hamminess because of the fact that the particular individual
using a system built on this technique might happen to receive more ham
than spam (or vice versa)?
...
Here I will quote from Tim Peters' description of the Grahamian method (by
personal email):
> suppose Nigeria appears in 50 ham and 50 spam... In [Graham's]
> scheme, it also depends on the total number of hams and of spams. Say
> there are 100 spams and 200 hams. Then Graham effectively says "OK, it
> appears in half of the spams, but only a quarter of the hams, so if I
> see Nigeria it's twice as likely to be a spam, so the spamprob is
> really 2/3"
(If you're familiar with Graham's technique, note that we're ignoring the
fact that he multiplies the ham counts by 2, which testing is showing is
not necessary or helpful when doing things as described in this essay.)
...
Let p(w) be the result of the Grahamian calculation described by Tim
Peters. It is easy to see that the expression (n * p(w)), where n is the
number of "Nigeria" instances looked at, approximates the value the counter
I describe above WOULD have had if we actually had the same number of spams
and hams (of course this becomes more true the larger the number of spams
and hams we have).
For instance, in Tim's example, let's assume that instead of 100 spams and
200 hams, there were 150 spams and 150 hams, leaving the total number of
emails at the same 300. Then "Nigeria" would have appeared in approximately
75 spams and 37.5 hams. So, our counters would have arrived at
approximately the values:
...
The expectation of ham or spam for Nigeria could both be separately
calculated using a binomial random variable with a beta prior.
====
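As a rough sketch of the arithmetic in the Tim Peters example quoted above,
the following Python reproduces the 2/3 figure and the rescaled counts of 75
and 37.5. The function names are my own and are not taken from Robinson's or
Graham's code, and, as in the essay, the doubling of ham counts is ignored.

```python
from fractions import Fraction

def grahamian_spamprob(spam_with_word, ham_with_word, n_spam, n_ham):
    """Per-word spam probability from the word's relative frequency in each class."""
    spam_rate = Fraction(spam_with_word, n_spam)  # share of spams containing the word
    ham_rate = Fraction(ham_with_word, n_ham)     # share of hams containing the word
    return spam_rate / (spam_rate + ham_rate)

def rescale_counts(spam_with_word, ham_with_word, n_spam, n_ham):
    """Counts expected if the same total were split evenly between spam and ham."""
    half = Fraction(n_spam + n_ham, 2)
    return (Fraction(spam_with_word, n_spam) * half,
            Fraction(ham_with_word, n_ham) * half)

# "Nigeria" appears in 50 of 100 spams and 50 of 200 hams.
p = grahamian_spamprob(50, 50, 100, 200)
print(p)  # 2/3

# With 150 spams and 150 hams instead, keeping the total at 300:
spam_eq, ham_eq = rescale_counts(50, 50, 100, 200)
print(float(spam_eq), float(ham_eq))  # 75.0 37.5
```

Using exact fractions rather than floats keeps the 2/3 result free of
rounding noise.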
I did a quick read of the Paul Graham stuff, but what I saw didn't use
"ham".
Chris
On Thu, Jun 28, 2018 at 9:28 PM, Barretts Mail <mail.barretts at gmail.com>
wrote:
> Nice, thank you! One of those sources is likely the UD entry’s source.
>
> It does seem that most sources have zeroed in on email as the only form of
> spam. The Wiktionary (https://en.wiktionary.org/wiki/spam) use of
> “electronic messages” for spam could be construed to include Facebook and
> other cybercomments, but that doesn’t seem like the ordinary meaning to
> me. BB
>
> > On 28 Jun 2018, at 22:23, Ben Zimmer <bgzimmer at GMAIL.COM> wrote:
> >
> > Paul McFedries did an entry for "ham" in 2003 with examples from earlier
> > that year.
> >
> > https://wordspy.com/index.php?word=ham
> >
> > I'm reminded of a 2007 coinage, "bacn" ("email you want but not right
> > now"):
> >
> > https://wordspy.com/index.php?word=bacn
> >
> > "Bacn" was nominated that year for ADS WOTY (Most Useful category) and
> > was also a runner-up for Oxford WOTY.
> >
> > https://www.americandialect.org/Word-of-the-Year_2007.pdf
> > https://blog.oup.com/2007/11/Locavore/
> >
> >
> > On Fri, Jun 29, 2018 at 12:14 AM, Barretts Mail
> > <mail.barretts at gmail.com> wrote:
> >
> >> A graphic for Akismet, the spam-catcher, says “1. Visitors submit
> >> comments on your blog…. 3. Akismet tells your blog whether it’s ham or
> >> spam.”
> >>
> >> Neither the Oxford Living Dictionaries nor Merriam-Webster
> >> (https://www.merriam-webster.com/dictionary/ham) nor Dictionary.com
> >> (http://www.dictionary.com/browse/ham?s=t) lists this meaning.
> >>
> >> I found 189 entries for “ham” on Urban Dictionary, and on 30 August
> >> 2003, Paul Zurawski defined “ham” as “real email - not spam”
> >> (https://bit.ly/2IBy3ZT).
> >>
> >> Wiktionary (https://en.wiktionary.org/wiki/ham) has “(Internet,
> >> informal, uncommon) Electronic mail that is wanted; mail that is not
> >> spam or junk mail.” and lists “spam” as an antonym. The “uncommon” label
> >> notwithstanding, the expression “ham or spam” yields 58K raw Googits.
> >>
> >> As the Akismet example shows, “ham” is not limited to email but also
> >> includes blog comments, and probably any other sort of cybercomment,
> >> such as on Facebook.
> >>
> >> It’s possible that the substitution of ham/SPAM in cooking may have
> >> contributed to this expression. The earliest such usage I found is 1999:
> >>
> >> 1 December 1999
> >> “Oral History: Wilmer “Bill” Cox Morris” by the Historical Committee of
> >> the Outrigger Canoe Club (https://bit.ly/2lDWc90)
> >> I got some pressed ham or Spam in a sandwich with onion in it and an
> >> apple and a cup of coffee.
> >>
> >> I have not searched Google Groups or any other such archive.
>
>
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org
>
--
Chris Waigl . chris.waigl at gmail.com . chris at lascribe.net
http://eggcorns.lascribe.net . http://chryss.eu
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org