Last modified: Wed Oct 19 19:34:08 EDT 2005
by Dean Foster

# Statistical Data mining: Bayes and spam

## Administrivia

## The war on spam

- Email history:
  - In the beginning... you called to ask whether an email went through
  - Then it was reliable (early 90's)
  - Now there is spam, and you call to ask whether the email was received

- How can we solve this problem?
  - In economics: with money (i.e. pay to send email--spammers can't afford it)
  - In theoretical CS: with cryptography (i.e. sign messages; spammers don't have valid signatures)
  - In applied CS: via black listing (i.e. Real Time Blackholes)
  - In law: with laws, obviously
  - In statistics: via Naive Bayes

## Simplify the data

- Bag of words / set of words
  - Lose order
  - "Dog bites man" and "man bites dog" become the same
- Wonderful for statistics: one big table
  - Rows are emails
  - Columns are word counts
  - Y (spam / not spam) is hand coded

- Data is expensive: Must ask users to classify
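
The bag-of-words table above can be sketched in a few lines. The toy emails below are made up for illustration; the point is that word order disappears, so the first two rows come out identical.

```python
from collections import Counter

# Each email becomes a row of word counts; order is lost, so
# "dog bites man" and "man bites dog" get the same row.
emails = [
    "dog bites man",
    "man bites dog",
    "buy cheap pills now",
]

# Vocabulary = union of all words seen (the table's columns).
vocab = sorted({w for e in emails for w in e.split()})

# One row of counts per email.
table = [[Counter(e.split())[w] for w in vocab] for e in emails]
```

With real data the table is enormous (one column per distinct word), which is exactly the n vs. p problem discussed next.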

## Why not a full regression model?

- n = 100, p = 100,000
- Very quickly run out of degrees of freedom: with p >> n there are far more coefficients than observations

## Toy regression model (basically what my filter does)

- Regress Y on each word separately: this generates p different simple regressions
- Average all these y-hats
- This generates a "score" for each email
- Avoids multiple regression entirely

## Bayesian model: Justification of the toy regression methodology

- P(Y|X1,X2,...,Xp) = k P(X1,...,Xp|Y) P(Y)
- Assume conditional independence: P(X1,...,Xp|Y) = P(X1|Y) * ... * P(Xp|Y)
- Now the log odds ratio of Y given the Xs is easy:
- log(P(Y=1|Xs)/P(Y=0|Xs)) = log(P(Y=1)/P(Y=0)) + Sum_i log(P(Xi=xi|Y=1)/P(Xi=xi|Y=0))
- Called: Idiot's Bayes, or Naive Bayes
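
The log-odds formula above can be computed directly. This sketch treats each word as a present/absent indicator; the prior and per-word probabilities in the usage note are hypothetical numbers, not estimates from real mail.

```python
import math

# Naive Bayes log odds: log prior odds plus a sum of per-word log
# likelihood ratios, exactly the formula above.
def log_odds(prior1, prior0, p_given_1, p_given_0, present):
    # present[i] = 1 if word i appears in the email, else 0.
    total = math.log(prior1 / prior0)
    for p1, p0, x in zip(p_given_1, p_given_0, present):
        # P(Xi = xi | Y): p if the word is present, 1 - p otherwise.
        total += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
    return total
```

For example, with even priors and one word that appears in 80% of spam but 10% of ham, `log_odds(0.5, 0.5, [0.8], [0.1], [1])` gives log(8), a strong push toward spam.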

## Aside: Helped get OJ off the hook

- In computing the probability of a blood match, this model used to be used
- Probabilities of 1 in a billion were then quoted
- Then someone asked: what is the chance of an error in the methodology? It
  is much higher than 1 in a billion.
- The expert looks stupid.

## Obviously not calibrated

- We could calibrate it
- To use the output in other systems, calibration might be useful
- But for filtering it isn't necessary
- We just need to pick a threshold and kill everything over that threshold

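
Filtering reduces to one comparison: since only the ranking of scores matters, an uncalibrated score is fine. A minimal sketch, with an arbitrary cutoff of 0 (log odds above 0 lean toward spam):

```python
# Uncalibrated filtering: compare the score (e.g. the Naive Bayes log
# odds) against a fixed cutoff. The threshold value here is arbitrary.
THRESHOLD = 0.0

def is_spam(score, threshold=THRESHOLD):
    # Kill (flag) everything over the threshold.
    return score > threshold
```

In practice the threshold would be tuned to trade false positives against false negatives, as in the error rates quoted below.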
## How effective is it?

- Sahami (1998): false negative = 12%, false positive = 3%
- With some hand tuning: false negative = 4%, false positive = 0%
- Androutsopoulos (2000): false negative under 1%

## Problem: How do we estimate P(X|Y)?

- Good-Turing methodology (aka empirical Bayes)
- Smoothed count: r* = (r+1) N(r+1)/N(r), where N(r) is the number of distinct words seen exactly r times
- See Gale's article
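
The Good-Turing adjustment above can be sketched directly. This is the raw, unsmoothed version (real use requires smoothing the N(r) counts when N(r+1) = 0, as Gale discusses); the toy counts in the test are made up.

```python
from collections import Counter

# Raw Good-Turing: the adjusted count for a word seen r times is
# r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of distinct
# words observed exactly r times.
def good_turing(counts):
    # counts: dict mapping word -> observed count r
    freq_of_freq = Counter(counts.values())  # N(r)
    smoothed = {}
    for word, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        smoothed[word] = (r + 1) * n_r1 / n_r
    return smoothed
```

Note how rare words get shrunk toward what even rarer words suggest: this is the empirical-Bayes flavor, borrowing strength across words instead of trusting tiny raw counts.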

quote:
Olny srmat poelpe can.

cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was
rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a
rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr
the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist
and lsat ltteer be in the rghit pclae. The rset can be a taotl mses
and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn
mnid deos not raed ervey lteter by istlef! ! ! , but the wrod as a
wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

dean@foster.net