Last modified: Wed Oct 19 19:34:08 EDT 2005 by Dean Foster

Statistical Data mining: Bayes and spam

Admistrivia

The war on spam

Email history:
- In the beginning... you called to ask if an email went through
- Then it was reliable: (Early 90's)
- Now there is spam and you call to ask if the email was recieved
How can we solve this problem?
- In economics: with money (i.e. pay to send email--spammers can't afford it)
- In theoretical CS: with cryptography (i.e. sign messages. Spammers don't have valid signatures.)
- In applied CS: via black listing (i.e. Real Time Blackholes)
- In law: with laws obviously
- In statistics: via Naive bayes

Simplify the data

Bag of words / set of words
Lose order
"Dog bites man" and "man bites dog" both the same.
Wonderful for statistics: one big table
- Rows are emails
- Columns are word counts
- Y is hand coded
Data is expensive: Must ask users to classify

Why not a full regression model?

n = 100, p = 100,000
Very quickly run out of degrees of freedom

Toy regression model (basically what my filter does)

regress Y on each word: generates p different regressions
Average all these y-hats
Generates a "score" for each email
Avoids multiple regression

Bayesian model: Justification of the toy regression methodology

P(Y|X1,X2,...,Xp) = k P(Xs|Y) P(Y)
Assume: P(Xs|Y) = P(X1|Y)*...P(Xp|Y)
Now log odds ratio of Y|Xs is easy
log(P(Y=1|Xs)/P(Y=0|Xs)) = log(P(Y=1)/P(Y=0)) + Sum log(P(Xi=xi|Y=1)/P(Xi=xi|Y=1))
Called: Idiot's Bayes, or Naive Bayes

Asside: Helped get OJ off the hook

In computing the probabiliyt of a blood match, this model used to be used
probabilities of 1 in a billion were then quoted
Then asked: What is the chance of an error in methodology? It is much higher than 1/billion.
Expert looks stupid.

Obviously not calibrated

We could calibrate it

To use the output in other systems this might be useful

But to use it for filtering it isn't necessary

We just need to pick a threashold and kill everything over that threashold

How effective is it?

Sahami (1998) false negative = 12% false positive = 3%
with some hand tuning false negative = 4% false positive = 0%
Androutsopoulos (2000) fals negative under 1%

Problem: How do we estimate P(X|Y)?

Good-Turing methodology (aka empirical Bayes)
(r+1) #(r+1)/#(r)
See Gale's article

quote:

Olny srmat poelpe can.

cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef! ! ! , but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

dean@foster.net