# Statistical Data Mining: Bayes and Spam

## The war on spam

• Email history:
• In the beginning... you called to ask if an email went through
• Then it was reliable (early 90's)
• Now there is spam, and you call to ask if the email was received
• How can we solve this problem?
• In economics: with money (i.e. pay to send email--spammers can't afford it)
• In theoretical CS: with cryptography (i.e. sign messages. Spammers don't have valid signatures.)
• In applied CS: via black listing (i.e. Real Time Blackholes)
• In law: with laws, obviously
• In statistics: via Naive Bayes

## Simplify the data

• Bag of words / set of words
• Lose order
• "Dog bites man" and "Man bites dog" become the same.
• Wonderful for statistics: one big table
• Rows are emails
• Columns are word counts
• Y is hand coded
• Data is expensive: Must ask users to classify
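The bag-of-words reduction above can be sketched in a few lines of Python. This is a minimal illustration (the function name `bag_of_words` and the whitespace tokenizer are my choices, not part of the original notes):

```python
from collections import Counter

def bag_of_words(emails):
    """Convert each email to a word-count bag -- word order is discarded."""
    return [Counter(e.lower().split()) for e in emails]

# Losing order means these two collapse to the same bag:
a = bag_of_words(["Dog bites man"])[0]
b = bag_of_words(["man bites dog"])[0]
assert a == b  # both are {'dog': 1, 'bites': 1, 'man': 1}
```

Stacking these bags row by row (one row per email, one column per word) gives the one big table the slide describes.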

## Why not a full regression model?

• n = 100, p = 100,000
• Very quickly run out of degrees of freedom

## Toy regression model (basically what my filter does)

• Regress Y on each word separately: generates p univariate regressions
• Average all these y-hats
• Generates a "score" for each email
• Avoids multiple regression
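The toy method above can be sketched as follows: fit p separate one-variable least-squares regressions and average their fitted values. This is my reconstruction of the idea, not the filter's actual code (the name `toy_scores` is hypothetical):

```python
import numpy as np

def toy_scores(X, y):
    """Fit p separate univariate regressions of y on each word column,
    then average the p fitted values into one score per email."""
    n, p = X.shape
    ybar = y.mean()
    yhat = np.zeros((n, p))
    for j in range(p):
        xj = X[:, j]
        xbar = xj.mean()
        ss = ((xj - xbar) ** 2).sum()
        # slope of the j-th univariate regression (0 if the word is constant)
        beta = ((xj - xbar) * (y - ybar)).sum() / ss if ss > 0 else 0.0
        yhat[:, j] = ybar + beta * (xj - xbar)  # fitted values from regression j
    return yhat.mean(axis=1)                    # the email "score"
```

Because each regression uses only one column, this never runs out of degrees of freedom even when p is far larger than n.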

## Bayesian model: Justification of the toy regression methodology

• P(Y|X1,X2,...,Xp) = k P(Xs|Y) P(Y)
• Assume: P(Xs|Y) = P(X1|Y) * ... * P(Xp|Y)
• Now log odds ratio of Y|Xs is easy
• log(P(Y=1|Xs)/P(Y=0|Xs)) = log(P(Y=1)/P(Y=0)) + Sum log(P(Xi=xi|Y=1)/P(Xi=xi|Y=0))
• Called: Idiot's Bayes, or Naive Bayes
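The log-odds formula above translates directly into code. A minimal sketch, assuming a Bernoulli word-presence model for P(Xi|Y) (the function name and arguments are mine):

```python
import math

def log_odds(x, prior1, prior0, p_given_spam, p_given_ham):
    """log P(Y=1|x)/P(Y=0|x) under the Naive Bayes independence assumption.

    x:            0/1 word-presence vector for one email
    prior1/0:     P(Y=1), P(Y=0)
    p_given_spam: P(Xi=1|Y=1) for each word
    p_given_ham:  P(Xi=1|Y=0) for each word
    """
    s = math.log(prior1 / prior0)
    for xi, p1, p0 in zip(x, p_given_spam, p_given_ham):
        # P(Xi = xi | Y) for a Bernoulli word: p if present, 1-p if absent
        s += math.log((p1 if xi else 1 - p1) / (p0 if xi else 1 - p0))
    return s
```

A positive log-odds says "more likely spam than not"; each word contributes one additive term, which is why the sum of p separate per-word pieces (the toy regression's averaging) ends up justified.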

## Aside: Helped get OJ off the hook

• In computing the probability of a blood match, this model was used
• Probabilities of 1 in a billion were then quoted
• Then asked: what is the chance of an error in the methodology? Much higher than 1 in a billion.
• Expert looks stupid.

## Obviously not calibrated

• We could calibrate it
• To use the output in other systems this might be useful
• But to use it for filtering it isn't necessary
• We just need to pick a threshold and kill everything over that threshold
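The reason calibration is unnecessary for filtering: any monotone recalibration preserves the ranking of scores, so a cutoff on raw scores flags exactly the same emails as the corresponding cutoff on calibrated probabilities. A tiny illustration (the logistic map here stands in for any monotone calibration):

```python
import math

def sigmoid(z):
    """Logistic function: one example of a monotone calibration map."""
    return 1 / (1 + math.exp(-z))

scores = [-2.0, 0.5, 3.1, -0.7]   # raw, uncalibrated log-odds scores
raw_flags = [s > 0.0 for s in scores]
cal_flags = [sigmoid(s) > sigmoid(0.0) for s in scores]
assert raw_flags == cal_flags      # same emails get killed either way
```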

## How effective is it?

• Sahami (1998): false negative = 12%, false positive = 3%
• With some hand tuning: false negative = 4%, false positive = 0%
• Androutsopoulos (2000): false negative under 1%

## Problem: How do we estimate P(X|Y)?

• Good-Turing methodology (aka empirical Bayes)
• Adjusted count: r* = (r+1) N(r+1)/N(r), where N(r) = number of words seen exactly r times
• See Gale's article
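The Good-Turing adjustment can be sketched directly from the formula above. A minimal implementation, assuming the simple version with no smoothing of the N(r) counts (falling back to the raw count where N(r+1) = 0; the function name is mine):

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts: r* = (r+1) * N(r+1) / N(r),
    where N(r) = number of word types seen exactly r times."""
    N = Counter(counts.values())          # N(r): frequency of frequencies
    adjusted = {}
    for word, r in counts.items():
        if N.get(r + 1, 0) > 0:
            adjusted[word] = (r + 1) * N[r + 1] / N[r]
        else:
            adjusted[word] = float(r)     # no N(r+1) data: keep the raw count
    return adjusted
```

The effect is to shave probability mass off rare words and reserve it for words never seen in training, which is exactly what estimating P(X|Y) for spam needs.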
Quote:

> Olny srmat poelpe can.
>
> cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef! ! ! , but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

dean@foster.net