Last modified: Wed Oct 19 19:34:08 EDT 2005 by Dean Foster

Statistical Data mining: Bayes and spam


The war on spam

Simplify the data

Why not a full regression model?

Toy regression model (basically what my filter does)

Bayesian model: Justification of the toy regression methodology

Aside: Helped get OJ off the hook

Obviously not calibrated

  • We could calibrate it
  • To use the output in other systems this might be useful
  • But to use it for filtering it isn't necessary
  • We just need to pick a threshold and kill everything over that threshold
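
A minimal sketch of why calibration is not needed for filtering, assuming any calibration would preserve the ordering of the scores. The message ids, score values, threshold, and the helper name kill_over_threshold are all hypothetical, and the sigmoid is only a stand-in for some order-preserving rescaling, not the filter's actual method.

```python
import math


def kill_over_threshold(scores, threshold):
    """Return the ids of messages whose score exceeds the threshold."""
    return {msg_id for msg_id, s in scores.items() if s > threshold}


def sigmoid(x):
    """A strictly increasing map from the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical uncalibrated scores (made-up numbers, not real data).
raw_scores = {"msg1": -3.2, "msg2": 0.4, "msg3": 7.9, "msg4": 2.1}
raw_threshold = 2.0

# Rescale both the scores and the threshold with the same increasing map.
squashed_scores = {m: sigmoid(s) for m, s in raw_scores.items()}

killed_raw = kill_over_threshold(raw_scores, raw_threshold)
killed_squashed = kill_over_threshold(squashed_scores, sigmoid(raw_threshold))
```

Both calls kill the same set of messages: the decision depends only on which side of the threshold a score falls, so an order-preserving rescaling of scores and threshold leaves it unchanged.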

    How effective is it?

    Problem: How do we estimate P(X|Y)?


    Olny srmat poelpe can.

    cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef! ! ! , but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!
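
One common way to approach the estimation problem, sketched under assumptions the notes do not state: X is taken to be the set of words in a message, Y the spam/ham label, and words are treated as independent given the label, so P(X|Y) reduces to per-word frequencies with add-one smoothing. The training messages, function names, and the smoothing constant alpha are hypothetical.

```python
import math
from collections import Counter


def word_probs(messages, vocab, alpha=1.0):
    """Estimate P(word present | class) from messages of one class,
    using add-alpha smoothing so no probability is exactly 0 or 1."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()) & vocab)
    n = len(messages)
    return {w: (counts[w] + alpha) / (n + 2 * alpha) for w in vocab}


def log_odds_score(msg, p_spam, p_ham):
    """Sum log(P(w|spam) / P(w|ham)) over known words in the message.
    Words outside the vocabulary contribute nothing."""
    words = set(msg.lower().split())
    return sum(math.log(p_spam[w] / p_ham[w]) for w in words if w in p_spam)


# Made-up training messages, for illustration only.
spam = ["cheap pills now", "cheap watches"]
ham = ["meeting at noon", "lunch now"]
vocab = {w for msg in spam + ham for w in msg.lower().split()}

p_spam = word_probs(spam, vocab)
p_ham = word_probs(ham, vocab)

score_cheap = log_odds_score("cheap", p_spam, p_ham)
score_meeting = log_odds_score("meeting", p_spam, p_ham)
score_scrambled = log_odds_score("chaep", p_spam, p_ham)
```

In this sketch a scrambled word such as "chaep" matches nothing in the vocabulary and scores zero, even though a human reader recovers it easily, as the passage above shows.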