Statistical Data Mining

Schedule (T Th 3:00-4:30 in G92)

Sept 8: The modeling spectrum
- read: HTF pp 415-420

Sept 13: A picture of high dimensions
- No reading. Start on the "new" first homework assignment.
Sept 15: Nearest neighbors in high dimensions
- read: Johnson-Lindenstrauss lemma.
- read: Database friendly projections
- read: Yuval Peres's chapter on Johnson-Lindenstrauss from his lecture notes.
- You now know enough to complete homework 1.

Sept 20: Stepwise regression
- start homework 2
- read: Best basis problem
- read: William J. Welch, "Algorithmic Complexity: Three NP-Hard Problems in Computational Statistics," (.pdf) J. Statist. Comput. Simul. 1982. (added 2014: There is more recent work on the NP-completeness of variable selection. Natarajan has one, Michael Jordan has a piece, and I'm working on one.)
- read: Wikipedia's article on NP-complete. Someone should add the Welch result to the list of NP-complete problems. Do not get distracted by the page on sudoku.
Sept 22: Bonferroni
- read carefully the first 4 sections of Risk inflation.

Sept 27: Risk Inflation (Homework 1 due)
- look over: Donoho and Johnstone (1994) Ideal Denoising in an Orthonormal Basis Chosen from a Library of Bases (.pdf)
Sept 29: Curve fitting and wavelets
- Read carefully: Donoho and Johnstone's (1994) wavelet paper.

Oct 4: Proper scoring rules
- Read: General method for comparing probability assessors, by Mark Schervish.
Oct 6: Alternative scoring rules and calibration (Homework 2 due)

Oct 11: Spam: Bag of words and Naive-Bayes
- Talk in OPIM at Noon in G50 (free food)
- Read NYT data mining article (mirror)
- Read: Madigan's paper on Naive Bayes
- Read: as usual the wiki on Bayes filtering and Naive Bayes.
- Read about the Good-Turing estimator for rare event probabilities.
Oct 13: Tails
- Read: proof of Chernoff used in class.
- Read: Either Hal White's original sandwich estimator paper, or the GEE paper by Liang and Zeger

Oct 17/18: (Fall break)
Oct 20: Sandwich estimator

Oct 25: Graphs
- discussion of A* algorithm
Oct 27: SFS: streaming feature selection
- Note: IB1 is a nearest neighbor algorithm
- read: SFS by Aha

Nov 1: Alpha spending
- Homework 3 due; start on homework 4
- Read one page background on alpha spending.
- Read: Excess discovery count (.ps) by Bob and me
Nov 3: FDR and EDC
- Surf Yoav's page on FDR.

Nov 8: Support vector machines (guest lecture by Jon)
Nov 10: RKHS
- Homework 4 due
- nice source of information about support vectors and RKHS
- In particular see RKHS and regression using rkhs. You might find it easier to read the pdf files rather than the html file.
- Read section 5.8 of Hastie, Tibshirani, Friedman that I handed out.

Nov 15: Support vector machines: part II (guest lecture by Jon)
Nov 17: Fitting RKHS using LS

Nov 22: Clustering
- read handout: pages 412-413 and 461-464 of Hastie, Tibshirani, Friedman.
- Read Tali Tishby and Eyal Krupka's NIPS paper.

Nov 29: Trees
- read handout: pages 266-289 of Hastie, Tibshirani, Friedman.
Dec 1: Information theory
- read Bob's gentle introduction to information theory.
- Feel free to talk to Bob, Adi or me about information theory.
- A book on information theory that is better than Harry Potter
- For general information on information theory.

Dec 6: Alternative models of data
- read Rick's and my review paper.
- My first Annals of Statistics paper was in this area. Guess what? It was on regression! I've written several papers on this.
Dec 8: Summary

Dec 19: Homework 5 due
Dec 21: Late date for homework 5.

General information

Some estimate that 4 exabytes of data are now being produced each year. This is a different world from the one Fisher pioneered. He developed a theory that can deal with a 2x2 contingency table, which might hold a total of 4 bytes of data. This 10¹⁸-fold increase in data is changing the world of statistics. The goal of this course is to follow this change.
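
The factor of 10¹⁸ is just the ratio of the two sizes quoted above. A minimal check, assuming decimal (SI) units, where one exabyte is 10¹⁸ bytes, and one byte per cell of the 2x2 table; the variable names are illustrative only:

```python
# Assumption: SI (decimal) units, so 1 exabyte = 10**18 bytes.
BYTES_PER_EXABYTE = 10**18

yearly_bytes = 4 * BYTES_PER_EXABYTE  # "4 exabytes ... each year"
table_bytes = 4                       # a 2x2 table, assuming one byte per cell

# Integer division is exact here because both sizes are multiples of 4.
ratio = yearly_bytes // table_bytes

# Report the ratio as a power of ten by counting the digits after the leading 1.
print(f"ratio = 10^{len(str(ratio)) - 1}")
```

With binary units (one exabyte taken as 2⁶⁰ bytes) the ratio would be about 1.15 × 10¹⁸, so the order of magnitude is the same under either convention.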

Exactly what data mining is depends on who you talk to. For example, Andrew Moore takes a very wide view of data mining. He includes lovely topics ranging from economics (e.g. game theory) to classical AI (e.g. the A* algorithm). This contrasts with the approach I will take: I'll focus much more heavily on statistical regression.

I've written a crude outline of what the course will cover.

Prerequisites

This course is targeted at PhD students. Some mathematical sophistication will be assumed. You will be expected to carefully read research papers. The primary statistics tool will be regression, so at least a few weeks of background on that would be desirable. If you are unsure, send me an email and we can chat.

Last modified: Thu Sep 25 13:00:07 EDT 2014