Last modified: Mon Aug 22 15:53:00 EDT 2005
by Dean Foster

# Statistical Data Mining: OUTLINE

## Outline: (26 days of class)

- Introduction: The modeling spectrum (1 class)
  - linear regression: the paradigm low-dimensional model
  - nearest neighbor: the paradigm infinite-dimensional model
  - course goals: find a happy middle ground

- Introduction to high-dimensional data (intuition isn't a good guide anymore)
- Variable Selection: (4 classes)
  - Bonferroni/Risk inflation
  - FDR/Simes
  - alpha spending, alpha investing
  - Information theory

- Wavelets: Such pretty pictures!
- Loss functions (1 class)
  - Classification loss
  - proper scoring rules: mixtures of classification loss
  - KL divergence, quadratic losses

- Lasso (1 class): l1 priors
- regularization (1 class): l2 priors
- Computing p-values (3 classes)
  - White estimator / GEEs
  - Bennett's bound and other "tight" probabilistic bounds

- variable creation (4 classes)
  - interactions
  - missing data
  - RKHS
  - PCA
  - tree stubs

- Searching for natural kinds (2 classes)
- Text data (2 classes)
  - Wikipedia
  - bag of words model
  - Using other people's parses

- Inductive Logic Programming (2 classes)
  - citation graphs (i.e., links in Wikipedia / WWW)
  - expanding activation

- Non-regression methods (4 classes)
  - SVM
  - Trees
  - boosting
  - comparison to regression

(current total: 27 classes)

Last modified: Thu Oct 6 12:02:12 EDT 2005