Last modified: Mon Aug 22 15:53:00 EDT 2005 by Dean Foster

Statistical Data Mining: OUTLINE

Outline: (26 days of class)

Introduction: The modeling spectrum (1 class)
- linear regression: the paradigm low-dimensional model
- nearest neighbor: the paradigm infinite-dimensional model
- course goals: find a happy middle ground
Introduction to high-dimensional data (intuition isn't a good guide anymore)
Variable Selection: (4 classes)
- Bonferroni/Risk inflation
- FDR/Simes
- alpha spending, alpha investing
- Information theory
Wavelets: Such pretty pictures!
Loss functions (1 class)
- Classification loss
- proper scoring rules: mixtures of classification loss
- KL divergence, quadratic losses
Lasso (1 class): l1 priors
Regularization (1 class): l2 priors
Computing p-values (3 classes)
- White estimator / GEEs
- Bennett's bound and other "tight" probabilistic bounds
Variable creation (4 classes)
- interactions
- missing data
- RKHS
- PCA
- tree stubs
Searching for natural kinds (2 classes)
- clustering
- RBF
Text data (2 classes)
- Wikipedia
- bag of words model
- Using other people's parses
Inductive Logic Programming (2 classes)
- citation graphs (e.g. links in Wikipedia / WWW)
- expanding activation
Non-regression methods (4 classes)
- SVM
- Trees
- boosting
- comparison to regression

(current total: 27 classes)

Last modified: Thu Oct 6 12:02:12 EDT 2005