Last modified: Tue Sep 27 12:17:06 EDT 2005
by Dean Foster

# Statistical Data mining: Introduction

## Class structure

- credit based mostly on homeworks
- readings (see web since I'll often forget to mention them)
- You might have to surf the web to find background readings. If
so, send me pointers to useful links and I'll put them on the web page.
- If you don't understand -- ask!
- Final exam

## What is data mining?

Data mining doesn't have a good definition:
- Kurt Thearling views it as a set of problems.
- Andrew Moore views it as a collection of techniques.
- Our definition will be statistics for very large datasets.

# The data

The data comes from many sources:
- Marketing
- medicine
- the Web
- text sources
- UCI / KDD! (Practice data sets)
- Maybe even financial data (yuck! This doesn't have any signal
and so generally can be dealt with using classical statistics).

Unifying properties of the data:
- Something worth predicting is measured
- often lots of signal (think high R-squares)
- Always lots of variables (think 1000s or millions)
- Lots of data (think Gigabytes)

The data is mostly observational
- Not collected for science, so the goal isn't to find
truth (i.e. true models).
- Goal is to predict.
- Causal reasoning probably isn't justified. Is it necessary though?

## The methods

- Machine learning (used to be called AI)
- modification of existing statistical methods
- commercial ones are often ad hoc
- see Elements of
Statistical Learning (You might want to buy this book)

## The modeling spectrum

Under-fitting vs over-fitting
- Simple models will underfit (traditional statistics)
- Complex models have more parameters than data points, hence often
overfit.
- (Aside: what is overfitting? Ill-defined; draw graph)
- Goal is to find compromise between them
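A toy illustration of this trade-off (my own setup, not from the lecture): fit polynomials of increasing degree to noisy data and compare in-sample vs. out-of-sample error. Low degree underfits, high degree overfits.

```python
# Sketch of the under/over-fitting trade-off on synthetic data.
# All names and parameter choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)  # smooth signal + noise
    return x, y

x_train, y_train = make_data(30)     # small training set
x_test, y_test = make_data(1000)     # large holdout to estimate true error

def mse(degree):
    """Train/test mean squared error of a degree-d polynomial fit."""
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

for d in (1, 3, 15):
    tr, te = mse(d)
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Training error always falls as the degree grows (nested least-squares models), but test error is U-shaped: the compromise in the middle wins.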

Paradigm methods
- Linear regression
- If it finds signal, we know that it is real
- very finite dimensional
- as traditional statistics as you can get

- Nearest neighbor
- Will find "signal" in anything
- dimensionality is equal to number of data points
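A minimal sketch contrasting the two paradigms (my own toy example, not from the notes): linear regression compresses all n points into two numbers, while nearest neighbor keeps the entire dataset as its "model".

```python
# Two paradigm predictors on the same one-dimensional data.
# Data-generating setup is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 1, n)
y = 2 * x + rng.normal(0, 0.5, n)   # truly linear signal plus noise

# Linear regression: two parameters (slope, intercept), no matter how big n is.
slope, intercept = np.polyfit(x, y, 1)

def predict_linear(x0):
    return slope * x0 + intercept

# 1-nearest neighbor: the "parameters" are all n data points.
def predict_1nn(x0):
    return y[np.argmin(np.abs(x - x0))]
```

With truly linear signal, regression wins; if the signal were some arbitrary smooth curve, the nearest-neighbor rule would still track it while the line could not.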

Which is better?
- Advantage of regression
- It deals with statistical significance well.
- It is efficient
- Lives in standard error space (+/- 1/sqrt(n))

- Advantage of nearest neighbor
- It doesn't miss signal.
- Today's theorem: E|Y - yhat(X)| <= 2 E|Y - E(Y|X)|, where yhat(X)
is the nearest-neighbor prediction.
- assume n large, E(Y|X) = h(X) is smooth
- assume homoskedasticity (at least locally)
- then E|yhat(X) - E(Y|X)| is approximately E|Y - E(Y|X)|
- Now use the triangle inequality:
E|Y - yhat(X)| <= E|Y - E(Y|X)| + E|E(Y|X) - yhat(X)| which is approximately 2 E|Y - E(Y|X)|
- See Hastie, Tibshirani, Friedman p 415-420.

- It lives in standard deviation space (+/- 1)

- Neither uniformly best
- Goal of course is to meet in the middle
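A quick numerical check of today's theorem (my own simulation, not from the notes): with a smooth conditional mean h and homoskedastic noise, the leave-one-out 1-NN absolute error lands between the Bayes error E|Y - E(Y|X)| and twice it.

```python
# Simulation of the nearest-neighbor error bound.
# The choice of h(x) = sin(2*pi*x) and noise sd 0.5 is an assumption
# made for illustration; any smooth h with homoskedastic noise works.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = np.sort(rng.uniform(0, 1, n))
h = np.sin(2 * np.pi * x)            # smooth E(Y|X) = h(X)
y = h + rng.normal(0, 0.5, n)        # homoskedastic noise

# Leave-one-out 1-NN prediction: since x is sorted, the nearest
# other point is one of the two neighbors in sorted order.
nn_err = 0.0
for i in range(n):
    cands = [j for j in (i - 1, i + 1) if 0 <= j < n]
    j = min(cands, key=lambda k: abs(x[k] - x[i]))
    nn_err += abs(y[i] - y[j])
nn_err /= n

bayes_err = np.mean(np.abs(y - h))   # E|Y - E(Y|X)|
print(f"1-NN error {nn_err:.3f}, Bayes error {bayes_err:.3f}")
```

In this Gaussian setting the 1-NN error comes out near sqrt(2) times the Bayes error, comfortably inside the factor-of-2 bound from the triangle inequality.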

dean@foster.net