Last modified: Tue Sep 27 12:17:06 EDT 2005
by Dean Foster

# Statistical Data mining: Introduction

## Class structure

- credit based mostly on homeworks
- readings (see web since I'll often forget to mention them)
- You might have to surf the web to find background readings. If
so, send me pointers to useful links and I'll put them on the web page.
- If you don't understand -- ask!
- Final exam

## What is data mining?

Data mining doesn't have a good definition:
- Kurt Thearling views it as a set of problems.
- Andrew Moore views it as a collection of techniques.
- Our definition will be statistics for very large datasets.

# The data

The data comes from many sources:
- Marketing
- medicine
- the Web
- text sources
- UCI / KDD! (Practice data sets)
- Maybe even financial data (yuck! This doesn't have any signal
and so generally can be dealt with using classical statistics).

Unifying properties of the data:
- Something worth predicting is measured
- often lots of signal (think high R-squares)
- Always lots of variables (think 1000s or millions)
- Lots of data (think Gigabytes)

The data is mostly observational
- Not collected for science, so the goal isn't to find
truth (i.e. true models).
- Goal is to predict.
- Causal reasoning probably isn't justified. Is it necessary though?

## The methods

- Machine learning (used to be called AI)
- modification of existing statistical methods
- commercial ones are often ad hoc
- see Elements of
Statistical Learning (You might want to buy this book)

## The modeling spectrum

Under-fitting vs over-fitting
- Simple models will underfit (traditional statistics)
- Complex models have more parameters than data points, hence often
overfit.
- (Aside: what is overfitting? Ill-defined; draw graph)
- Goal is to find compromise between them
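A toy illustration of this trade-off (my own setup, not from the lecture): fit polynomials of increasing degree to noisy data and compare in-sample vs. out-of-sample error. Low degree underfits, high degree overfits.

```python
# Sketch of the under/over-fitting trade-off on synthetic data.
# All names and parameter choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)  # smooth signal + noise
    return x, y

x_train, y_train = make_data(30)     # small training set
x_test, y_test = make_data(1000)     # large holdout to estimate true error

def mse(degree):
    """Train/test mean squared error of a degree-d polynomial fit."""
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

for d in (1, 3, 15):
    tr, te = mse(d)
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Training error always falls as the degree grows (nested least-squares models), but test error is U-shaped: the compromise in the middle wins.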

Paradigm methods
- Linear regression
- If it finds signal, we know that it is real
- very finite dimensional
- as traditional statistics as you can get

- Nearest neighbor
- Will find "signal" in anything
- dimensionality is equal to number of data points
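A minimal sketch contrasting the two paradigms (my own toy example, not from the notes): linear regression compresses all n points into two numbers, while nearest neighbor keeps the entire dataset as its "model".

```python
# Two paradigm predictors on the same one-dimensional data.
# Data-generating setup is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 1, n)
y = 2 * x + rng.normal(0, 0.5, n)   # truly linear signal plus noise

# Linear regression: two parameters (slope, intercept), no matter how big n is.
slope, intercept = np.polyfit(x, y, 1)

def predict_linear(x0):
    return slope * x0 + intercept

# 1-nearest neighbor: the "parameters" are all n data points.
def predict_1nn(x0):
    return y[np.argmin(np.abs(x - x0))]
```

With truly linear signal, regression wins; if the signal were some arbitrary smooth curve, the nearest-neighbor rule would still track it while the line could not.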

Which is better?
- Advantage of regression
- It deals with statistical significance well.
- It is efficient
- Lives in standard error space (+/- 1/sqrt(n))

- Advantage of nearest neighbor
- It doesn't miss signal.
- Today's theorem: E|Y - yhat(X)| <= 2 E|Y - E(Y|X)|, where yhat(X)
is the nearest-neighbor prediction.
- assume n large, E(Y|X) = h(X) is smooth
- assume homoskedasticity (at least locally)
- then E|yhat(X) - E(Y|X)| is approximately E|Y - E(Y|X)|
- Now use the triangle inequality:
E|Y - yhat(X)| <= E|Y - E(Y|X)| + E|E(Y|X) - yhat(X)| which is approximately 2 E|Y - E(Y|X)|
- See Hastie, Tibshirani, Friedman p 415-420.

- It lives in standard deviation space (+/- 1)

- Neither uniformly best
- Goal of course is to meet in the middle
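A quick numerical check of today's theorem (my own simulation, not from the notes): with a smooth conditional mean h and homoskedastic noise, the leave-one-out 1-NN absolute error lands between the Bayes error E|Y - E(Y|X)| and twice it.

```python
# Simulation of the nearest-neighbor error bound.
# The choice of h(x) = sin(2*pi*x) and noise sd 0.5 is an assumption
# made for illustration; any smooth h with homoskedastic noise works.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = np.sort(rng.uniform(0, 1, n))
h = np.sin(2 * np.pi * x)            # smooth E(Y|X) = h(X)
y = h + rng.normal(0, 0.5, n)        # homoskedastic noise

# Leave-one-out 1-NN prediction: since x is sorted, the nearest
# other point is one of the two neighbors in sorted order.
nn_err = 0.0
for i in range(n):
    cands = [j for j in (i - 1, i + 1) if 0 <= j < n]
    j = min(cands, key=lambda k: abs(x[k] - x[i]))
    nn_err += abs(y[i] - y[j])
nn_err /= n

bayes_err = np.mean(np.abs(y - h))   # E|Y - E(Y|X)|
print(f"1-NN error {nn_err:.3f}, Bayes error {bayes_err:.3f}")
```

In this Gaussian setting the 1-NN error comes out near sqrt(2) times the Bayes error, comfortably inside the factor-of-2 bound from the triangle inequality.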

dean@foster.net