# Statistical Data Mining: Introduction

## Class structure

• Credit based mostly on homeworks
• readings (see web since I'll often forget to mention them)
• You might have to surf the web to find background readings. If so, send me pointers to useful links and I'll put them on the web page.
• If you don't understand -- ask!
• Final exam

## What is data mining?

Data mining doesn't have a good definition:
• Kurt Thearling views it as a set of problems.
• Andrew Moore views it as a collection of techniques.
• Our definition will be statistics for very large datasets.

## The data

The data comes from many sources:
• Marketing
• Medicine
• The Web
• Text sources
• UCI / KDD! (practice data sets)
• Maybe even financial data (yuck! It has essentially no signal, so it can generally be dealt with using classical statistics).
Unifying properties of the data:
• Something worth predicting is measured
• Oftentimes lots of signal (think high R-squared values)
• Always lots of variables (think 1000s or millions)
• Lots of data (think Gigabytes)
The data is mostly observational:
• Not collected for science, so the goal isn't to find truth (i.e. true models).
• Goal is to predict.
• Causal reasoning probably isn't justified. Is it necessary though?

## The methods

• Machine learning (used to be called AI)
• modification of existing statistical methods
• commercial ones are often ad hoc
• see *The Elements of Statistical Learning* (you might want to buy this book)

## The modeling spectrum

Under-fitting vs over-fitting
• Simple models will under-fit (traditional statistics)
• Complex models have more parameters than data, and hence often over-fit.
• (Aside: what is over-fitting? Ill-defined; draw a graph)
• Goal is to find a compromise between them
• Linear regression
• If it finds signal, we know that it is real
• very finite dimensional
• as traditional statistics as you can get
• Nearest neighbor
• Will find "signal" in anything
• dimensionality is equal to number of data points
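The contrast above can be sketched with a toy simulation (synthetic pure-noise data and hand-rolled fits; this is an illustration, not any method from the readings). Both models are trained on data where y is independent of x: 1-nearest-neighbor reproduces its training data exactly, "finding signal" where there is none, while the least-squares line leaves essentially all the noise unexplained.

```python
import random

random.seed(0)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]  # y independent of x: pure noise

# Least-squares fit y ~ a + b*x
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

def knn1(x0):
    # 1-nearest-neighbor: predict the y of the closest training x
    i = min(range(n), key=lambda j: abs(x[j] - x0))
    return y[i]

# Training mean squared error of each model
mse_lin = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n
mse_knn = sum((yi - knn1(xi)) ** 2 for xi, yi in zip(x, y)) / n
print(mse_lin, mse_knn)  # 1-NN training MSE is exactly 0
```

The zero training error of 1-NN is exactly the over-fitting problem: with dimensionality equal to the number of data points, training error tells us nothing about predictive performance.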
Which is better? Arguments for linear regression:
• It deals with statistical significance well.
• It is efficient.
• It lives in standard-error space (+/- 1/sqrt(n)).
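The +/- 1/sqrt(n) claim can be checked with a quick simulation (a sketch with made-up parameters, not a derivation): the standard error of a sample mean shrinks like 1/sqrt(n), so quadrupling the sample size should roughly halve it.

```python
import math
import random

random.seed(1)

def se_of_mean(n, reps=2000):
    # Empirical standard deviation of the sample mean of n standard
    # normal draws, estimated over many replications
    means = [sum(random.gauss(0, 1) for _ in range(n)) / n
             for _ in range(reps)]
    m = sum(means) / reps
    return math.sqrt(sum((v - m) ** 2 for v in means) / reps)

ratio = se_of_mean(100) / se_of_mean(400)
print(ratio)  # close to sqrt(400 / 100) = 2
```

This 1/sqrt(n) rate is what classical standard errors and significance tests are built on; methods like nearest neighbor give up that guarantee in exchange for flexibility.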