Statistical Data Mining
Schedule (T Th 3:00-4:30 in G92)
- Nov 8: Support vector machines (guest lecture by Jon)
- Nov 10: RKHS
- Homework 4 due
- A nice source of information about support vectors and RKHS.
- In particular, see RKHS and regression using RKHS. You might find it
easier to read the PDF files rather than the HTML file.
- Read section 5.8 of Hastie, Tibshirani, Friedman that I handed out.
- Nov 22: Clustering
- Read the handout: pages 412-413 and 461-464 of Hastie, Tibshirani,
Friedman.
- Read Tali Tishby and Eyal Krupka's NIPS paper.
- Dec 6: Alternative models of data
- Read Rick's and my review paper.
- My first Annals of Statistics paper was in this area. Guess what?
It was on regression! I've written several papers on this.
- Dec 8: Summary
General information
Some estimate that 4 exabytes of data are now being produced each
year. This is a different world from the one Fisher pioneered. He
developed a theory that can deal with a 2x2 contingency table, which
might have a total of 4 bytes of data in it. This 10^18-fold
increase in data is changing the world of statistics. The goal of
this course is to follow this change.
Exactly what data mining is depends on who you talk to. For example,
Andrew Moore takes a very wide view of data mining. He includes lovely
topics ranging from economics (e.g. game theory) to classical AI (e.g.
the A* algorithm). This contrasts with the approach I will take: I'll
focus much more heavily on statistical regression.
I've written a crude outline of what the course will cover.
Prerequisites
This course is targeted at PhD students. Some mathematical
sophistication will be assumed. You will be expected to carefully
read research papers. The primary statistics tool will be regression,
so at least a few weeks of background on that would be desirable. If
you are unsure, send me an email and we can chat.
Last modified: Thu Sep 25 13:00:07 EDT 2014