Don't forget to email me your group memberships and which day you want to talk on
Automatic forecasting
Existing methods:
Neural nets
C4.5 / decision trees
Boosting
All called "data mining."
Difference between data mining and statistics: if you are doing
data mining, you can charge millions; if you are doing
statistics, you can charge thousands.
Done by database experts and computer scientists
Can regression compete?
Can standard regression compete with data mining?
In its current form, it requires lots of hands-on control
Certainly for small, important data sets this is fine, but
what about data sets that are large but not worth that kind of hand care?
Everything must be handled:
heteroskedasticity
outliers
leverage points
influential points
curvature
synergies (i.e., interactions)
Independence? (NOPE! No statistical method can deal with this)
missing data
We think yes.
Bankruptcy example
The problem
a million people
several thousand bankruptcies
lots of useless variables (350 before interactions)
interested only in prediction
The methodology
sandwich estimator (deals with heteroskedasticity)
also called the White estimator
We modify it by using the fit from the previous round for the
SD estimates instead of the fit from this round (see the sketch after this list)
Use indicators for missing values
use Bonferroni for significance
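A minimal sketch of two of the ingredients above, the missing-value indicator columns and the White/sandwich standard errors, in Python. The data frame, column names, and outcome are made up for illustration, and the notes' modification (using the previous round's fit for the SD estimates) is not shown; this is the textbook HC0 version.

    # Sketch: missing-value indicators plus White (HC0) sandwich standard errors.
    import numpy as np
    import pandas as pd

    def add_missing_indicators(df, cols):
        # For each predictor, add a 0/1 "was missing" column and fill the hole with 0.
        out = df.copy()
        for c in cols:
            out[c + "_miss"] = out[c].isna().astype(float)
            out[c] = out[c].fillna(0.0)
        return out

    def ols_sandwich(X, y):
        # OLS fit; the sandwich covariance is (X'X)^-1 (sum e_i^2 x_i x_i') (X'X)^-1.
        X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
        y = np.asarray(y, float)
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        meat = X.T @ (X * resid[:, None] ** 2)
        cov = XtX_inv @ meat @ XtX_inv
        return beta, np.sqrt(np.diag(cov))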
Nickel summary: page 7, the lift chart. (Estimated on 20% of
the data, predicted the remaining 80%.)
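A sketch of how that lift chart is computed. Here p_hat and y_holdout are hypothetical arrays of predicted probabilities and actual 0/1 bankruptcies on the 80% holdout.

    # Sketch: lift curve on the holdout sample.
    import numpy as np

    def lift_curve(y_holdout, p_hat):
        # Sort the holdout by predicted risk, highest first, and track what share
        # of the actual bankruptcies is caught in the top fraction of the list.
        order = np.argsort(-np.asarray(p_hat, float))
        y_sorted = np.asarray(y_holdout, float)[order]
        frac_listed = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
        frac_caught = np.cumsum(y_sorted) / y_sorted.sum()
        return frac_listed, frac_caught   # plot frac_caught against frac_listed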
Bonferroni approximation: sqrt(2 log p)
We actually used what I call a better Bonferroni (but it
doesn't matter: it improves the out-of-sample fit by .03
percent. Note, I already multiplied by 100!)
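A quick numeric check of that approximation. The cutoff for a two-sided test at level alpha/p is norm.isf(alpha / (2p)); sqrt(2 log p) is its leading-order term. The values alpha = 0.05 and p = 350 (the pre-interaction count from the example) are assumptions for illustration only.

    # Sketch: exact Bonferroni z-cutoff vs. the sqrt(2 log p) approximation.
    import numpy as np
    from scipy.stats import norm

    alpha, p = 0.05, 350
    exact = norm.isf(alpha / (2 * p))    # two-sided test at level alpha/p
    approx = np.sqrt(2 * np.log(p))      # leading-order approximation
    print(round(exact, 2), round(approx, 2))   # about 3.80 vs 3.42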
Why use Bonferroni
Do I really need to discuss this??? You all understand it really well.
Sure, let's beat the dead horse
Page 31 shows the in-sample errors as we add variables.
Should we stop at 20? 40? 100? 200?
Answer on page 32. The usual Bonferroni stops at 12 variables, the better
Bonferroni at 39. Yes, the better one is better, but not by much!
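One hedged way to read that stopping rule in code: greedy forward selection that stops as soon as the best remaining |t|-statistic falls below the sqrt(2 log p) cutoff. This is a sketch of the idea, not the paper's exact algorithm, and the variable names are hypothetical.

    # Sketch: forward selection with a Bonferroni-style stopping rule.
    import numpy as np

    def forward_select(X, y, cutoff=None):
        X = np.asarray(X, float); y = np.asarray(y, float)
        n, p = X.shape
        if cutoff is None:
            cutoff = np.sqrt(2 * np.log(p))        # approximate Bonferroni threshold
        chosen, resid = [], y - y.mean()
        while len(chosen) < p:
            best_j, best_t = None, 0.0
            for j in range(p):
                if j in chosen:
                    continue
                x = X[:, j] - X[:, j].mean()
                sxx = x @ x
                if sxx == 0:
                    continue
                b = (x @ resid) / sxx              # slope of resid on candidate j
                e = resid - b * x
                se = np.sqrt((e @ e) / (n - 2) / sxx)
                t = abs(b) / se if se > 0 else 0.0
                if t > best_t:
                    best_j, best_t = j, t
            if best_j is None or best_t < cutoff:
                break                              # nothing left clears the Bonferroni bar
            chosen.append(best_j)
            x = X[:, best_j] - X[:, best_j].mean()
            resid = resid - ((x @ resid) / (x @ x)) * x   # sweep the chosen variable out
        return chosen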
Calibration
I've studied this theoretically in 3 papers, so I had to
discuss it here
Draw picture of calibrated curves and uncalibrated curves.
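A sketch of that picture's ingredients: bin the predicted probabilities, and in each bin compare the average prediction with the observed bankruptcy rate. A calibrated model sits on the 45-degree line; an uncalibrated one drifts off it. Array names are hypothetical.

    # Sketch: calibration curve by binning predicted probabilities.
    import numpy as np

    def calibration_points(y, p_hat, n_bins=10):
        y = np.asarray(y, float); p_hat = np.asarray(p_hat, float)
        edges = np.quantile(p_hat, np.linspace(0, 1, n_bins + 1))
        idx = np.digitize(p_hat, edges[1:-1])      # bin index 0..n_bins-1
        mean_pred, obs_rate = [], []
        for b in range(n_bins):
            mask = idx == b
            if mask.any():
                mean_pred.append(p_hat[mask].mean())   # average predicted probability
                obs_rate.append(y[mask].mean())        # actual fraction that went bankrupt
        return np.array(mean_pred), np.array(obs_rate)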