Don't forget to email me your group memberships and which day you want to talk on
Automatic forecasting
Existing methods:
Neural nets
C4.5 / decision trees
Boosting
All called "data mining."
Difference between data mining and statistics: if you are doing
data mining, you can charge millions; if you are doing
statistics, you can charge thousands.
Done by database experts and computer scientists
Can regression compete?
Can standard regression compete with data mining?
In its current form, it requires lots of hands-on control
Certainly for small, important data sets this is fine, but
what about data sets that are large but not worth that kind of hand care?
Everything must be handled:
heteroskedasticity
outliers
leverage points
influential points
curvature
synergies (i.e., interactions)
Independence? (NOPE! No statistical method can deal with this)
missing data
We think yes.
Bankruptcy example
The problem
a million people
several thousand bankruptcies
lots of useless variables (350 before interactions)
interested only in prediction
The methodology
sandwich estimator (deals with heteroskedasticity)
also called the White estimator
We modify it by using the fit from the previous round for the
SD estimates instead of the fit from this round (see the sketch after this list)
Use indicators for missing values
use Bonferroni for significance
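A minimal sketch of two of the ingredients above, the missing-value indicator columns and the White/sandwich standard errors, in Python. The data frame, column names, and outcome are made up for illustration, and the notes' modification (using the previous round's fit for the SD estimates) is not shown; this is the textbook HC0 version.

    # Sketch: missing-value indicators plus White (HC0) sandwich standard errors.
    import numpy as np
    import pandas as pd

    def add_missing_indicators(df, cols):
        # For each predictor, add a 0/1 "was missing" column and fill the hole with 0.
        out = df.copy()
        for c in cols:
            out[c + "_miss"] = out[c].isna().astype(float)
            out[c] = out[c].fillna(0.0)
        return out

    def ols_sandwich(X, y):
        # OLS fit; the sandwich covariance is (X'X)^-1 (sum e_i^2 x_i x_i') (X'X)^-1.
        X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
        y = np.asarray(y, float)
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        meat = X.T @ (X * resid[:, None] ** 2)
        cov = XtX_inv @ meat @ XtX_inv
        return beta, np.sqrt(np.diag(cov))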
Nickel summary: page 7, the lift chart. (Estimated on 20% of
the data, predicted the remaining 80%.)
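A sketch of how that lift chart is computed. Here p_hat and y_holdout are hypothetical arrays of predicted probabilities and actual 0/1 bankruptcies on the 80% holdout.

    # Sketch: lift curve on the holdout sample.
    import numpy as np

    def lift_curve(y_holdout, p_hat):
        # Sort the holdout by predicted risk, highest first, and track what share
        # of the actual bankruptcies is caught in the top fraction of the list.
        order = np.argsort(-np.asarray(p_hat, float))
        y_sorted = np.asarray(y_holdout, float)[order]
        frac_listed = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
        frac_caught = np.cumsum(y_sorted) / y_sorted.sum()
        return frac_listed, frac_caught   # plot frac_caught against frac_listed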
Bonferroni approximation: sqrt(2 log p)
We actually used what I call a better Bonferroni (but it
doesn't matter: it improves the out-of-sample fit by .03
percent. Note, I already multiplied by 100!)
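A quick numeric check of that approximation. The cutoff for a two-sided test at level alpha/p is norm.isf(alpha / (2p)); sqrt(2 log p) is its leading-order term. The values alpha = 0.05 and p = 350 (the pre-interaction count from the example) are assumptions for illustration only.

    # Sketch: exact Bonferroni z-cutoff vs. the sqrt(2 log p) approximation.
    import numpy as np
    from scipy.stats import norm

    alpha, p = 0.05, 350
    exact = norm.isf(alpha / (2 * p))    # two-sided test at level alpha/p
    approx = np.sqrt(2 * np.log(p))      # leading-order approximation
    print(round(exact, 2), round(approx, 2))   # about 3.80 vs 3.42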
Why use Bonferroni
Do I really need to discuss this??? You all understand it really well.
Sure, let's beat the dead horse
Page 31 shows the in-sample errors as we add variables.
Should we stop at 20? 40? 100? 200?
Answer on page 32. The usual Bonferroni stops at 12 variables, the better
Bonferroni at 39. Yes, the better one is better, but not by much!
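One hedged way to read that stopping rule in code: greedy forward selection that stops as soon as the best remaining |t|-statistic falls below the sqrt(2 log p) cutoff. This is a sketch of the idea, not the paper's exact algorithm, and the variable names are hypothetical.

    # Sketch: forward selection with a Bonferroni-style stopping rule.
    import numpy as np

    def forward_select(X, y, cutoff=None):
        X = np.asarray(X, float); y = np.asarray(y, float)
        n, p = X.shape
        if cutoff is None:
            cutoff = np.sqrt(2 * np.log(p))        # approximate Bonferroni threshold
        chosen, resid = [], y - y.mean()
        while len(chosen) < p:
            best_j, best_t = None, 0.0
            for j in range(p):
                if j in chosen:
                    continue
                x = X[:, j] - X[:, j].mean()
                sxx = x @ x
                if sxx == 0:
                    continue
                b = (x @ resid) / sxx              # slope of resid on candidate j
                e = resid - b * x
                se = np.sqrt((e @ e) / (n - 2) / sxx)
                t = abs(b) / se if se > 0 else 0.0
                if t > best_t:
                    best_j, best_t = j, t
            if best_j is None or best_t < cutoff:
                break                              # nothing left clears the Bonferroni bar
            chosen.append(best_j)
            x = X[:, best_j] - X[:, best_j].mean()
            resid = resid - ((x @ resid) / (x @ x)) * x   # sweep the chosen variable out
        return chosen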
Calibration
I've studied this theoretically in 3 papers, so I had to
discuss it here
Draw picture of calibrated curves and uncalibrated curves.
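A sketch of that picture's ingredients: bin the predicted probabilities, and in each bin compare the average prediction with the observed bankruptcy rate. A calibrated model sits on the 45-degree line; an uncalibrated one drifts off it. Array names are hypothetical.

    # Sketch: calibration curve by binning predicted probabilities.
    import numpy as np

    def calibration_points(y, p_hat, n_bins=10):
        y = np.asarray(y, float); p_hat = np.asarray(p_hat, float)
        edges = np.quantile(p_hat, np.linspace(0, 1, n_bins + 1))
        idx = np.digitize(p_hat, edges[1:-1])      # bin index 0..n_bins-1
        mean_pred, obs_rate = [], []
        for b in range(n_bins):
            mask = idx == b
            if mask.any():
                mean_pred.append(p_hat[mask].mean())   # average predicted probability
                obs_rate.append(y[mask].mean())        # actual fraction that went bankrupt
        return np.array(mean_pred), np.array(obs_rate)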