Last modified: Tue Sep 27 12:17:21 EDT 2005
by Dean Foster
Statistical Data Mining: Stepwise Regression
Administrivia
- Lots of readings for today (including a new one on the JL
  (Johnson-Lindenstrauss) theorem).
Regression vs nearest neighbor (yet again)
- Nearest neighbor picks up rich structure from a log(n)-sized space
- Regression picks up linear structure from an o(n)-sized space
- Nearest neighbor degenerates to noise if the space is too large
- Regression degenerates to "divide by zero" (a singular X'X) if the
  space is too large (see the sketch below)
- What to do?
- Pick a smaller space!
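- A minimal sketch of the "divide by zero" failure (assuming numpy; the
  values of n and p are made up): with more variables than observations,
  X'X cannot be full rank, so least squares has no unique solution.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50                     # observations
    for p in (5, 100):         # p = 100 > n: too many variables
        X = rng.standard_normal((n, p))
        rank = np.linalg.matrix_rank(X.T @ X)
        # rank(X'X) <= min(n, p); with p > n it is singular, and
        # inverting it is regression's "divide by zero".
        print(p, rank, "singular" if rank < p else "invertible")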
Best basis problem (best subsets regression)
- Goal: find a subspace that generates a good fit (brute force
  sketched below)
- Same thing is needed for both regression and Nearest neighbor
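- A brute-force sketch of best subsets (assuming numpy; r_squared and
  best_subset are made-up names, and the intercept is omitted for
  brevity): fit every size-q subset, keep the one with the highest R^2.
  That costs one regression per subset, which is what the complexity
  discussion below is about.

    import itertools
    import numpy as np

    def r_squared(X, y):
        # One least-squares fit (lstsq tolerates rank deficiency), then R^2.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

    def best_subset(X, y, q):
        # Exhaustive search: one regression per size-q subset of the
        # p columns, i.e. "p choose q" regressions in all.
        p = X.shape[1]
        return max(itertools.combinations(range(p), q),
                   key=lambda S: r_squared(X[:, list(S)], y))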
NP complete problems
- Review complexity via sorting algorithms: O(n^2) vs. O(2^n). (Use
  cards; toy operation counts after this list.)
- Can exponential problems be interesting?
- No, since we can never know whether the answer is correct.
- Yes, since sometimes checking is easier than deriving.
- This class of problems is called "NP".
- No known way to actually solve them quickly.
- Even quantum computers might not solve them quickly. (Open conjecture.)
- All good machine learning problems fall in this class.
- Aside: NP complete
- The whole class is either all easy, or all hard
- If one problem (say TSP, 3SAT, or stepwise regression) turns out
  to be easy, they all are.
- Sudoku is a good example of an NP-complete problem
- Millennium Prize (P vs. NP)
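- A toy illustration of the gap (operation counts only, not timings):
  sorting is roughly n^2 comparisons, while exhaustive subset search
  is 2^n fits.

    def insertion_sort(cards):
        # Sorting: at most n^2 comparisons; polynomial, hence "easy".
        for i in range(1, len(cards)):
            j = i
            while j > 0 and cards[j - 1] > cards[j]:
                cards[j - 1], cards[j] = cards[j], cards[j - 1]
                j -= 1
        return cards

    print(insertion_sort([3, 1, 4, 1, 5]))
    # Exhaustive subset search is 2^n; exponential, hence "hard".
    for n in (10, 20, 40, 80):
        print(n, n**2, 2**n)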
Best subsets regression is NP complete
- Must phrase it carefully
- Do there exist q < p variables such that the R^2 is bigger than x?
- Certainly this is polynomial-time checkable (do the regression; see
  the sketch after this list)
- Can it be found quickly?
- Try the best variable first? (Fails since it might lead down the
  wrong path)
- Add them all and eliminate useless ones (fails since p might be
  bigger than n)
- Proof of hardness: reduce a problem previously shown to be hard
  (e.g., 3SAT) to best subsets regression
- So if someone says, "I've got a fast way to solve best subsets
  regression," they shouldn't be telling statisticians; they should
  be telling the Clay Mathematics Institute.
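- The checker referred to above, as a sketch (verify_certificate is a
  made-up name; it reuses the r_squared helper from the earlier
  sketch): checking a proposed answer is one cheap regression.

    def verify_certificate(X, y, subset, x):
        # The "NP" part: verifying a claimed solution is one regression.
        # Finding a subset that passes, among all 2^p, is the hard part.
        return r_squared(X[:, list(subset)], y) > x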
"All interesting problems are impossible."
Greedy search
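- A minimal sketch of where this is headed (forward_stepwise is a
  made-up name; it assumes numpy and the r_squared helper above):
  greedily add, one at a time, the variable that most improves the
  fit. Roughly p*q regressions instead of 2^p, at the price of
  possibly heading down the wrong path.

    def forward_stepwise(X, y, q):
        # Greedy: repeatedly add the variable that raises R^2 most.
        chosen = []
        for _ in range(q):
            rest = (j for j in range(X.shape[1]) if j not in chosen)
            best = max(rest, key=lambda j: r_squared(X[:, chosen + [j]], y))
            chosen.append(best)
        return chosen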
dean@foster.net