Last modified: Tue Sep 27 12:16:53 EDT 2005
by Dean Foster
Statistical Data mining: High dimensions
First intuition: Always think p > n
Classical statistics holds p fixed and lets n go to infinity.
Short and fat data has p bigger than n. The natural limit is either n
fixed with p going to infinity, or both going to infinity.
How bad is our intuition about large dimensions?
- Mike Steele's example of the square and the circle.
- Theorem: All random vectors are approximately orthogonal.
- Consider each Xi a d-dimensional normal.
- If we have d of them, we span the d-dimensional space.
- But the (d+1)'st vector we add is still almost orthogonal
to each of the d vectors.
- Holds true up to exponentially many vectors (Bonus homework: prove
this!)
- Above theorem is used in quick proof of shrinkage. (Draw
picture to confuse students. Use Pythagoras's proof: Behold.)
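
A quick simulation can make the near-orthogonality theorem above concrete. This is only a sketch under stated assumptions, not part of the original notes: the vectors are taken to have i.i.d. standard normal coordinates, and the names cosine and cosines_with_new_vector are hypothetical helpers made up for illustration. It draws d vectors, adds one more, and reports the cosines between the extra vector and each of the first d. The cosines should shrink roughly like 1/sqrt(d).

```python
import math
import random


def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosines_with_new_vector(d, seed=0):
    """Draw d Gaussian vectors in d dimensions, then one extra vector.

    Returns the d cosines between the extra vector and each earlier one.
    Assumption: coordinates are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    first_d = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    extra = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [cosine(extra, x) for x in first_d]


if __name__ == "__main__":
    for d in (10, 100, 400):
        cosines = cosines_with_new_vector(d)
        mean_abs = sum(abs(c) for c in cosines) / len(cosines)
        max_abs = max(abs(c) for c in cosines)
        # Compare against the 1/sqrt(d) scale of a typical cosine.
        print(f"d={d:4d}  mean|cos|={mean_abs:.3f}  "
              f"max|cos|={max_abs:.3f}  1/sqrt(d)={1 / math.sqrt(d):.3f}")
```

Note that the extra vector lies in the span of the first d (they span the whole space almost surely), yet its cosine with any single one of them is small. That is the sense in which intuition from low dimensions fails.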
dean@foster.net