Last modified: Tue Sep 27 12:16:59 EDT 2005
by Dean Foster
Statistical Data mining: High dimensions (part 2)
Administrivia
Lecture overview
- Clearly, regression doesn't extend to p > n (see the sketch after this list)
- How about nearest neighbor?
- This will be the meat of today's lecture
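
A minimal numpy sketch of why least squares breaks when p > n (the data and dimensions below are invented purely for illustration): the design matrix has rank at most n, so X'X is singular and there are infinitely many coefficient vectors that fit the training data exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 100                        # more predictors than observations
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[0] = 1.0                         # only the first coordinate matters
    y = X @ beta                          # noiseless linear truth

    # X'X is p x p but has rank at most n, so least squares has no unique solution.
    print(np.linalg.matrix_rank(X.T @ X))     # 20, far short of p = 100

    # The minimum-norm interpolator fits the training data exactly,
    # yet it is only one of infinitely many such solutions and need not recover beta.
    beta_hat = np.linalg.pinv(X) @ y
    print(np.allclose(X @ beta_hat, y))       # True: zero training error
    print(np.linalg.norm(beta_hat - beta))    # but beta_hat is not beta
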
Model
Model: the truth is a linear function, i.e. Y = X beta.
- No error.
- So the theorem from last time says nearest neighbor should be perfect.
- Certainly true under classical statistical asymptotics (d fixed, n growing)
- draw one-d picture (a small simulation is sketched below)
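
A minimal sketch of the one-dimensional picture, assuming the noiseless truth y = 3x on the unit interval (the slope, interval, and sample sizes are arbitrary illustrative choices): with a single predictor, the nearest training point is very close to any query, so 1-nearest-neighbor predictions become essentially perfect as n grows.

    import numpy as np

    rng = np.random.default_rng(1)

    def nn_error_1d(n, n_test=1000):
        """Mean absolute 1-NN error for the noiseless truth y = 3x in one dimension."""
        x_train = rng.uniform(0, 1, n)
        y_train = 3.0 * x_train
        x_test = rng.uniform(0, 1, n_test)
        # index of the nearest training point for each test point
        idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
        return np.mean(np.abs(y_train[idx] - 3.0 * x_test))

    for n in (10, 100, 1000):
        print(n, nn_error_1d(n))              # error shrinks roughly like 1/n
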
2-d example
- 2-d plot. X axis has signal, Y axis has no signal
- three cases for the Y axis: no spread, spread similar to X, very spread out
- In the very spread out case, nearest neighbor is no better than chance at getting a good fit (a numeric sketch follows this list).
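
A rough numeric version of the 2-D picture (no plot; the three spreads and sample sizes are made-up choices): the X coordinate carries the signal, the Y coordinate carries none, and as the Y spread grows the nearest neighbor in the plane is chosen almost entirely by the irrelevant coordinate.

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_test = 200, 2000

    def nn_mse_2d(noise_sd):
        """1-NN test MSE when the truth is the signal coordinate alone."""
        X = np.hstack([rng.standard_normal((n, 1)),
                       noise_sd * rng.standard_normal((n, 1))])
        y = X[:, 0]                            # truth depends only on the signal axis
        Q = np.hstack([rng.standard_normal((n_test, 1)),
                       noise_sd * rng.standard_normal((n_test, 1))])
        d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        return np.mean((y[idx] - Q[:, 0]) ** 2)

    # no spread, similar spread, very spread out
    for sd in (0.0, 1.0, 10.0):
        print(sd, nn_mse_2d(sd))               # MSE climbs toward chance (about 2) as sd grows
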
2-d heuristic for d-dimensional problem
- 2-d plot. X axis is signal, Y axis shows the orthogonal component
- most points are at distance about sqrt(d) from each other along the Y axis
- Suppose the 2-d plot were correct. Draw the picture. If n < sqrt(d), nearest neighbor is no better than chance at getting a good fit (a d-dimensional simulation follows).
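
The same experiment in d dimensions, as a hedged sketch (standard normal coordinates and these particular n, d values are assumptions for illustration): only the first coordinate carries signal, the other d - 1 are orthogonal noise at the same scale, and the 1-NN fit degrades to chance as d grows with n fixed.

    import numpy as np

    rng = np.random.default_rng(3)
    n, n_test = 200, 1000

    def nn_mse(d):
        """1-NN test MSE when only the first of d coordinates carries the signal."""
        X = rng.standard_normal((n, d))
        y = X[:, 0]
        Q = rng.standard_normal((n_test, d))
        # squared distances via ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x
        d2 = (Q**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * Q @ X.T
        idx = d2.argmin(axis=1)
        return np.mean((y[idx] - Q[:, 0]) ** 2)

    for d in (1, 10, 100, 1000):
        print(d, nn_mse(d))                    # approaches the chance level of about 2
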
How close are nearest neighbors?
- distance squared to closest point is about d - sqrt(4 d log n)
- Proof: write as sums of squares
- the squared distance is then approximately normal (by the CLT)
- mean = d, variance = 2d.
- Tails of a normal say the extreme value among n draws is about sqrt(2 log n) SDs from the mean, so the minimum squared distance is about d - sqrt(2d) sqrt(2 log n) = d - sqrt(4 d log n)
- "close" if d is approximately equal to log n. (CLT breaks down
here, so above approximation fails and thus avoids negative distances.)
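
A quick Monte Carlo check of that approximation, under the assumption that the n candidate points are independent N(0, I_d) and the query sits at the origin (so each squared distance is chi-squared with d degrees of freedom):

    import numpy as np

    rng = np.random.default_rng(4)

    def min_sq_dist(d, n, reps=50):
        """Average minimum squared distance from the origin to n N(0, I_d) points."""
        vals = [(rng.standard_normal((n, d)) ** 2).sum(axis=1).min() for _ in range(reps)]
        return float(np.mean(vals))

    n = 1000
    for d in (100, 1000, 3000):
        approx = d - np.sqrt(4 * d * np.log(n))
        # rough agreement: both show a gap of order sqrt(d log n) below d,
        # though the constant is off where the normal approximation is crude
        print(d, round(min_sq_dist(d, n), 1), round(approx, 1))
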
Intuition via Johnson-Lindenstrauss lemma
- In 1984, Johnson and Lindenstrauss proved that the distances between n points living in d dimensions can be approximately preserved using only about log(n) dimensions.
- Since nearest neighbor only needs these distances, it can be so
approximated.
- These log(n) dimensions can be picked randomly. (See Dimitris
Achlioptas' paper)
- Each of these log(n) random coordinates is approximately orthogonal to the "true" coefficient vector
- So the truth cannot be reconstructed from them by any method (a sketch follows).
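
A small random-projection sketch in the spirit of the lemma (the Gaussian projection, the target dimension k, and the data are illustrative choices; Achlioptas' paper shows even simpler +/-1 entries work): pairwise distances survive the projection to k dimensions roughly intact, yet every projected coordinate is nearly orthogonal to a fixed "true" direction, so the signal itself is essentially lost.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(5)
    n, d, k = 200, 10000, 100              # k grows like log(n); 100 is an arbitrary choice

    X = rng.standard_normal((n, d))
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # random Gaussian projection
    Z = X @ R

    # Pairwise distances are roughly preserved (distortion shrinks as k grows).
    ratios = [np.linalg.norm(Z[i] - Z[j]) / np.linalg.norm(X[i] - X[j])
              for i, j in combinations(range(30), 2)]
    print(min(ratios), max(ratios))                # both near 1

    # But each projected coordinate is nearly orthogonal to the "true" direction e_1.
    beta = np.zeros(d)
    beta[0] = 1.0
    cosines = (R.T @ beta) / np.linalg.norm(R, axis=0)
    print(np.abs(cosines).max())                   # small, on the order of 1/sqrt(d)
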
Readings:
- The
Johnson-Lindenstrauss lemma says that any nearest neighbor
algorithm can be reduced to using only log(n) dimensions. This can be
viewed as a good thing (it speeds up nearest neighbor calculations).
Or it can be viewed as a bad thing (it shows how little information
nearest neighbor actually ends up using).
- For another gentle introduction, see Dimitris Achlioptas' paper, "Database-friendly random projections".
dean@foster.net