Last modified: Tue Sep 27 12:16:59 EDT 2005
by Dean Foster
Statistical Data mining: High dimensions (part 2)
Administrivia
Lecture overview
- Clearly, regression doesn't extend to p > n (see the sketch after this list)
- How about nearest neighbor?
- This will be the meat of today's lecture
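
A minimal numpy sketch of why least squares breaks when p > n (the data and dimensions below are invented purely for illustration): the design matrix has rank at most n, so X'X is singular and there are infinitely many coefficient vectors that fit the training data exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 100                        # more predictors than observations
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[0] = 1.0                         # only the first coordinate matters
    y = X @ beta                          # noiseless linear truth

    # X'X is p x p but has rank at most n, so least squares has no unique solution.
    print(np.linalg.matrix_rank(X.T @ X))     # 20, far short of p = 100

    # The minimum-norm interpolator fits the training data exactly,
    # yet it is only one of infinitely many such solutions and need not recover beta.
    beta_hat = np.linalg.pinv(X) @ y
    print(np.allclose(X @ beta_hat, y))       # True: zero training error
    print(np.linalg.norm(beta_hat - beta))    # but beta_hat is not beta
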
Model
Model: the truth is a linear function, i.e. Y = X beta.
- No error.
- So the theorem from last time says nearest neighbor should be perfect.
- Certainly true under classical statistical asymptotics (d fixed, n growing)
- draw one-d picture (a small simulation is sketched below)
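
A minimal sketch of the one-dimensional picture, assuming the noiseless truth y = 3x on the unit interval (the slope, interval, and sample sizes are arbitrary illustrative choices): with a single predictor, the nearest training point is very close to any query, so 1-nearest-neighbor predictions become essentially perfect as n grows.

    import numpy as np

    rng = np.random.default_rng(1)

    def nn_error_1d(n, n_test=1000):
        """Mean absolute 1-NN error for the noiseless truth y = 3x in one dimension."""
        x_train = rng.uniform(0, 1, n)
        y_train = 3.0 * x_train
        x_test = rng.uniform(0, 1, n_test)
        # index of the nearest training point for each test point
        idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
        return np.mean(np.abs(y_train[idx] - 3.0 * x_test))

    for n in (10, 100, 1000):
        print(n, nn_error_1d(n))              # error shrinks roughly like 1/n
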
2-d example
- 2-d plot. X axis has signal, Y axis has no signal
- three cases for the Y axis: no spread, spread similar to X, very spread out
- In the very spread out case, nearest neighbor is no better than chance at getting a good fit (a numeric sketch follows this list).
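
A rough numeric version of the 2-D picture (no plot; the three spreads and sample sizes are made-up choices): the X coordinate carries the signal, the Y coordinate carries none, and as the Y spread grows the nearest neighbor in the plane is chosen almost entirely by the irrelevant coordinate.

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_test = 200, 2000

    def nn_mse_2d(noise_sd):
        """1-NN test MSE when the truth is the signal coordinate alone."""
        X = np.hstack([rng.standard_normal((n, 1)),
                       noise_sd * rng.standard_normal((n, 1))])
        y = X[:, 0]                            # truth depends only on the signal axis
        Q = np.hstack([rng.standard_normal((n_test, 1)),
                       noise_sd * rng.standard_normal((n_test, 1))])
        d2 = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        return np.mean((y[idx] - Q[:, 0]) ** 2)

    # no spread, similar spread, very spread out
    for sd in (0.0, 1.0, 10.0):
        print(sd, nn_mse_2d(sd))               # MSE climbs toward chance (about 2) as sd grows
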
2-d heuristic for d-dimensional problem
- 2-d plot. X axis is signal, Y axis shows the orthogonal component
- most points are at distance about sqrt(d) from each other along the Y axis
- Suppose the 2-d plot were correct. Draw the picture. If n < sqrt(d), nearest neighbor is no better than chance at getting a good fit (a d-dimensional simulation follows).
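
The same experiment in d dimensions, as a hedged sketch (standard normal coordinates and these particular n, d values are assumptions for illustration): only the first coordinate carries signal, the other d - 1 are orthogonal noise at the same scale, and the 1-NN fit degrades to chance as d grows with n fixed.

    import numpy as np

    rng = np.random.default_rng(3)
    n, n_test = 200, 1000

    def nn_mse(d):
        """1-NN test MSE when only the first of d coordinates carries the signal."""
        X = rng.standard_normal((n, d))
        y = X[:, 0]
        Q = rng.standard_normal((n_test, d))
        # squared distances via ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x
        d2 = (Q**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * Q @ X.T
        idx = d2.argmin(axis=1)
        return np.mean((y[idx] - Q[:, 0]) ** 2)

    for d in (1, 10, 100, 1000):
        print(d, nn_mse(d))                    # approaches the chance level of about 2
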
How close are nearest neighbors?
- distance squared to closest point is about d - sqrt(4 d log n)
- Proof: write as sums of squares
- the squared distance is then approximately normal (by the CLT)
- mean = d, variance = 2d.
- Tails of a normal say the extreme value among n draws is about sqrt(2 log n) SDs from the mean, so the minimum squared distance is about d - sqrt(2d) sqrt(2 log n) = d - sqrt(4 d log n)
- "close" if d is approximately equal to log n. (CLT breaks down
here, so above approximation fails and thus avoids negative distances.)
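
A quick Monte Carlo check of that approximation, under the assumption that the n candidate points are independent N(0, I_d) and the query sits at the origin (so each squared distance is chi-squared with d degrees of freedom):

    import numpy as np

    rng = np.random.default_rng(4)

    def min_sq_dist(d, n, reps=50):
        """Average minimum squared distance from the origin to n N(0, I_d) points."""
        vals = [(rng.standard_normal((n, d)) ** 2).sum(axis=1).min() for _ in range(reps)]
        return float(np.mean(vals))

    n = 1000
    for d in (100, 1000, 3000):
        approx = d - np.sqrt(4 * d * np.log(n))
        # rough agreement: both show a gap of order sqrt(d log n) below d,
        # though the constant is off where the normal approximation is crude
        print(d, round(min_sq_dist(d, n), 1), round(approx, 1))
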
Intuition via Johnson-Lindenstrauss lemma
- In 1984, Johnson and Lindenstrauss proved that the distances between n points living in d dimensions can be approximately preserved using only about log(n) dimensions.
- Since nearest neighbor only needs these distances, it can be so
approximated.
- These log(n) dimensions can be picked randomly. (See Dimitris
Achlioptas' paper)
- Each of these log(n) random coordinates is approximately orthogonal to the "true" coefficient vector
- So the truth cannot be reconstructed from them by any method (a sketch follows).
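
A small random-projection sketch in the spirit of the lemma (the Gaussian projection, the target dimension k, and the data are illustrative choices; Achlioptas' paper shows even simpler +/-1 entries work): pairwise distances survive the projection to k dimensions roughly intact, yet every projected coordinate is nearly orthogonal to a fixed "true" direction, so the signal itself is essentially lost.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(5)
    n, d, k = 200, 10000, 100              # k grows like log(n); 100 is an arbitrary choice

    X = rng.standard_normal((n, d))
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # random Gaussian projection
    Z = X @ R

    # Pairwise distances are roughly preserved (distortion shrinks as k grows).
    ratios = [np.linalg.norm(Z[i] - Z[j]) / np.linalg.norm(X[i] - X[j])
              for i, j in combinations(range(30), 2)]
    print(min(ratios), max(ratios))                # both near 1

    # But each projected coordinate is nearly orthogonal to the "true" direction e_1.
    beta = np.zeros(d)
    beta[0] = 1.0
    cosines = (R.T @ beta) / np.linalg.norm(R, axis=0)
    print(np.abs(cosines).max())                   # small, on the order of 1/sqrt(d)
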
Readings:
- The
Johnson-Lindenstrauss lemma says that any nearest neighbor
algorithm can be reduced to using only log(n) dimensions. This can be
viewed as a good thing (it speeds up nearest neighbor calculations).
Or it can be viewed as a bad thing (it shows how little information
nearest neighbor actually ends up using).
- For another gentle introduction, see Dimitris Achlioptas' paper, "Database-friendly random projections".
dean@foster.net