Last modified: Tue Sep 27 12:16:59 EDT 2005 by Dean Foster

# Statistical Data Mining: High Dimensions (Part 2)

• New homework 1

## Lecture overview

• Clearly regression doesn't extend to p > n
• How about nearest neighbor?
• This will be the meat of today's lecture

## Model

Model: the truth is a linear function, i.e. Y = X beta.
• No error.
• So the theorem from last time says nearest neighbor should be perfect.
• Certainly true under classical statistical asymptotics (d fixed, n → ∞).
• Draw the one-dimensional picture (see the sketch below).
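A minimal numerical sketch of the one-dimensional picture, using numpy; the standard normal X, the value of beta, and the sample size are my choices, not from the notes. With Y = X beta and no error, the nearest neighbor's response is essentially the truth once n is moderate.

```python
# Sketch of the 1-d picture: Y = X * beta with no error, p = 1.
# Assumptions (mine): standard normal X, beta = 2, n = 1000.
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0
n = 1000

x = rng.standard_normal(n)           # training inputs
y = x * beta                         # noiseless responses
x0 = rng.standard_normal()           # new query point

nearest = np.argmin(np.abs(x - x0))  # index of the nearest training point
print("prediction:", y[nearest], "truth:", x0 * beta)
```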

## 2-d example

• 2-d plot: the X axis carries the signal, the Y axis carries no signal.
• Three cases: Y has no spread, Y has spread similar to X, Y is very spread out.
• In the very-spread-out case, nearest neighbor is no better than chance at getting a good fit (see the sketch below).
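A hedged simulation of the three cases, with numpy; the spread levels (0, 1, 100) and the sample size are my choices. The first coordinate carries the signal, the second is irrelevant, and nearest neighbor uses both.

```python
# Three cases for the irrelevant Y coordinate: no spread, similar spread,
# very spread out.  In the last case the nearest neighbor's prediction
# error is no better than chance.
import numpy as np

rng = np.random.default_rng(1)
n, beta = 200, 2.0

for spread in [0.0, 1.0, 100.0]:
    x1 = rng.standard_normal(n)              # signal coordinate
    x2 = spread * rng.standard_normal(n)     # irrelevant coordinate
    y = beta * x1                            # truth depends only on x1

    q1 = rng.standard_normal()               # query point
    q2 = spread * rng.standard_normal()
    dist = (x1 - q1) ** 2 + (x2 - q2) ** 2   # squared distance in both coordinates
    pred = y[np.argmin(dist)]
    print(f"spread = {spread:6.1f}   |prediction error| = {abs(pred - beta * q1):.3f}")
```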

## 2-d heuristic for d-dimensional problem

• 2-d plot: the X axis is the signal, the Y axis shows the orthogonal component.
• Most points are at distance about sqrt(d) from each other along the Y axis.
• Suppose the 2-d plot were correct. Draw the picture. If n < sqrt(d), nearest neighbor is no better than chance at getting a good fit (see the sketch below).
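A small check of the heuristic, again with numpy and with sizes of my choosing: for points with i.i.d. standard normal coordinates, the separation along the single signal coordinate is O(1), while the separation in the orthogonal directions is on the order of sqrt(d), so the orthogonal component dominates the distance.

```python
# Compare separation along the signal axis to separation in the
# orthogonal directions.  Assumptions (mine): coordinate 0 is the signal,
# all coordinates are i.i.d. standard normal, n = 100, d = 1000.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 1000

X = rng.standard_normal((n, d))
signal_gap = np.abs(X[1:, 0] - X[0, 0])                    # along the signal axis: O(1)
orth_gap = np.linalg.norm(X[1:, 1:] - X[0, 1:], axis=1)    # orthogonal part: order sqrt(d)

print("median signal gap    :", np.median(signal_gap))
print("median orthogonal gap:", np.median(orth_gap))
```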

## How close are nearest neighbors?

• The squared distance to the closest point is about d - sqrt(4 d log n).
• Proof: write the squared distance to one point as a sum of d squared coordinate differences.
• The squared distance is then approximately normal,
• with mean d and variance 2d.
• Normal tails say the most extreme of n draws is about sqrt(2 log n) SDs from the mean, so the minimum is about d - sqrt(2 log n) * sqrt(2d) = d - sqrt(4 d log n).
• The nearest neighbor is "close" only if d is approximately log n, i.e. n is about e^d. (The CLT breaks down there, so the approximation above fails, which is how negative distances are avoided.) See the numerical check below.

## Intuition via Johnson-Lindenstrauss lemma

• In 1984, Johnson and Lindenstrauss proved that you can approximately preserve the distances between n points living in d dimensions using only about log(n) dimensions.
• Since nearest neighbor only needs these distances, it can be run in the projected space.
• These log(n) dimensions can be picked randomly. (See Dimitris Achlioptas' paper.)
• Each of these log(n) random coordinates is nearly orthogonal to the "true" direction beta.
• So the truth cannot be reconstructed by any method (sketch below).
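A hedged sketch of a Johnson-Lindenstrauss style random projection with numpy. For simplicity it uses a dense Gaussian projection rather than Achlioptas' sparse construction, and the target dimension k (a constant times log n) is my choice.

```python
# Project n points from d dimensions down to k ~ log(n) dimensions with a
# random matrix and check that a pairwise distance is roughly preserved.
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 2000
k = int(8 * np.log(n))                          # target dimension, about log(n)

X = rng.standard_normal((n, d))                 # n points in d dimensions
R = rng.standard_normal((d, k)) / np.sqrt(k)    # random projection matrix
Z = X @ R                                       # projected points, n x k

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Z[0] - Z[1])
print("original distance :", orig)
print("projected distance:", proj)              # close, up to ~1/sqrt(k) relative error
```

Since each projected coordinate is a random direction in d dimensions, it is nearly orthogonal to beta; this is the sense in which the projection preserves the distances but not the truth.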