Last modified: Thu Nov 17 14:54:29 EST 2005
by Dean Foster
Statistical Data mining: LS using RKHS
Administrivia
- I'm still missing a fair number of homeworks
Least squares using RKHS
- In general, the RKHS spans the whole space (it can match any values at the observed points)
- Hence, y = y-hat at observed points
- How about at other points?
- Requires looking at constrained minimization: minimize ||beta||^2
  subject to ||y - X*beta||^2 <= target
- So how does it do for nearby points?
- Draw some polynomial pictures as examples (see the sketch below)
- YIKES! That's not good.
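- A minimal sketch of the problem (my own illustration, not from the notes); a
  degree-(n-1) polynomial stands in for an RKHS fit rich enough to interpolate:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 12
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

    # Degree n-1 polynomial: flexible enough that y-hat = y at every
    # observed x (the "spans the whole space" situation above).
    coefs = np.polyfit(x, y, deg=n - 1)
    print(np.max(np.abs(y - np.polyval(coefs, x))))      # essentially 0

    # At nearby points the fit typically swings far outside the data range.
    x_near = np.linspace(0, 1, 400)
    print(np.max(np.abs(np.polyval(coefs, x_near))))     # usually much larger than max|y|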
Regularization
- Take the Lagrangian of the above: minimize ||beta||^2 + lambda*(SSE - target)
- Equivalently: minimize SSE/sigma^2 + ||beta||^2/tau^2
- Sometimes called the MAP estimator (Maximum A Posteriori) in the
  Bayesian world view
- Suppose beta is Normal(0, tau^2)
- Joint distribution is: P(Y|beta)P(beta)
- Maximizing this over beta generates result
- Do the maximization: beta-hat = MLE/(1 + sigma^2/tau^2) (see the sketch after this list)
- Version of shrinkage.
- NOTE: the distance function (which norm gets penalized) makes a big difference
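- A minimal sketch of the MAP / ridge computation (my own illustration; sigma^2
  and tau^2 are treated as known):

    import numpy as np

    def ridge(X, y, sigma2, tau2):
        """Minimize SSE/sigma^2 + ||beta||^2/tau^2, i.e. ridge with lambda = sigma^2/tau^2."""
        lam = sigma2 / tau2
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(1)
    n, p, sigma2, tau2 = 50, 5, 1.0, 0.5
    X, _ = np.linalg.qr(rng.normal(size=(n, p)))             # orthonormal columns
    beta = rng.normal(scale=np.sqrt(tau2), size=p)           # beta ~ Normal(0, tau^2)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)              # least squares / MLE
    beta_map = ridge(X, y, sigma2, tau2)
    print(np.allclose(beta_map, beta_ls / (1 + sigma2 / tau2)))   # True: the shrinkage form above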
Optimal regularization
- The best tau is |beta| (the size of the true coefficient vector)
- PROOF: minimize E(alpha*Y - mu)^2 over the shrinkage factor alpha (written out after this list)
- Hence we need to estimate |beta|
- Bayesian method: Place a prior over it
- Unbiased estimate: ||beta-hat||^2 - p*sigma^2 (since E||beta-hat||^2 = ||beta||^2 + p*sigma^2)
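- A sketch of that proof for a single coordinate, with Y ~ N(mu, sigma^2) and
  shrinkage estimator alpha*Y:

    E(\alpha Y - \mu)^2
        = \alpha^2 \, E Y^2 - 2\alpha\mu^2 + \mu^2
        = \alpha^2 (\mu^2 + \sigma^2) - 2\alpha\mu^2 + \mu^2

    \frac{d}{d\alpha} = 0: \quad 2\alpha(\mu^2 + \sigma^2) - 2\mu^2 = 0
        \quad\Rightarrow\quad
        \alpha^\ast = \frac{\mu^2}{\mu^2 + \sigma^2}

  The MAP shrinkage factor above is 1/(1 + sigma^2/tau^2) = tau^2/(tau^2 + sigma^2),
  which equals alpha* exactly when tau^2 = mu^2; hence the best tau is |beta|.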
Regularizing over subspaces
- Y = beta X1 + gamma X2
- X1, X2 are both RKHSs or other large spaces
- No reason to believe beta and gamma are the same size
- So estimate the sizes separately (see the sketch after this list)
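- A minimal two-block sketch (my own illustration): give each block its own prior
  variance, i.e. minimize SSE/sigma^2 + ||beta||^2/tau1^2 + ||gamma||^2/tau2^2:

    import numpy as np

    def block_ridge(X1, X2, y, sigma2, tau1_sq, tau2_sq):
        """Ridge with a separate shrinkage level for each block of columns."""
        X = np.hstack([X1, X2])
        p1, p2 = X1.shape[1], X2.shape[1]
        # Diagonal penalty: sigma^2/tau1^2 on the first block, sigma^2/tau2^2 on the second.
        penalty = np.diag(np.concatenate([np.full(p1, sigma2 / tau1_sq),
                                          np.full(p2, sigma2 / tau2_sq)]))
        coef = np.linalg.solve(X.T @ X + penalty, X.T @ y)
        return coef[:p1], coef[p1:]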
Why not divide them up even more?
- As you divide the X's up into smaller groups, estimating the shrinkage
  parameter gets harder
- The quadratic error goes up by a factor of 1 + 1/d, where d is the number of
  dimensions in a bin
- The potential improvement is a factor of 1/k, where k is the number of bins
  (a rough worked example follows this list)
- This is the idea of Tony's "Block coding" in the wavelet domain
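- A rough worked example (my own numbers): with p = 100 variables split into
  k = 20 bins of d = 5, estimating each bin's shrinkage inflates the quadratic
  error by a factor of 1 + 1/5 = 1.2, while the potential gain when the signal
  is concentrated in a few bins is on the order of 1/k = 1/20. Pushing to d = 1
  (k = 100 bins) raises the inflation factor to 1 + 1/1 = 2, the constant-factor
  price echoed in the stepwise limit below.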
Limit: dividing up into single variables
- Stepwise regression once again (sketch below)
- Inefficient if there is strength to borrow from other variables
- But "only" off by a constant factor
dean@foster.net