Last modified: Thu Nov 17 14:54:29 EST 2005
by Dean Foster
Statistical Data mining: LS using RKHS
Administrivia
- I'm still missing a fair number of homeworks
Least squares using RKHS
- In general, the RKHS spans the whole space (it can match any values at the observed points)
- Hence, y = y-hat at observed points
- How about at other points?
- Requires looking at constrained minimization: minimize ||beta||^2
  subject to ||y - X*beta||^2 <= target
- So how does it do for nearby points?
- Draw some polynomial pictures as examples (see the sketch below)
- YIKES! That's not good.
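- A minimal sketch of the problem (my own illustration, not from the notes); a
  degree-(n-1) polynomial stands in for an RKHS fit rich enough to interpolate:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 12
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

    # Degree n-1 polynomial: flexible enough that y-hat = y at every
    # observed x (the "spans the whole space" situation above).
    coefs = np.polyfit(x, y, deg=n - 1)
    print(np.max(np.abs(y - np.polyval(coefs, x))))      # essentially 0

    # At nearby points the fit typically swings far outside the data range.
    x_near = np.linspace(0, 1, 400)
    print(np.max(np.abs(np.polyval(coefs, x_near))))     # usually much larger than max|y|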
Regularization
- Take the Lagrangian of the above: minimize ||beta||^2 + lambda*(SSE - target)
- Equivalently: minimize SSE/sigma^2 + ||beta||^2/tau^2
- Sometimes called the MAP estimator (Maximum A Posteriori) in the
  Bayesian world view
- Suppose beta is Normal(0, tau^2)
- Joint distribution is: P(Y|beta)P(beta)
- Maximizing this over beta generates result
- Do the maximization: beta-hat = MLE/(1 + sigma^2/tau^2) (see the sketch after this list)
- Version of shrinkage.
- NOTE: the distance function (which norm gets penalized) makes a big difference
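- A minimal sketch of the MAP / ridge computation (my own illustration; sigma^2
  and tau^2 are treated as known):

    import numpy as np

    def ridge(X, y, sigma2, tau2):
        """Minimize SSE/sigma^2 + ||beta||^2/tau^2, i.e. ridge with lambda = sigma^2/tau^2."""
        lam = sigma2 / tau2
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(1)
    n, p, sigma2, tau2 = 50, 5, 1.0, 0.5
    X, _ = np.linalg.qr(rng.normal(size=(n, p)))             # orthonormal columns
    beta = rng.normal(scale=np.sqrt(tau2), size=p)           # beta ~ Normal(0, tau^2)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)              # least squares / MLE
    beta_map = ridge(X, y, sigma2, tau2)
    print(np.allclose(beta_map, beta_ls / (1 + sigma2 / tau2)))   # True: the shrinkage form above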
Optimal regularization
- The best tau is |beta| (the size of the true coefficient vector)
- PROOF: minimize E(alpha*Y - mu)^2 over the shrinkage factor alpha (written out after this list)
- Hence we need to estimate |beta|
- Bayesian method: Place a prior over it
- Unbiased estimate: ||beta-hat||^2 - p*sigma^2 (since E||beta-hat||^2 = ||beta||^2 + p*sigma^2)
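- A sketch of that proof for a single coordinate, with Y ~ N(mu, sigma^2) and
  shrinkage estimator alpha*Y:

    E(\alpha Y - \mu)^2
        = \alpha^2 \, E Y^2 - 2\alpha\mu^2 + \mu^2
        = \alpha^2 (\mu^2 + \sigma^2) - 2\alpha\mu^2 + \mu^2

    \frac{d}{d\alpha} = 0: \quad 2\alpha(\mu^2 + \sigma^2) - 2\mu^2 = 0
        \quad\Rightarrow\quad
        \alpha^\ast = \frac{\mu^2}{\mu^2 + \sigma^2}

  The MAP shrinkage factor above is 1/(1 + sigma^2/tau^2) = tau^2/(tau^2 + sigma^2),
  which equals alpha* exactly when tau^2 = mu^2; hence the best tau is |beta|.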
Regularizing over subspaces
- Y = beta X1 + gamma X2
- X1, X2 are both RKHSs or other large spaces
- No reason to believe beta and gamma are the same size
- So estimate the sizes separately (see the sketch after this list)
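- A minimal two-block sketch (my own illustration): give each block its own prior
  variance, i.e. minimize SSE/sigma^2 + ||beta||^2/tau1^2 + ||gamma||^2/tau2^2:

    import numpy as np

    def block_ridge(X1, X2, y, sigma2, tau1_sq, tau2_sq):
        """Ridge with a separate shrinkage level for each block of columns."""
        X = np.hstack([X1, X2])
        p1, p2 = X1.shape[1], X2.shape[1]
        # Diagonal penalty: sigma^2/tau1^2 on the first block, sigma^2/tau2^2 on the second.
        penalty = np.diag(np.concatenate([np.full(p1, sigma2 / tau1_sq),
                                          np.full(p2, sigma2 / tau2_sq)]))
        coef = np.linalg.solve(X.T @ X + penalty, X.T @ y)
        return coef[:p1], coef[p1:]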
Why not divide them up even more?
- As you divide the X's up into smaller groups, estimating the shrinkage
  parameter gets harder
- The quadratic error goes up by a factor of 1 + 1/d, where d is the number of
  dimensions in a bin
- The potential improvement is a factor of 1/k, where k is the number of bins
  (a rough worked example follows this list)
- This is the idea of Tony's "Block coding" in the wavelet domain
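- A rough worked example (my own numbers): with p = 100 variables split into
  k = 20 bins of d = 5, estimating each bin's shrinkage inflates the quadratic
  error by a factor of 1 + 1/5 = 1.2, while the potential gain when the signal
  is concentrated in a few bins is on the order of 1/k = 1/20. Pushing to d = 1
  (k = 100 bins) raises the inflation factor to 1 + 1/1 = 2, the constant-factor
  price echoed in the stepwise limit below.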
Limit: dividing up into single variables
- Stepwise regression once again (sketch below)
- Inefficient if there is strength to borrow from other variables
- But "only" off by a constant factor
dean@foster.net