## Least squares using RKHS

• You say MAP, I say LS.
• Optimizing the MAP objective is a form of LS:
• as functions: minimize $\sum_i (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2$
• as regression (writing $f = \sum_i \alpha_i k(\cdot, x_i)$ by the representer theorem): minimize $\lVert Y - K\alpha \rVert^2 + \lambda \alpha^\top K \alpha$ (sketched below)
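Setting the gradient of the regression form to zero gives $(K + \lambda I)\alpha = Y$ (when $K$ is invertible). A minimal NumPy sketch of that solve, assuming a Gaussian kernel; the `bandwidth` and `lam` values are illustrative, not from the notes:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def fit_kernel_ridge(X, y, lam=0.1, bandwidth=1.0):
    """Minimize ||y - K alpha||^2 + lam * alpha' K alpha.
    Zeroing the gradient gives (K + lam I) alpha = y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Usage on toy data:
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)
alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```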

# Statistical Data Mining: Clustering

## Supervised vs. unsupervised

• Regression is supervised: if you get it wrong you know!
• Clustering is unsupervised: you have your clusters, I have mine, which are better? Who knows?
• Still important EVEN if we can't tell whether we are doing it well

## Natural kinds

• A natural kind is a set of objects that have truly similar properties.
• Should hold across time
• Hence: useful for prediction
• Examples: raven, duck, black, white
• Non-examples: cool, "pink is the new black" (as in fashion, where neither pink nor black is a natural kind), grue
• Talk about grue for a while:
• Incorrect definition of a grue: "It is pitch black. You are likely to be eaten by a grue." (Zork)
• Grue: looks green before 2000, blue after 2000 (Goodman's riddle of induction)
• From Calvin and Hobbes (10/29/89)
• C: Dad, how come old photographs are always black and white? Didn't they have color film back then?
• D: Sure they did. In fact, those old photographs ARE in color. It's just the WORLD was black and white then.
• C: Really?
• D: Yep. The world didn't turn color until sometime in the 1930s, and it was pretty grainy color for a while, too.
• C: That's really weird.
• D: Well, truth is stranger than fiction.
• C: But then why are old PAINTINGS in color?! If the world was black and white, wouldn't artists have painted it that way?
• D: Not necessarily. A lot of great artists were insane.
• C: But... but how could they have painted in color anyway? Wouldn't their paints have been shades of gray back then?
• D: Of course, but they turned colors like everything else in the '30s.
• C: So why didn't old black and white photos turn color too?
• D: Because they were color pictures of black and white, remember?
• Goal of clustering is to find natural kinds

## Good natural kinds should be easy to separate

• As usual, think high dimensionally
• Try separating cats from dogs.
• weight
• fuzziness of tail
• friendliness
• food likes
• etc
• No single property may give perfect separation
• But combined, the two kinds can be many hundreds of SDs apart
• Disjoint when considering a score like weight - fuzziness + friendliness + ... (simulated below)
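A quick simulation of this point (the per-feature gap of 0.5 SD and d = 400 are made-up numbers): no single feature separates the kinds well, but the equal-weight score separates them by roughly $\delta\sqrt{d} = 10$ SDs, which in practice means disjoint groups:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 400, 1000, 0.5   # 400 features, each separating by only 0.5 SD

cats = rng.normal(0.0, 1.0, size=(n, d))
dogs = rng.normal(delta, 1.0, size=(n, d))

# Any single feature overlaps badly (a 0.5-SD gap misclassifies ~40%),
# but the equal-weight score pools the evidence across all d features.
cat_scores, dog_scores = cats.mean(axis=1), dogs.mean(axis=1)
gap_in_sds = (dog_scores.mean() - cat_scores.mean()) / cat_scores.std()
print(f"combined separation: {gap_in_sds:.1f} SDs")  # ~ 0.5 * sqrt(400) = 10
```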

## Blackboards are a bad model

• Three groups drawn in 2-D is not a good model
• Instead: 3 groups in 100-D, which is hard to draw
• But we can project the 3 groups back down to 2-D and keep the separation (sketched below)
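A sketch of that move, assuming PCA (computed via SVD) as the projection; all sizes and scales here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per_group = 100, 200
centers = rng.normal(0, 5, size=(3, d))   # 3 well-separated centers in 100-D
X = np.vstack([c + rng.normal(0, 1, size=(n_per_group, d)) for c in centers])

# Project onto the top two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T   # 600 points in 2-D; the three groups stay visibly apart
```

This works so cleanly because three centers span at most a 2-D affine subspace, so the top two components capture essentially all the between-group variation.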

## K-means algorithm

• Randomly assign points to groups
• Compute the center of each group
• Reassign each point to its closest center
• Repeat until the assignments stop changing (sketched below)
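A minimal NumPy sketch of these steps; the empty-cluster re-seeding is one common convention, not part of the recipe above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random initial assignment, then alternate
    center computation and nearest-center reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))    # randomly assign groups
    for _ in range(n_iters):
        # Center of each group; re-seed an empty group from a random point.
        centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any()
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign every point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if (new_labels == labels).all():        # converged: stop
            break
        labels = new_labels
    return labels, centers
```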

## Optimization

• Local search: each step weakly decreases $\sum_j \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2$ (written out below)
• Monotone decrease only shows it won't cycle; it reaches a local optimum, not the global one
• But since we don't have a real loss function for clustering anyway, no need to actually optimize it.
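The bookkeeping behind "it won't cycle", written out once (with $c(i)$ the group of point $i$ and $\mu_j$ the center of group $j$):

```latex
J(c,\mu) \;=\; \sum_i \lVert x_i - \mu_{c(i)} \rVert^2
% Reassignment fixes \mu and sets c(i) = \arg\min_j \lVert x_i - \mu_j \rVert,
% so J cannot increase. The update fixes c and sets \mu_j to the group mean,
% the minimizer of \sum_{i \in C_j} \lVert x_i - \mu \rVert^2, so J cannot
% increase either. Since J never increases and there are only finitely many
% assignments, no configuration can repeat: the algorithm terminates,
% though only at a local optimum.
```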

## Testing the clusters

• If the clusters are good, they should predict other features that weren't used in the clustering (say, the number of tandem repeats in the DNA of cats vs. dogs) (sketched below)
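One way to make this check concrete: cluster on one block of features, then score how much of a held-out feature's variance the cluster labels explain. A sketch of such a score (the function name and its R²-style form are my convention, not from the notes):

```python
import numpy as np

def holdout_r2(x_heldout, labels):
    """Fraction of a held-out feature's variance explained by cluster
    membership (law of total variance: total = within + between)."""
    total = x_heldout.var()
    within = sum(
        (labels == j).mean() * x_heldout[labels == j].var()
        for j in np.unique(labels)
    )
    return 1.0 - within / total
```

Good clusters should score far above what random relabelings of the same points achieve; shuffling `labels` gives a quick baseline.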

Nice theory by Tali Tishby in his talk "Feature complexity and generalization - the missing dimension of learning?"

• Good clusters should make better features for supervised learning. (Abishek is working on this).
• The total variability should go down a lot, more than one would expect by chance alone. BUT how much should it go down by chance?
• Consider clustering pure noise: a multivariate normal (dimension = d, number of clusters = k)
• Easier to analyze if we use an exemplar (an actual data point) as the center rather than the mean (sometimes called k-medoids)
• If $d \gg k$: essentially no improvement on noise
• If $d \approx \log(k)$: massive overfitting (simulated below)
• But what matters is the effective dimension: one large eigenvalue makes the data effectively 1-dimensional, so overfitting would occur even when the nominal d is large.
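A simulation of the by-chance drop, reusing the `kmeans` sketch above on pure Gaussian noise; the constants are illustrative and only the qualitative pattern matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 8                          # note log2(8) = 3

for d in (3, 100):
    X = rng.normal(size=(n, d))         # pure noise: no real clusters
    labels, centers = kmeans(X, k)      # kmeans from the sketch above
    within = ((X - centers[labels]) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    print(f"d = {d:3d}: within/total = {within / total:.2f}")

# Roughly: at d = log2(k) the within-cluster SS drops a lot (fitting noise);
# at d >> k it barely moves (no room to manufacture structure).
```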

## Natural kinds don't need perfect algorithms

Natural kinds should be well separated; all other clusters are spurious. So finding them is really a CS problem rather than a statistical-significance problem.
• No need to optimally subselect features, since there should be good separation in many directions.
• No need for carefully dealing with the boundary.
• No need for soft clustering

dean@foster.net