## Least squares using RKHS

• You say MAP, I say LS.
• Optimizing the MAP objective is a form of LS:
• as functions: minimize $\sum_i (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2$
• as regression (writing $f = \sum_i \alpha_i k(\cdot, x_i)$ by the representer theorem): minimize $\lVert Y - K\alpha \rVert^2 + \lambda \alpha^\top K \alpha$ (sketched below)
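Setting the gradient of the regression form to zero gives $(K + \lambda I)\alpha = Y$ (when $K$ is invertible). A minimal NumPy sketch of that solve, assuming a Gaussian kernel; the `bandwidth` and `lam` values are illustrative, not from the notes:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def fit_kernel_ridge(X, y, lam=0.1, bandwidth=1.0):
    """Minimize ||y - K alpha||^2 + lam * alpha' K alpha.
    Zeroing the gradient gives (K + lam I) alpha = y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Usage on toy data:
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)
alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```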

# Statistical Data Mining: Clustering

## Supervised vs. unsupervised

• Regression is supervised: if you get it wrong you know!
• Clustering is unsupervised: you have your clusters, I have mine, which are better? Who knows?
• Still important EVEN if we can't tell whether we are doing it well

## Natural kinds

• A natural kind is a set of objects that have truly similar properties.
• Should hold across time
• Hence: useful for prediction
• Examples: raven, duck, black, white
• Non-examples: cool, "pink is the new black" (as in fashion, where neither pink nor black is a natural kind), grue
• Talk about grue for a while:
• Incorrect definition of a grue: "It is pitch black. You are likely to be eaten by a grue." (Zork)
• Grue: looks green before 2000, blue after 2000 (Goodman's riddle of induction)
• From Calvin and Hobbes (10/29/89)
• C: Dad, how come old photographs are always black and white? Didn't they have color film back then?
• D: Sure they did. In fact, those old photographs ARE in color. It's just the WORLD was black and white then.
• C: Really?
• D: Yep. The world didn't turn color until sometime in the 1930s, and it was pretty grainy color for a while, too.
• C: That's really weird.
• D: Well, truth is stranger than fiction.
• C: But then why are old PAINTINGS in color?! If the world was black and white, wouldn't artists have painted it that way?
• D: Not necessarily. A lot of great artists were insane.
• C: But... but how could they have painted in color anyway? Wouldn't their paints have been shades of gray back then?
• D: Of course, but they turned colors like everything else in the '30s.
• C: So why didn't old black and white photos turn color too?
• D: Because they were color pictures of black and white, remember?
• Goal of clustering is to find natural kinds

## Good natural kinds should be easy to separate

• As usual, think high dimensionally
• Try separating cats from dogs.
• weight
• fuzziness of tail
• friendliness
• food likes
• etc
• No single property may give perfect separation
• But combined, the two kinds can be many hundreds of SDs apart
• Disjoint when considering a score like weight - fuzziness + friendliness + ... (simulated below)
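A quick simulation of this point (the per-feature gap of 0.5 SD and d = 400 are made-up numbers): no single feature separates the kinds well, but the equal-weight score separates them by roughly $\delta\sqrt{d} = 10$ SDs, which in practice means disjoint groups:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 400, 1000, 0.5   # 400 features, each separating by only 0.5 SD

cats = rng.normal(0.0, 1.0, size=(n, d))
dogs = rng.normal(delta, 1.0, size=(n, d))

# Any single feature overlaps badly (a 0.5-SD gap misclassifies ~40%),
# but the equal-weight score pools the evidence across all d features.
cat_scores, dog_scores = cats.mean(axis=1), dogs.mean(axis=1)
gap_in_sds = (dog_scores.mean() - cat_scores.mean()) / cat_scores.std()
print(f"combined separation: {gap_in_sds:.1f} SDs")  # ~ 0.5 * sqrt(400) = 10
```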

## Blackboards are a bad model

• Three groups drawn in 2-D is not a good model
• Instead: 3 groups in 100-D, which is hard to draw
• But we can project the 3 groups back down to 2-D and keep the separation (sketched below)
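A sketch of that move, assuming PCA (computed via SVD) as the projection; all sizes and scales here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per_group = 100, 200
centers = rng.normal(0, 5, size=(3, d))   # 3 well-separated centers in 100-D
X = np.vstack([c + rng.normal(0, 1, size=(n_per_group, d)) for c in centers])

# Project onto the top two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T   # 600 points in 2-D; the three groups stay visibly apart
```

This works so cleanly because three centers span at most a 2-D affine subspace, so the top two components capture essentially all the between-group variation.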

## K-means algorithm

• Randomly assign points to groups
• Compute the center of each group
• Reassign each point to its closest center
• Repeat until the assignments stop changing (sketched below)
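A minimal NumPy sketch of these steps; the empty-cluster re-seeding is one common convention, not part of the recipe above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random initial assignment, then alternate
    center computation and nearest-center reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))    # randomly assign groups
    for _ in range(n_iters):
        # Center of each group; re-seed an empty group from a random point.
        centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any()
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign every point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if (new_labels == labels).all():        # converged: stop
            break
        labels = new_labels
    return labels, centers
```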

## Optimization

• Local search: each step weakly decreases $\sum_j \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2$ (written out below)
• Monotone decrease only shows it won't cycle; it reaches a local optimum, not the global one
• But since we don't have a real loss function for clustering anyway, no need to actually optimize it.
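The bookkeeping behind "it won't cycle", written out once (with $c(i)$ the group of point $i$ and $\mu_j$ the center of group $j$):

```latex
J(c,\mu) \;=\; \sum_i \lVert x_i - \mu_{c(i)} \rVert^2
% Reassignment fixes \mu and sets c(i) = \arg\min_j \lVert x_i - \mu_j \rVert,
% so J cannot increase. The update fixes c and sets \mu_j to the group mean,
% the minimizer of \sum_{i \in C_j} \lVert x_i - \mu \rVert^2, so J cannot
% increase either. Since J never increases and there are only finitely many
% assignments, no configuration can repeat: the algorithm terminates,
% though only at a local optimum.
```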

## Testing the clusters

• If the clusters are good, they should predict other features that weren't used in the clustering (say, the number of tandem repeats in the DNA of cats vs. dogs) (sketched below)
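One way to make this check concrete: cluster on one block of features, then score how much of a held-out feature's variance the cluster labels explain. A sketch of such a score (the function name and its R²-style form are my convention, not from the notes):

```python
import numpy as np

def holdout_r2(x_heldout, labels):
    """Fraction of a held-out feature's variance explained by cluster
    membership (law of total variance: total = within + between)."""
    total = x_heldout.var()
    within = sum(
        (labels == j).mean() * x_heldout[labels == j].var()
        for j in np.unique(labels)
    )
    return 1.0 - within / total
```

Good clusters should score far above what random relabelings of the same points achieve; shuffling `labels` gives a quick baseline.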

Nice theory by Tali Tishby in his talk "Feature complexity and generalization - the missing dimension of learning?"

• Good clusters should make better features for supervised learning. (Abishek is working on this).
• The total variability should go down a lot, more than one would expect by chance alone. BUT how much should it go down by chance?
• Consider clustering pure noise: a multivariate normal (dimension = d, number of clusters = k)
• Easier to analyze if we use an exemplar (an actual data point) as the center rather than the mean (sometimes called k-medoids)
• If $d \gg k$: essentially no improvement on noise
• If $d \approx \log(k)$: massive overfitting (simulated below)
• But what matters is the effective dimension: one large eigenvalue makes the data effectively 1-dimensional, so overfitting would occur even when the nominal d is large.
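A simulation of the by-chance drop, reusing the `kmeans` sketch above on pure Gaussian noise; the constants are illustrative and only the qualitative pattern matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 8                          # note log2(8) = 3

for d in (3, 100):
    X = rng.normal(size=(n, d))         # pure noise: no real clusters
    labels, centers = kmeans(X, k)      # kmeans from the sketch above
    within = ((X - centers[labels]) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    print(f"d = {d:3d}: within/total = {within / total:.2f}")

# Roughly: at d = log2(k) the within-cluster SS drops a lot (fitting noise);
# at d >> k it barely moves (no room to manufacture structure).
```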

## Natural kinds don't need perfect algorithms

Natural kinds should be well separated; all other clusters are spurious. So finding them is really a CS problem rather than a statistical-significance problem.
• No need to optimally subselect features, since there should be good separation in many directions.
• No need for carefully dealing with the boundary.
• No need for soft clustering

dean@foster.net