## Least squares using RKHS

• You say MAP, I say LS.
• Optimizing the MAP objective is a form of LS:
• as functions: minimize $\sum_i (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^2$
• as regression (writing $f = \sum_i \alpha_i k(\cdot, x_i)$ by the representer theorem): minimize $\lVert Y - K\alpha \rVert^2 + \lambda \alpha^\top K \alpha$ (sketched below)
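Setting the gradient of the regression form to zero gives $(K + \lambda I)\alpha = Y$ (when $K$ is invertible). A minimal NumPy sketch of that solve, assuming a Gaussian kernel; the `bandwidth` and `lam` values are illustrative, not from the notes:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def fit_kernel_ridge(X, y, lam=0.1, bandwidth=1.0):
    """Minimize ||y - K alpha||^2 + lam * alpha' K alpha.
    Zeroing the gradient gives (K + lam I) alpha = y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Usage on toy data:
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)
alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```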

# Statistical Data Mining: Clustering

## Supervised vs. unsupervised

• Regression is supervised: if you get it wrong you know!
• Clustering is unsupervised: you have your clusters, I have mine, which are better? Who knows?
• Still important EVEN if we can't tell whether we are doing it well

## Natural kinds

• A natural kind is a set of objects that have truly similar properties.
• Should hold across time
• Hence: useful for prediction
• Examples: raven, duck, black, white
• Non-examples: cool, "pink is the new black" (as in fashion, where neither pink nor black is a natural kind), grue
• Talk about grue for a while:
• Incorrect definition of a grue: "It is pitch black. You are likely to be eaten by a grue." (Zork)
• Grue: looks green before 2000, blue after 2000 (Goodman's riddle of induction)
• From Calvin and Hobbes (10/29/89)
• C: Dad, how come old photographs are always black and white? Didn't they have color film back then?
• D: Sure they did. In fact, those old photographs ARE in color. It's just the WORLD was black and white then.
• C: Really?
• D: Yep. The world didn't turn color until sometime in the 1930s, and it was pretty grainy color for a while, too.
• C: That's really weird.
• D: Well, truth is stranger than fiction.
• C: But then why are old PAINTINGS in color?! If the world was black and white, wouldn't artists have painted it that way?
• D: Not necessarily. A lot of great artists were insane.
• C: But... but how could they have painted in color anyway? Wouldn't their paints have been shades of gray back then?
• D: Of course, but they turned colors like everything else in the '30s.
• C: So why didn't old black and white photos turn color too?
• D: Because they were color pictures of black and white, remember?
• Goal of clustering is to find natural kinds

## Good natural kinds should be easy to separate

• As usual, think high dimensionally
• Try separating cats from dogs.
• weight
• fuzziness of tail
• friendliness
• food likes
• etc
• No single property may give perfect separation
• But combined, the two kinds can be many hundreds of SDs apart
• Disjoint when considering a score like weight - fuzziness + friendliness + ... (simulated below)
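A quick simulation of this point (the per-feature gap of 0.5 SD and d = 400 are made-up numbers): no single feature separates the kinds well, but the equal-weight score separates them by roughly $\delta\sqrt{d} = 10$ SDs, which in practice means disjoint groups:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, delta = 400, 1000, 0.5   # 400 features, each separating by only 0.5 SD

cats = rng.normal(0.0, 1.0, size=(n, d))
dogs = rng.normal(delta, 1.0, size=(n, d))

# Any single feature overlaps badly (a 0.5-SD gap misclassifies ~40%),
# but the equal-weight score pools the evidence across all d features.
cat_scores, dog_scores = cats.mean(axis=1), dogs.mean(axis=1)
gap_in_sds = (dog_scores.mean() - cat_scores.mean()) / cat_scores.std()
print(f"combined separation: {gap_in_sds:.1f} SDs")  # ~ 0.5 * sqrt(400) = 10
```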

## Blackboards are a bad model

• Three groups drawn in 2-D is not a good model
• Instead: 3 groups in 100-D, which is hard to draw
• But we can project the 3 groups back down to 2-D and keep the separation (sketched below)
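A sketch of that move, assuming PCA (computed via SVD) as the projection; all sizes and scales here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per_group = 100, 200
centers = rng.normal(0, 5, size=(3, d))   # 3 well-separated centers in 100-D
X = np.vstack([c + rng.normal(0, 1, size=(n_per_group, d)) for c in centers])

# Project onto the top two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T   # 600 points in 2-D; the three groups stay visibly apart
```

This works so cleanly because three centers span at most a 2-D affine subspace, so the top two components capture essentially all the between-group variation.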

## K-means algorithm

• Randomly assign points to groups
• Compute the center of each group
• Reassign each point to its closest center
• Repeat until the assignments stop changing (sketched below)
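A minimal NumPy sketch of these steps; the empty-cluster re-seeding is one common convention, not part of the recipe above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random initial assignment, then alternate
    center computation and nearest-center reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))    # randomly assign groups
    for _ in range(n_iters):
        # Center of each group; re-seed an empty group from a random point.
        centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any()
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Reassign every point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if (new_labels == labels).all():        # converged: stop
            break
        labels = new_labels
    return labels, centers
```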

## Optimization

• Local search: each step weakly decreases $\sum_j \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2$ (written out below)
• Monotone decrease only shows it won't cycle; it reaches a local optimum, not the global one
• But since we don't have a real loss function for clustering anyway, no need to actually optimize it.
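The bookkeeping behind "it won't cycle", written out once (with $c(i)$ the group of point $i$ and $\mu_j$ the center of group $j$):

```latex
J(c,\mu) \;=\; \sum_i \lVert x_i - \mu_{c(i)} \rVert^2
% Reassignment fixes \mu and sets c(i) = \arg\min_j \lVert x_i - \mu_j \rVert,
% so J cannot increase. The update fixes c and sets \mu_j to the group mean,
% the minimizer of \sum_{i \in C_j} \lVert x_i - \mu \rVert^2, so J cannot
% increase either. Since J never increases and there are only finitely many
% assignments, no configuration can repeat: the algorithm terminates,
% though only at a local optimum.
```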

## Testing the clusters

• If the clusters are good, they should predict other features that weren't used in the clustering (say, the number of tandem repeats in the DNA of cats vs. dogs) (sketched below)
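One way to make this check concrete: cluster on one block of features, then score how much of a held-out feature's variance the cluster labels explain. A sketch of such a score (the function name and its R²-style form are my convention, not from the notes):

```python
import numpy as np

def holdout_r2(x_heldout, labels):
    """Fraction of a held-out feature's variance explained by cluster
    membership (law of total variance: total = within + between)."""
    total = x_heldout.var()
    within = sum(
        (labels == j).mean() * x_heldout[labels == j].var()
        for j in np.unique(labels)
    )
    return 1.0 - within / total
```

Good clusters should score far above what random relabelings of the same points achieve; shuffling `labels` gives a quick baseline.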

Nice theory by Tali Tishby in his talk "Feature complexity and generalization - the missing dimension of learning?"

• Good clusters should make better features for supervised learning. (Abishek is working on this).
• The total variability should go down a lot, more than one would expect by chance alone. BUT how much should it go down by chance?
• Consider clustering pure noise: a multivariate normal (dimension = d, number of clusters = k)
• Easier to analyze if we use an exemplar (an actual data point) as the center rather than the mean (sometimes called k-medoids)
• If $d \gg k$: essentially no improvement on noise
• If $d \approx \log(k)$: massive overfitting (simulated below)
• But what matters is the effective dimension: one large eigenvalue makes the data effectively 1-dimensional, so overfitting would occur even when the nominal d is large.
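A simulation of the by-chance drop, reusing the `kmeans` sketch above on pure Gaussian noise; the constants are illustrative and only the qualitative pattern matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 8                          # note log2(8) = 3

for d in (3, 100):
    X = rng.normal(size=(n, d))         # pure noise: no real clusters
    labels, centers = kmeans(X, k)      # kmeans from the sketch above
    within = ((X - centers[labels]) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    print(f"d = {d:3d}: within/total = {within / total:.2f}")

# Roughly: at d = log2(k) the within-cluster SS drops a lot (fitting noise);
# at d >> k it barely moves (no room to manufacture structure).
```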

## Natural kinds don't need perfect algorithms

Natural kinds should be well separated; all other clusters are spurious. So finding them is really a CS problem rather than a statistical-significance problem.
• No need to optimally subselect features, since there should be good separation in many directions.
• No need for carefully dealing with the boundary.
• No need for soft clustering

dean@foster.net