Administrivia
- What to do with all those clusters?
- If clusters are "natural kinds" then additive models of
clusters make sense.
- Can add them to a regression (as indicator variables; see the
  sketch after this list)
- Not all that interesting if only one set of clusters
- But, if there are many sets of clusters, additive models could
generate interesting models.
- Alternatively, can use clusters in a tree.
- Suppose there are cluster types A, B, ..., C
- Draw tree built out of these.
- Trees of natural kinds are the motivation for:
- CART
- ID3
- C4.5 / C5.0
- MARS
- Context trees
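- A minimal sketch of the indicator-variable idea (assuming numpy
  arrays of cluster labels and responses; the function name
  cluster_regression and the use of least squares are my choices, not
  from the notes):

      import numpy as np

      def cluster_regression(clusters, y):
          """Turn cluster labels into indicator (dummy) variables and
          regress y on them; each coefficient is just a cluster mean."""
          clusters = np.asarray(clusters)
          labels = np.unique(clusters)
          X = (clusters[:, None] == labels[None, :]).astype(float)  # n-by-k indicators
          coef, *_ = np.linalg.lstsq(X, y, rcond=None)
          return dict(zip(labels, coef))

  With several different sets of clusters, stack their indicator
  blocks side by side to get the additive model described above.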
Trees
- Too expensive to run clustering at every step?
- If the natural kinds are clean enough, maybe splitting on a single
  variable at a time is enough.
- Motivation for CART
- Build the tree out of splits on one X variable at a time
Forecasting with a tree
- The fitted value is the average of the responses in the node
- Draw picture: note the fit is piecewise constant, so it looks kind
  of bumpy
- In leaf nodes, use the average (this is the rule for CART regression)
- In leaf nodes, use the most popular class for classification (this
  is the idea behind C4.5)
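- A minimal sketch of the two leaf rules (function names are mine),
  given the responses that land in a leaf:

      import numpy as np
      from collections import Counter

      def leaf_forecast_regression(y_in_leaf):
          # CART-style regression: forecast with the leaf average
          return np.mean(y_in_leaf)

      def leaf_forecast_classification(y_in_leaf):
          # C4.5-style classification: forecast the most popular class
          return Counter(y_in_leaf).most_common(1)[0][0]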
How many trees are there?
- p choices of split variable at the top; about n split points each.
  So about np possible top nodes.
- Each child node (there are 2) has basically the same picture.
- If the depth is d, there are 2^d nodes, each with pn choices.
- If d = log(p), this is about the same size as the space of all
  regression models
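- A rough count (my own back-of-the-envelope, not from the notes):
  each internal node has about np choices, and a complete binary tree
  of depth d has 2^d - 1 internal nodes, so

      \#\{\text{trees of depth } d\} \approx (np)^{2^d - 1}

  With d = \log_2 p this is (np)^{p-1}, which, like the 2^p models of
  all-subsets regression, grows exponentially in p, hence the need for
  a heuristic search.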
Need heuristic search
- Heuristic function: current fit
- So pick best fit at each split
- where "Best" is defined to be LS for example
Pruning trees
- As a forester (3rd generation) I know pruning trees is important
- Easy to consider lots of comparisons
- Allows separation of searching from pruning
- All one needs is a good fitting function
- See for example: Variable Length Markov Chains, P. Buhlmann and
  A. J. Wyner, The Annals of Statistics, Vol. 27, No. 2, pp. 480-513,
  1999.
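- One standard choice of "good fitting function" for pruning is
  CART's cost-complexity criterion (notation mine):

      R_\alpha(T) = \sum_{\ell \in \mathrm{leaves}(T)} \mathrm{SSE}(\ell) + \alpha\,|\mathrm{leaves}(T)|

  Roughly, collapse any split whose removal does not increase
  R_\alpha; varying \alpha gives a nested sequence of pruned trees to
  compare.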
Random forests
- Suppose truth is additive model in k variables
- Requires k parameters in usual version of regression
- Requires 2^k parameters (each based on only about n/2^k data
  points) to fit a tree
- Our solution: drop a bunch of trees into a regression
- Existing solution: average over a bunch of trees, which is called a
  random forest
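- A minimal sketch of the averaging idea (a hand-rolled bagging loop
  around scikit-learn's DecisionTreeRegressor; the parameter choices
  are mine, and this shows only the standard averaging, not the
  trees-in-a-regression idea above):

      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      def random_forest_forecast(X, y, X_new, n_trees=100, seed=0):
          """Fit each tree on a bootstrap sample, restricting each split
          to a random subset of variables, then average the forecasts."""
          rng = np.random.default_rng(seed)
          n = len(y)
          forecasts = []
          for _ in range(n_trees):
              rows = rng.integers(0, n, n)      # bootstrap sample of the rows
              tree = DecisionTreeRegressor(
                  max_features="sqrt",          # random subset of variables per split
                  random_state=int(rng.integers(1 << 31)))
              tree.fit(X[rows], y[rows])
              forecasts.append(tree.predict(X_new))
          return np.mean(forecasts, axis=0)     # average over the bunch of trees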