• What to do with all those clusters?
• If clusters are "natural kinds" then additive models of clusters make sense.
• Can add them to a regression (as indicator variables)
• Not all that interesting if only one set of clusters
• But, if there are many sets of clusters, additive models could generate interesting models.
• Alternatively, can use clusters in a tree.
• Suppose there are cluster types A,B,...,C
• Draw tree built out of these.
• Trees of natural kinds is motivation for:
• CART
• ID3
• C4.5 / C5.0
• MARS
• Context trees

# Trees

• Too expensive to run clustering at every step?
• If the natural kinds are clean enough, maybe only one variable would be enough.
• Motivation for CART
• Build tree using splits on each X variable

## Forecasting with a tree

• fits is the average in the node
• Draw picture: Note it is kinda bumpy
• In leaf nodes, use average (This is the description for CART)
• In leaf node, use most popular value in classification (This is the idea for C4.5)

## How many trees are there

• p choices for top. n split points. So np total top nodes.
• Each child node (there are 2) has basically same picture.
• If depth is d, there are 2d nodes each with pn choices.
• If d = log(p), this is same size as all regression models

## Need heuristic search

• Heuristic function: current fit
• So pick best fit at each split
• where "Best" is defined to be LS for example

## Pruning trees

• As a forester (3nd generation) I know pruning trees is important
• Easy to consider lots of comparisions
• Allows seperation of searching from purning
• All one needs is a good fitting function
• See for example: ( Variable Length Markov Chains. P. Buhlmann and A.J. Wyner, The Annals of Statistics, Vol. 27, No. 2, pp. 480-513, 1999.)

## Random forests

• Suppose truth is additive model in k variables
• Requires k parameters in usual version of regression
• requires 2k parameters (each based on only n2-k data points) to fit a tree
• Our solution: drop a bunch of trees into a regression
• Existing solution: average over a bunch of trees: called random forests