# Statistics 541: Influence

• new homework problem

## Homework

Consider the following data on cleaning crews (cleaning.jmp). We expect that the number of rooms cleaned will be linear in the number of crews sent to clean them. In fact, we might even expect that if we send zero crews, we will get zero rooms cleaned. We want Y to be the number of rooms cleaned and X to be the number of crews sent out. How many rooms does each crew clean?
1. Plot the data, run a simple regression and create prediction bounds. Does the data appear homoskedastic?

2. Transform the data to do a weighted least squares. Try using both a standard deviation proportional to X and one proportional to the square root of X. Plot both. Which appears to be more homoskedastic?

3. Use the White estimator (the sandwich estimator) to generate standard errors for the slope and intercept.

4. Discussion question: (Please type up a one-page answer to the following.) Compare the confidence intervals for the slope in each of the methods above. Which ones do you believe? Are the ones that are theoretically wrong qualitatively wrong? Our theory suggests that the intercept should be zero. Which is the correct test to run? Do we fail to reject the null? Do any of the other tests incorrectly reject the null?

5. Pick the weighted least squares model that appears to be the most homoskedastic. Now use the White estimator on that model. Does it change the SE's very much?

6. The envelope please: Add up all the rooms cleaned. Add up all the crews. Divide these two to come up with an average number of rooms cleaned per crew. This should match one of the slopes you computed above. (You could also compute a standard error by hand to see which confidence intervals above are the closest to describing the right answer. But you don't have to do this.)
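Since cleaning.jmp isn't reproduced here, the computations in problems 1–3 and 6 can be sketched on made-up stand-in data (the true slope of 4 and the noise scale below are invented for illustration):

```python
import numpy as np

# Hypothetical stand-in data: rooms cleaned grows linearly in crews,
# with error sd growing with x (so the data are heteroskedastic).
rng = np.random.default_rng(0)
x = np.repeat(np.arange(2.0, 17.0), 4)             # crews sent out
y = 4.0 * x + rng.normal(0, 0.8 * np.sqrt(x))      # rooms cleaned

X = np.column_stack([np.ones_like(x), x])          # intercept + slope

# 1. OLS
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# 2. WLS via transformation, assuming sd proportional to sqrt(x):
#    divide every row through by sqrt(x) so the errors become homoskedastic
w = np.sqrt(x)
Xw, yw = X / w[:, None], y / w
beta_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# 3. White / sandwich covariance for the OLS fit
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (e**2)[:, None])
se_white = np.sqrt(np.diag(bread @ meat @ bread))

# 6. the "envelope": the no-intercept WLS with sd proportional to sqrt(x)
#    reduces exactly to the ratio sum(y)/sum(x)
xt = x / w                                         # transformed no-intercept regressor
b_ratio = (xt @ yw) / (xt @ xt)
print(beta_ols, beta_wls, se_white, b_ratio, y.sum() / x.sum())
```

The last two printed numbers agree: minimizing sum((y - b*x)^2 / x) gives b = sum(y)/sum(x), which is why the envelope answer in problem 6 should match one of the fitted slopes.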

## Leverage

• How much do betas depend on a single point? (this leads to p x n matrix of leverages)
• How much do predictions depend on a single point? (this leads to a vector of n leverages)
• How much does prediction of i depend on value of i?

• prediction is X_i beta-hat
• beta-hat = (X'X)^(-1) X'Y
• prediction is X_i (X'X)^(-1) X'Y
• To convert Y into predictions: (X (X'X)^(-1) X') Y
• hat matrix: H = X (X'X)^(-1) X'
• Called the projection matrix, or the "hat matrix" since it puts a hat on y.
• h_ii is d(y-hat_i)/d(y_i), called the leverage

• Nice property: depends ONLY on X's. So if you decide to toss a point out based on its leverage, you aren't biasing your results.
• Use Mahalanobis distance to "see" leverage
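A minimal sketch (with made-up X) of the hat matrix and its basic facts: it is a projection, its trace is p, and its diagonal is the leverage, which is Mahalanobis distance in disguise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: y-hat = H y
h = np.diag(H)                         # leverages h_ii = d(y-hat_i)/d(y_i)

# projection facts: H @ H = H, trace(H) = p, and with an intercept h_ii >= 1/n
print(np.allclose(H @ H, H), h.sum(), h.min())

# leverage only depends on the X's; with one predictor,
# h_ii = 1/n + d_i^2/(n-1) where d_i is the Mahalanobis distance of x_i
d2 = (X[:, 1] - X[:, 1].mean())**2 / X[:, 1].var(ddof=1)
print(np.allclose(h, 1/n + d2/(n - 1)))
```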

## Influence = leverage x outlier

• A point that isn't leveraged doesn't affect the fit very much
• A point that isn't an outlier doesn't affect the fit very much
• need both to be influential
• Draw pictures of various outliers (the MBA course calls them cottages, direct mail, and crime)

## Various definitions of influence

• DFFITS = change in forecast i if you leave out observation i

• (y-hat_i - y-hat_{i,-i}) / SE(y-hat_i)
• equivalently, R-student_i x sqrt(h_ii / (1 - h_ii))
• looking at squared values, and noting that h_ii is near zero for most points, leads to the influence = leverage x outlier idea
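The two forms of DFFITS above are the same number; a sketch on made-up data checks the R-student formula against the brute-force leave-one-out definition:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)

# R-student (externally studentized residual): variance estimated without obs i
s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_loo * (1 - h))

dffits = t * np.sqrt(h / (1 - h))      # R-student * sqrt(h/(1-h))

# brute-force check of the leave-one-out definition for observation 0
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
change = X[i] @ (XtX_inv @ X.T @ y) - X[i] @ bi    # y-hat_i minus LOO forecast
print(dffits[i], change / np.sqrt(s2_loo[i] * h[i]))
```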

• DFBETAS = change in slope j if you leave out observation i

• no formula as tidy as DFFITS'
• no longer a simple function of leverage (h_ii) alone
• depends on the scale of beta: changing the scale of X changes the value of the raw DFBETA (the standardized DFBETAS divides by SE(b_j) to undo this)
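DFBETAs can always be computed by brute force, refitting without each point; a sketch on made-up data also checks this against the standard rank-one update identity b - b_(i) = (X'X)^(-1) x_i e_i / (1 - h_ii):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
h = np.diag(X @ XtX_inv @ X.T)
e = y - X @ b

# brute force: drop each observation and refit
dfbeta = np.empty((n, p))
for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    dfbeta[i] = b - np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)

# rank-one (Sherman-Morrison) shortcut: row i is (X'X)^(-1) x_i e_i / (1 - h_i)
shortcut = (X @ XtX_inv) * (e / (1 - h))[:, None]
print(np.allclose(dfbeta, shortcut))
```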

• DFALL = Cook's D = change in all betas = change in all predictions

• use "natural" parameter space/prediction loss
• Cooks D is squared-distance change in adding point i
• tells how much parameters in natural basis
• tells how much all the predictions change on average
• r2h/(p(1-h))
• again justifies outlier = leverage x outlier
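A sketch on made-up data of the r^2 h / (p(1-h)) formula, checked against the "change in all betas" definition, i.e. the squared distance (b - b_(i))' X'X (b - b_(i)) / (p s^2) in the natural metric:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)
h = np.diag(X @ np.linalg.inv(XtX) @ X.T)
e = y - X @ b
s2 = e @ e / (n - p)

r = e / np.sqrt(s2 * (1 - h))          # standardized (internally studentized) residual
cooks_d = r**2 * h / (p * (1 - h))     # the formula from the notes

# check against "change in all betas" for observation 0:
# refit without the point, measure the move in the X'X metric, scale by p*s^2
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
d = b - np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
print(cooks_d[i], d @ XtX @ d / (p * s2))
```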