Class 1. Stat701

Concise regression review notes are available from the Stat608 1997 homepage.

Today.

: Review of simple linear regression.
: New idea. Heavy tailed residual distributions and what they may indicate.
: Extending categorical variables to "broken stick" models.

Illustrations: 30 years of returns on gold.

The model for the mean relationship: $E(Y \mid X) = \beta_0 + \beta_1 X$

The model for the raw data: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

This is the straight line or linear model.

Assumptions are mostly on the error terms $\epsilon_i$.

: Independent
: Constant variance. Mean zero.
: Approximately normally distributed.

Biggest problems. Dependence, skewness and non-constant variance.

: Call the $\epsilon_i$ the "true error terms".
: Vertical distance from point to the true line. $\epsilon_i = Y_i - \beta_0 - \beta_1 X_i$.
: We don't know them as we don't know the regression line.
: Substitute with the "residuals", estimated error terms.
: Vertical distance from point to estimated regression line. $e_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$.
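
The distinction between true error terms and residuals can be sketched with simulated data. This is a minimal illustration, not the course's own code: the intercept, slope, error SD, sample size and seed are made-up values, and `fit_line` is a hypothetical helper that computes the usual least squares estimates (`b0`, `b1`).

```python
import random

def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Simulated data: the "true" line is known here only because we made it up.
rng = random.Random(701)
beta0, beta1, sigma = 2.0, 0.5, 1.0            # assumed values for illustration
x = [rng.uniform(0, 10) for _ in range(50)]
eps = [rng.gauss(0, sigma) for _ in x]         # true error terms (normally unknown)
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

# In practice only x and y are observed, so we estimate the line
# and use the residuals in place of the true error terms.
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, least squares residuals always sum to zero (up to rounding), which the true error terms need not do.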

ALWAYS check assumptions on the residuals.

Why so important?

: Standard least squares regression is sensitive to individual data points.
: A single point can dominate the regression.
: Everything you say and conclude may be driven by a single data point.
: Residual plots are one of the tools available to help identify these points.
: Even if you keep such a point, it is important to know that it is there.
: Inference, p-values, CIs, etc. are only valid if the assumptions hold.
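
The sensitivity to a single point can be seen in a small made-up example. This is a sketch under stated assumptions: the ten "ordinary" points and the one extreme point are invented for illustration, and `slope` is a hypothetical helper giving the least squares slope.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx

# Ten invented points with essentially no trend in y.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6]

slope_without = slope(x, y)                 # close to zero

# Add one invented point far from the rest in both x and y.
slope_with = slope(x + [40], y + [40])      # close to one
```

One point out of eleven changes the conclusion from "no relationship" to "strong positive relationship", which is why it matters to know the point is there.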

Key diagnostics.

1. Residual plot. Good plots lack structure.

2. Normal scores plot of the residuals.

More depth.

What's behind a normal scores plot?

Most model diagnostics (here the model is the normal distribution for the error terms) compare reality (what we observe) to theory (what we expect). In general OBSERVED versus EXPECTED.

This is how the normal scores plot is constructed.

On the X-axis is what we expect.

On the Y-axis is what we observe.

The idea is simple: say there were 100 observations (n = 100) and therefore 100 residuals.

: The model says that the residuals come from an approximately normal distribution.
: Now order the residuals from lowest to highest.
: Where would you EXPECT the smallest of 100 observations from a NORMAL distribution to lie?
: Plot where you expect it to be against where it actually is.
: Repeat for the other 99 points.
: If the model is correct then theory and reality should coincide, i.e. observed equals expected, and the points should roughly (because there's inherent variability) lie along a line.
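
The construction above can be sketched in a few lines. This is an illustration only: the residuals are simulated stand-ins, and the expected positions use Blom's plotting positions (i - 3/8) / (n + 1/4), which is one common approximation and an assumption here, since the notes do not say which formula is used.

```python
import random
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected positions (in SD units) of n ordered normal observations.

    Uses Blom's plotting positions; other conventions exist.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

rng = random.Random(1)
residuals = [rng.gauss(0, 2.0) for _ in range(100)]   # simulated stand-in residuals

observed = sorted(residuals)               # Y-axis: what we observe, lowest to highest
expected = normal_scores(len(observed))    # X-axis: what we expect under normality

# Each (expected, observed) pair is one point of the normal scores plot.
points = list(zip(expected, observed))
```

The expected values are in standard deviation units, so for normal residuals the points fall near a straight line whose slope is roughly the residual SD.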

Extension.

There is no reason why the X-axis has to use the normal distribution. Perhaps the data have a gamma distribution (useful for life length data). You just calculate where you EXPECT the data to be if a gamma distribution is true. These more general plots are called "Quantile-Quantile plots" or Q-Q plots.
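
The same recipe with a different reference distribution might look like the sketch below. To stay within the standard library it uses the exponential distribution, which is the gamma with shape 1 and has a closed-form quantile function; a general gamma would need a numerical quantile routine. The simulated lifetimes, the plotting positions (i - 1/2) / n and the helper name `exponential_quantiles` are all assumptions for illustration.

```python
import math
import random

def exponential_quantiles(n, scale=1.0):
    """Expected positions of n ordered observations under an exponential
    distribution (gamma with shape 1) whose mean is `scale`."""
    return [-scale * math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]

rng = random.Random(2)
lifetimes = [rng.expovariate(1.0 / 3.0) for _ in range(200)]   # simulated, mean 3

observed = sorted(lifetimes)                                   # Y-axis
sample_mean = sum(observed) / len(observed)
expected = exponential_quantiles(len(observed), scale=sample_mean)  # X-axis

# Points of the Q-Q plot: roughly a straight line if the model fits.
qq_points = list(zip(expected, observed))
```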

Heavy tailed residuals.

The ends of the normal scores plot have greater slopes than the reference line because the observations in the tails are spreading out more than the normal theory predicts.

One reason for heavy tails: the residuals come from TWO groups with different variances. Pooling two such mean zero, normal groups always leads to heavy tails. Interpretation: two different volatility regimes, low and high.
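
A quick simulation illustrates the two-group explanation. This is a sketch with invented numbers: the group SDs (1 and 3), the group sizes and the seed are assumptions, and excess kurtosis is used here as a simple numerical summary of tail heaviness (about 0 for normal data, positive for heavy tails).

```python
import random

def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normal data, positive for heavy tails."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(30)
low = [rng.gauss(0, 1.0) for _ in range(5000)]    # low-volatility regime (SD 1, assumed)
high = [rng.gauss(0, 3.0) for _ in range(5000)]   # high-volatility regime (SD 3, assumed)

single_group = excess_kurtosis(low)        # near 0: one normal group
pooled = excess_kurtosis(low + high)       # clearly positive: heavy tails
```

For equal-sized groups with SDs 1 and 3, the theoretical kurtosis of the pooled distribution is 3 x 41 / 25 = 4.92, against 3 for a single normal, so the pooled residuals look heavy tailed even though each group is normal.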

Graphical observation generates sensible questions.