Class 1. Stat701

Concise regression review notes are available from the Stat608 1997 homepage.

Today.

*
Review of simple linear regression.
*
New idea. Heavy tailed residual distributions and what they may indicate.
*
Extending categorical variables to "broken stick" models.


Illustrations: 30 years of returns on gold.

The model for the mean relationship:

Av(Y|x) = \beta_0 + \beta_1 x

The model for the raw data:

Y = \beta_0 + \beta_1 x + \epsilon

This is the straight line or linear model.

Assumptions are mostly on the error terms \epsilon .

*
Independent
*
Constant variance. Mean zero.
*
Approximately normally distributed.

Biggest problems: dependence, skewness and non-constant variance.

*
Call the \epsilon the "true error terms".
*
Distance from a point to the true line: \epsilon = Y - (\beta_0 + \beta_1 x) .
*
We don't know them because we don't know the true regression line.
*
Substitute the "residuals", the estimated error terms.
*
Distance from a point to the estimated regression line: e = Y - (\hat{\beta}_0 + \hat{\beta}_1 x) .
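A minimal numerical sketch of the line-plus-residuals setup above. The data here are simulated stand-ins (the class illustration uses the gold returns), and the true coefficients 2.0 and 0.5 are assumptions of the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
# Raw-data model: Y = beta0 + beta1*x + eps, with beta0=2.0, beta1=0.5
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=30)

# Least squares estimates (polyfit returns highest-degree coefficient first)
b1, b0 = np.polyfit(x, y, 1)

fitted = b0 + b1 * x
residuals = y - fitted   # e_i = y_i - (b0 + b1*x_i): stand-ins for the unknown eps_i

# With an intercept in the model, least squares residuals sum to (numerically) zero
print(round(b1, 2), round(residuals.sum(), 6))
```

The residuals, not the unobservable true errors, are what every diagnostic below is computed from.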

ALWAYS check assumptions on the residuals.

Why so important?

*
Standard least squares regression is sensitive to individual data points.
*
A single point can dominate the regression.
*
Everything you say and conclude may be driven by a single data point.
*
Residual plots are one of the tools available to help identify these points.
*
Even if you keep it, it is important to know that it is there.
*
Inference (p-values, confidence intervals, etc.) is valid only if the assumptions hold.

Key diagnostics.

1. Residual plot. Good plots lack structure.

2. Normal scores plot of the residuals.

More depth.

What's behind a normal scores plot?

Most model diagnostics (here the model is the normal distribution for the error terms) compare reality (what we observe) to theory (what we expect). In general OBSERVED versus EXPECTED.

This is how the normal scores plot is constructed.

*
On the X-axis is what we expect.
*
On the Y axis is what we observe.
*
The idea is simple: say there were 100 observations (n = 100) and therefore 100 residuals.
*
The model says that the residuals come from an approximate normal distribution.
*
Now order the residuals from lowest to highest.
*
Where would you EXPECT the smallest of 100 observations from a NORMAL distribution to lie?
*
Plot where you expect it to be against where it actually is.
*
Repeat for the other 99 points.
*
If the model is correct then theory and reality should coincide, i.e. observed equals expected, and the points should roughly (because there's inherent variability) lie along a line.
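The construction above can be sketched directly. This assumes simulated residuals and the common (i - 0.5)/n plotting-position approximation for the expected order statistics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=100)   # pretend these are the n = 100 residuals

# Y-axis: what we observe -- the residuals ordered lowest to highest
observed = np.sort(residuals)
n = len(observed)

# X-axis: what we expect -- where the i-th smallest of n standard normal
# draws should lie (plotting-position approximation)
expected = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# If the normal model is correct, (expected, observed) lie near a line
r = np.corrcoef(expected, observed)[0, 1]
print(round(r, 3))
```

Plotting `observed` against `expected` gives the normal scores plot itself; the near-1 correlation is the "points lie along a line" statement in one number.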

Extension.

There is no reason why the X-axis has to use the normal distribution; perhaps the data follow a gamma distribution (useful for life-length data). You just calculate where you EXPECT the data to lie if the gamma distribution is true. These more general plots are called "Quantile-Quantile plots" or Q-Q plots.
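The same recipe with the gamma as the reference distribution: only the quantile function on the X-axis changes. The shape parameter (2.0 here) is assumed known for the sketch; in practice it would be estimated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.gamma(shape=2.0, size=200)   # e.g. life-length data

observed = np.sort(data)
n = len(observed)

# Expected gamma order statistics (shape parameter a assumed known)
expected = stats.gamma.ppf((np.arange(1, n + 1) - 0.5) / n, a=2.0)

r = np.corrcoef(expected, observed)[0, 1]
print(round(r, 3))
```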

Heavy tailed residuals.

The ends of the normal scores plot have greater slopes than the reference line because the observations in the tails spread out more than normal theory predicts.

One reason for heavy tails: the residuals come from TWO groups with different variances. Such a mixture always leads to heavy tails. Interpretation: two different volatility regimes, low and high.
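A quick simulation (synthetic data, with the two group standard deviations 1.0 and 4.0 chosen just for illustration) shows the mixture effect. Excess kurtosis is zero for an exact normal and positive for heavy tails:

```python
import numpy as np

rng = np.random.default_rng(3)
low_vol = rng.normal(scale=1.0, size=5000)    # quiet regime
high_vol = rng.normal(scale=4.0, size=5000)   # volatile regime
mixed = np.concatenate([low_vol, high_vol])   # residuals pooled from both

def excess_kurtosis(x):
    # Standardize, then compare fourth moment to the normal's value of 3
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

print(round(excess_kurtosis(low_vol), 2))   # near 0: one normal group
print(round(excess_kurtosis(mixed), 2))     # clearly positive: heavy tails
```

On a normal scores plot the `mixed` residuals would show exactly the steepened ends described above.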

Graphical observation generates sensible questions.


Categorical variable regression.

Enables comparisons between groups while accounting for other related variables - Amazonian Indian Stress Study.

First case: a single dichotomous variable. Our example: Pre 1980 vs Post 1980.

The way JMP does it: model

Av(Y|x,z) = \beta_0 + \beta_1 x + \beta_2 z

where z = 1 if the observation is in the first group and z = -1 if it is in the second group.

Check to understand the model: plug in z = 1 and -1.

Group 1 model

Av(Y|x,z=1) = \beta_0 + \beta_1 x + \beta_2 (1)

= \beta_0 + \beta_1 x + \beta_2

= (\beta_0 + \beta_2) + \beta_1 x

Group 2 model

Av(Y|x,z=-1) = \beta_0 + \beta_1 x + \beta_2 (-1)

= \beta_0 + \beta_1 x - \beta_2

= (\beta_0 - \beta_2) + \beta_1 x

Compare Group 1 and Group 2.

Av(Y|x,z=1) - Av(Y|x,z=-1) is the difference in height between the two regression lines.

Notes.

*
Both groups have the same slope ( \beta_1 ) - parallel lines.
*
The difference in heights is 2\beta_2 .
*
Recognize that \beta_2 represents a comparison against the "norm".
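A sketch of the +1/-1 coding, using simulated data and ordinary least squares in place of JMP. The true coefficients (1.0, 0.5, 1.5) are assumptions of the simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, n)
z = np.where(np.arange(n) < n // 2, 1.0, -1.0)   # group 1: z = +1, group 2: z = -1
y = 1.0 + 0.5 * x + 1.5 * z + rng.normal(scale=0.5, size=n)

# Fit Av(Y|x,z) = b0 + b1*x + b2*z by least squares
X = np.column_stack([np.ones(n), x, z])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Group 1 intercept (b0 + b2) minus group 2 intercept (b0 - b2) = 2*b2:
height_gap = (b0 + b2) - (b0 - b2)
print(round(b1, 2), round(height_gap, 2))   # shared slope, gap between parallel lines
```

Plugging in z = +1 and z = -1 in the fitted model reproduces the two parallel lines of the derivation, 2*b2 apart in height.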

There are many different types of coding schemes for categorical variables. We will investigate them in more detail.

Example from the Gold data set.

Consider a Q-Q plot of one set of residuals against the other.

On to interaction in categorical variables (non-parallel lines).


Broken stick regression.

Useful for systems that may suffer a "shock".

Another application of categorical variables.

Model:

Av(Y|x) = \beta_0 + \beta_1 x + \beta_2 z (x - T)

where z = 0 if x < T and z = 1 if x \geq T , and T is the "breakpoint".

Case 1, x < T, plug in to get

Av(Y|x) = \beta_0 + \beta_1 x + \beta_2 (0)(x - T)

= \beta_0 + \beta_1 x

Case 2, x \geq T , plug in to get

Av(Y|x) = \beta_0 + \beta_1 x + \beta_2 (1)(x - T)

= \beta_0 + \beta_1 x + \beta_2 x - \beta_2 T

= (\beta_0 - \beta_2 T) + (\beta_1 + \beta_2) x

Slope before T is \beta_1 ; slope after T is \beta_1 + \beta_2 .

Therefore \beta_2 measures the change in slope after time T.

Implementation in JMP.

1. Create the categorical variable column z.

2. Create a new column x - T.

3. Create a new column by multiplying column z by column (x - T).

4. Run the regression with column x and the "product column".
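The four steps can be sketched outside JMP as well. This assumes simulated data with a known breakpoint T = 5 and true slopes 1.0 before and 3.0 after (a change of 2.0):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5.0
x = np.linspace(0, 10, 100)
# True model: slope 1.0 before T, slope 1.0 + 2.0 after T
y = 2.0 + 1.0 * x + 2.0 * np.where(x >= T, x - T, 0.0) + rng.normal(scale=0.3, size=100)

z = (x >= T).astype(float)   # step 1: categorical column z
shifted = x - T              # step 2: column x - T
prod = z * shifted           # step 3: product column z*(x - T)

# Step 4: one regression on x and the product column
X = np.column_stack([np.ones_like(x), x, prod])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(round(b1, 2))        # slope before T
print(round(b1 + b2, 2))   # slope after T; b2 is the change in slope
```

The fitted "stick" is continuous at T by construction, since the product column is zero there.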

Technical name: Piecewise linear regression.

Issues: searching for the breakpoint.



Richard Waterman
Mon Sep 8 22:11:51