So if regression doesn't prove causation, what does?

Administrivia:
- Data analysis is due a week today
- Extra office hours by Nuria: 10 - 1 May 26 and 27th.
- Extra office hours by Michael: 10 - 12 Monday May 17th and 10 - 12 on Thursday May 20th.

Getting started on your data analysis

Histogram each variable (check for outliers, get to know your data)
Scatter plot your continuous variables
- look for obvious relationships (age and year)
- look for high correlations
- check Mahalanobis distance for "multidimensional" outliers
Edit data if necessary (you might have found some strange points)
Color code some of the more important categorical variables (pink/blue, say). Look at the previous histograms and scatter plots with the new colors--see if anything striking happens.
Start multiple regression and run full model
- kitchen sink model (throw everything that is reasonable into the model)
- look at leverage plots to see if collinearity is a problem
- look at VIF (variance inflation factor)
Build a reasonable model
- find variables that you can explain
- keep it simple
- You might not find anything significant at all. (Science isn't always fun.)
Check your residuals: linearity via a Y vs Y-hat plot, normality via a histogram and a normal probability plot, heteroskedasticity by looking at residuals vs your important X variables.
Build another model
repeat!
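
A minimal sketch of two of the checks above, Mahalanobis distance and VIF, for the special case of just two continuous variables. The data and the function names are made up for illustration; with more than two variables you would use a matrix library instead of the hand-written 2x2 inverse. With exactly two predictors the VIF reduces to 1 / (1 - r^2), where r is their correlation.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def mahalanobis_2d(xs, ys):
    """Mahalanobis distance of each (x, y) point from the centroid."""
    mx, my = mean(xs), mean(ys)
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^{-1} [dx dy]^T, with the 2x2 inverse written out
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: the last point is ordinary in each variable on its own
# (age 35 and year 2 are both inside the observed ranges) but sits far off
# the age/year trend.
age = [20, 25, 30, 35, 40, 45, 50, 35]
year = [1, 6, 10, 16, 20, 26, 30, 2]

distances = mahalanobis_2d(age, year)
suspect = distances.index(max(distances))  # index of the most unusual point
vif = vif_two_predictors(age, year)
print(suspect, [round(d, 2) for d in distances], round(vif, 2))
```

A histogram of either variable alone would not flag the last point; the distance does, which is the sense in which it is a "multidimensional" outlier.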

Observational studies vs controlled experiments

Observational studies
- An observational study is one where you watch the world around you
- Most of science progresses this way
- Examples: Astrophysics, traditional evolution, economics, sociology
- Correct inferences can be made--but they must be made timidly
- Recall our diagrams--one of those could always be the right story
Controlled experiments
- Manipulate the world and see the result
- The definition of the scientific method
- Examples: chemistry, psychology, microbiology, modern genetics
- Statements of causation can be made with confidence (if randomization is done)
Why are controlled experiments stronger?
- Suppose we observe X --> Y. There might exist a Z that is correlated with X and also correlated with Y.
- Suppose we control X ourselves and see X --> Y. Could there be a Z that is correlated with X? YES! If we don't randomize.
- Example: Stanford heart transplant data. Picked healthy subjects for surgery and frail subjects for the "control group." The healthy subjects did better--sorry, the surgery did better. :-)
- So controlled experiments WITHOUT randomization aren't any better than observational studies
- But if X is assigned by the outcome of a coin toss, then X will be uncorrelated with Z.
- So in our diagram X is unrelated to Z. If X is then correlated with Y, it is on its own merits.
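
The argument above can be checked with a small simulation. This is only a sketch with invented numbers: Z is a "health" score, the treatment X has NO true effect, and Y depends on Z alone. The function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def treatment_gap(assign, n=10_000, seed=0):
    """Mean outcome of treated minus untreated when the true effect is zero."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)            # lurking variable Z
        outcome = health + rng.gauss(0, 1)  # Y depends on Z only, not on X
        if assign(health, rng):
            treated.append(outcome)
        else:
            untreated.append(outcome)
    return mean(treated) - mean(untreated)

def pick_healthy(health, rng):
    """Non-randomized: the experimenter treats the healthier subjects."""
    return health > 0

def coin_toss(health, rng):
    """Randomized: assignment ignores health entirely."""
    return rng.random() < 0.5

biased_gap = treatment_gap(pick_healthy)
randomized_gap = treatment_gap(coin_toss)
print(round(biased_gap, 2), round(randomized_gap, 2))
```

When the experimenter picks the healthy subjects, the useless treatment looks strongly beneficial; under the coin toss the gap is close to zero, because X no longer carries any information about Z.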
These sorts of settings are the domain of ANOVA
- A key advantage of regression is that you can study 20 explanatory variables all at once
- In a simple controlled experiment you can only do 1 explanatory variable at a time--and that variable can only have two levels.
- ANOVA is a way of fixing that
- Randomly assign people into treatment cells of a 2 way table.
- Run regression (aka ANOVA)
- If you find a ROW effect or a COL effect you know that assignment to ROW/COL is random and hence uncorrelated with any possible lurking variable.
- So many row and many column effects can be tested at the same time. This can save lots of time!
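
A rough sketch of the two-way layout, assuming a balanced 2 x 2 design with simulated subjects. It computes the row and column F ratios directly from sums of squares rather than by running a regression, and it stops short of p-values. The data, effect sizes, and function names are invented for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def two_way_anova(cells):
    """Row and column F ratios; cells[r][c] is a list of outcomes.

    Assumes a balanced design (every cell has the same number of subjects).
    """
    n_rows, n_cols = len(cells), len(cells[0])
    n = len(cells[0][0])
    grand = mean([y for row in cells for cell in row for y in cell])
    row_means = [mean([y for cell in row for y in cell]) for row in cells]
    col_means = [mean([y for row in cells for y in row[c]]) for c in range(n_cols)]
    ss_row = n * n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_col = n * n_rows * sum((m - grand) ** 2 for m in col_means)
    # Error: variation of subjects around their own cell mean
    ss_err = sum((y - mean(cell)) ** 2 for row in cells for cell in row for y in cell)
    ms_err = ss_err / (n_rows * n_cols * (n - 1))
    f_row = (ss_row / (n_rows - 1)) / ms_err
    f_col = (ss_col / (n_cols - 1)) / ms_err
    return f_row, f_col

rng = random.Random(1)
subjects = [rng.gauss(0, 1) for _ in range(200)]  # hidden health score per subject
rng.shuffle(subjects)                             # the randomization step

cells = [[[], []], [[], []]]
for i, health in enumerate(subjects):
    r, c = (i // 50) // 2, (i // 50) % 2           # 50 subjects per cell
    # Made-up truth: the ROW treatment adds 2.0, the COL treatment does nothing
    outcome = health + 2.0 * r + rng.gauss(0, 1)
    cells[r][c].append(outcome)

f_row, f_col = two_way_anova(cells)
print(round(f_row, 1), round(f_col, 1))
```

Both treatments are tested on the same 200 subjects, and because cell membership came from the shuffle, a large row F ratio cannot be blamed on the hidden health score.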

Last modified: Thu Apr 13 13:10:42 2000