So if regression doesn't prove causation, what does?
- Administrivia:
- Data analysis is due a week today
- Extra office hours by Nuria: 10 - 1 on May 26th and 27th.
- Extra office hours by Michael: 10 - 12 Monday May 17th
and 10 - 12 on Thursday May 20th.
Getting started on your data analysis
- Histogram each variable (check for outliers, get to know your data)
- Scatter plot your continuous variables
- look for obvious relationships (age and year)
- look for high correlations
- check Mahalanobis distance for "multidimensional" outliers
- Edit data if necessary (you might have found some strange points)
- color code some of the more important categorical variables (pink/blue say).
Look at the previous histograms and scatter plots with the new colors--see if anything striking happens.
- Start multiple regression and run full model
- kitchen sink model (throw everything that is reasonable into the model)
- look at leverage plots to see if collinearity is a problem
- look at VIF
- Build a reasonable model
- find variables that you can explain
- keep it simple
- You might not find anything significant at all. (Science isn't always fun.)
- check your residuals: linearity via a Y vs Y-hat plot, normality via a histogram and a normal probability plot, heteroskedasticity by looking at residuals vs your important X variables.
- Build another model
- repeat!
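Two of the checks in the list above (Mahalanobis distance and VIF) can be sketched in a few lines. This is only a rough illustration using NumPy: the variables `age`, `year` and `income` and all the numbers are made up, and the helper functions `mahalanobis_distances` and `vif` are hypothetical names, not part of any package used in class.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # d_i^2 = (x_i - mean)^T S^{-1} (x_i - mean), computed row by row
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(d2)

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

# Made-up data: year is nearly determined by age, income is unrelated.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
year = 2000 - age + rng.normal(0, 1, n)
income = rng.normal(50, 15, n)
X = np.column_stack([age, year, income])

# Plant one strange point: age and year each look ordinary on their own,
# but together they break the age/year relationship.
X[0] = [50, 1960, 50]

d = mahalanobis_distances(X)
v = vif(X)
print("row with largest Mahalanobis distance:", int(np.argmax(d)))
print("VIFs (age, year, income):", np.round(v, 1))
```

The planted point would not stand out in a histogram of either variable, which is why the multidimensional check is worth doing. The large VIFs for age and year are the numerical version of the "obvious relationship" seen in the scatter plot.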
Observational studies vs controlled experiments
- Observational studies
- An observational study is one where you watch the world around you
- Most of science progresses this way
- Examples: Astrophysics, traditional evolution, economics, sociology
- Correct inferences can be made--but they must be made timidly
- Recall our diagrams--one of those could always be the right story
- Controlled experiments
- Manipulate the world and see the result
- The definition of the scientific method
- Examples: chemistry, psychology, microbiology, modern genetics
- Statements of causation can be made with confidence (if
randomization is done)
- Why are controlled experiments stronger?
- Suppose we observe X --> Y, then there might exist a Z
correlated with X, and Z is also correlated with Y.
- Suppose we control X ourselves and see X --> Y. Could
there be a Z that is correlated with X? YES! If we don't
randomize.
- Example: Stanford heart transplant data. Picked healthy
subjects for surgery and frail subjects for "control
group." The healthy subjects did better--sorry the
surgery did better. :-)
- So controlled experiments WITHOUT randomization aren't
any better than observational studies
- But, if X is randomized from the outcome of a coin toss,
then X will be uncorrelated with Z.
- So our diagram is X is unrelated to Z. So if X is
correlated with Y, it is based on its own merits.
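The argument above can be checked with a small simulation. This is a sketch with made-up numbers, not the Stanford data: `z` plays the lurking variable (baseline health), the treatment is given NO effect on the outcome at all, and we compare picking the healthy subjects for treatment against assigning treatment by coin toss.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

z = rng.normal(size=n)                 # lurking variable, e.g. baseline health
y = 2.0 * z + rng.normal(size=n)       # outcome depends on z only; treatment does nothing

# Controlled but NOT randomized: the healthier half gets the treatment.
x_picked = (z > np.median(z)).astype(int)

# Controlled AND randomized: treatment decided by a coin toss.
x_random = rng.integers(0, 2, size=n)

def diff_in_means(x, y):
    """Average outcome of the treated minus average outcome of the controls."""
    return y[x == 1].mean() - y[x == 0].mean()

effect_picked = diff_in_means(x_picked, y)
effect_random = diff_in_means(x_random, y)
corr_picked = np.corrcoef(x_picked, z)[0, 1]
corr_random = np.corrcoef(x_random, z)[0, 1]

print("apparent effect, hand-picked groups:", round(effect_picked, 2))
print("apparent effect, coin-toss groups:  ", round(effect_random, 2))
print("corr(X, Z) hand-picked:", round(corr_picked, 2))
print("corr(X, Z) coin toss:  ", round(corr_random, 2))
```

With hand-picked groups the useless treatment looks very effective, because X is strongly correlated with Z. With the coin toss, X and Z are close to uncorrelated and the apparent effect is close to zero.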
- These sorts of settings are the domain of ANOVA
- A key advantage of regression is that you can study 20
explanatory variables all at once
- In a simple controlled experiment you can only do 1
explanatory variable at a time--and that variable can
only have two levels.
- ANOVA is a way of fixing that
- Randomly assign people into treatment cells of a 2 way
table.
- Run regression (aka ANOVA)
- If you find a ROW effect or a COL effect you know that
assignment to ROW/COL is random and hence uncorrelated
with any possible lurking variable.
- So many row and many column effects can be tested at the
same time. This can save lots of time!
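As a rough sketch of "run regression (aka ANOVA)": randomly assign subjects to a 2 x 3 table, code the rows and columns as dummy variables, and test each effect by comparing the full regression to one with that effect dropped. Everything here is an assumption for illustration: the data are simulated with a true row effect and no column effect, the model has main effects only (no interaction), and `rss` and `f_stat` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600

# Random assignment to treatment cells of a 2-way table.
row = rng.integers(0, 2, size=n)       # 2 row treatments
col = rng.integers(0, 3, size=n)       # 3 column treatments

# Made-up outcome: a true row effect of 1.0, no column effect.
y = 1.0 * row + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def f_stat(rss_reduced, rss_full, df_num, df_den):
    """F statistic for comparing a reduced model to the full model."""
    return ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)

# Dummy coding: first level of each factor is the baseline.
ones = np.ones(n)
row_d = (row == 1).astype(float)
col_d = np.column_stack([col == 1, col == 2]).astype(float)

full = np.column_stack([ones, row_d, col_d])    # intercept + row + col
no_row = np.column_stack([ones, col_d])         # row effect dropped
no_col = np.column_stack([ones, row_d])         # col effect dropped

df_den = n - full.shape[1]
rss_full = rss(full, y)
f_row = f_stat(rss(no_row, y), rss_full, 1, df_den)
f_col = f_stat(rss(no_col, y), rss_full, 2, df_den)

print("F for ROW effect:", round(f_row, 1))
print("F for COL effect:", round(f_col, 1))
```

Both effects are tested from the same set of subjects. Because the assignment was random, a large F for ROW cannot be blamed on a lurking variable.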
Last modified: Thu Apr 13 13:10:42 2000