Multiple regression
Regression Problem: to draw a straight line through the points that best explains the variance
Regression
Test with F, just like ANOVA:
F = (variance explained by x-variable / df) / (variance still unexplained / df)
Variance explained: change in squared line lengths. Variance unexplained: squared residual line lengths.
In regression, each x-variable will normally have 1 df.
Essentially a cost-benefit analysis: is the benefit in variance explained worth the cost in using up degrees of freedom?
Regression
Also have R^2: the proportion of total variance explained by the variable.
R^2 = variance explained by x-variable / (variance explained by x-variable + variance still unexplained)
[Figure: total variance split into unexplained variance and variance explained by x-variable]
Regression example
Total variance for 32 data points is 300 units. An x-variable is then regressed against the data, accounting for 150 units of variance.
• What is the R^2?
• What is the F ratio?
R^2 = 150/300 = 0.5
F(1,30) = (150/1) / (150/30) = 150/5 = 30
Why is df error = 30?
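The arithmetic of this example can be checked with a short sketch. It assumes the usual bookkeeping for a simple regression: one df for the x-variable and one for the intercept, which is where df error = 32 - 1 - 1 = 30 comes from. The variable names are illustrative only.

```python
# Worked example from the slide: 32 points, total variance 300, explained 150.
n_points = 32
total_variance = 300.0
explained = 150.0
n_predictors = 1

unexplained = total_variance - explained
df_model = n_predictors                      # each x-variable normally has 1 df
df_error = n_points - n_predictors - 1       # assumption: 1 more df goes to the intercept

r_squared = explained / total_variance
f_ratio = (explained / df_model) / (unexplained / df_error)

print("R^2 =", r_squared)
print("df error =", df_error)
print("F =", f_ratio)
```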
Multiple regression
[Figure: herbivore damage vs. tree age, with higher-nutrient and lower-nutrient trees marked]
Damage = m1*age + b
[Figure: herbivore damage vs. tree age; residuals of herbivore damage vs. tree nutrient concentration]
Damage = m1*age + m2*nutrient + b
[Figure: herbivore damage vs. tree age; residuals of herbivore damage vs. tree nutrient concentration]
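No interaction (additive) vs. interaction (non-additive)
[Figure: two panels with y on the vertical axis, one additive and one non-additive]
Interaction model: Damage = m1*age + m2*nutrient + m3*age*nutrient + b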
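A minimal sketch of what the m3 term does to the model above. The coefficient values are made up purely for illustration (they are not from the slides), and `predict_damage` and `age_slope` are hypothetical helper names.

```python
def predict_damage(age, nutrient, m1, m2, m3, b):
    """Damage = m1*age + m2*nutrient + m3*age*nutrient + b."""
    return m1 * age + m2 * nutrient + m3 * age * nutrient + b

def age_slope(nutrient, **coef):
    """Change in predicted damage for one extra unit of age, at a given nutrient level."""
    return predict_damage(11, nutrient, **coef) - predict_damage(10, nutrient, **coef)

# Hypothetical coefficients, chosen only to illustrate the idea.
additive = dict(m1=2.0, m2=-1.0, m3=0.0, b=1.0)      # m3 = 0: no interaction
interacting = dict(m1=2.0, m2=-1.0, m3=0.5, b=1.0)   # m3 != 0: interaction

for name, coef in [("additive", additive), ("interaction", interacting)]:
    slopes = [age_slope(nutrient, **coef) for nutrient in (0, 4)]
    print(name, "age slopes at nutrient 0 and 4:", slopes)
```

With m3 = 0 the effect of age is the same at every nutrient level (parallel lines); with m3 different from zero the slope for age becomes m1 + m3*nutrient, so the lines are no longer parallel.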
Non-linear regression? Just a special case of multiple regression!
X    X^2   Y
1    1     1.1
2    4     2.0
3    9     3.6
4    16    3.1
5    25    5.2
6    36    6.7
7    49    11.3
Y = m1*x + m2*x^2 + b
Y = m1*x1 + m2*x2 + b
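As a sketch of this idea, the table can be fitted by treating x^2 as a second x-variable in an ordinary least-squares fit. The small `ols` helper below is an assumption of this example (a plain normal-equations solver using only the standard library); in practice a numerical library would normally be used instead.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ columns; the intercept is returned last."""
    n = len(y)
    X = [[col[r] for col in columns] + [1.0] for r in range(n)]
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        rhs[c], rhs[p] = rhs[p], rhs[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                rhs[r] -= f * rhs[c]
    return [rhs[i] / A[i][i] for i in range(k)]

# Data from the table on the slide.
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.1, 2.0, 3.6, 3.1, 5.2, 6.7, 11.3]
x_squared = [xi ** 2 for xi in x]        # the "second x-variable"

m1, m2, b = ols([x, x_squared], y)       # Y = m1*x + m2*x^2 + b
fitted = [m1 * xi + m2 * xi ** 2 + b for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
rss_quadratic = sum(r ** 2 for r in residuals)

m_lin, b_lin = ols([x], y)               # straight line, for comparison
rss_linear = sum((yi - (m_lin * xi + b_lin)) ** 2 for xi, yi in zip(x, y))

print("m1, m2, b:", m1, m2, b)
print("residual SS, line vs. quadratic:", rss_linear, rss_quadratic)
```

The model is still linear in its parameters m1, m2 and b, which is why the same machinery applies.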
Jump height (how high ball can be raised off the ground)
[Figure: jump height in feet off ground, axis from 8 to 11]
Total SS = 11.11
X variable          parameter   SS      F(1,13)   p
Height of player    +0.943      9.96    112       <0.0001
X variable          parameter   SS      F(1,13)   p
Weight of player    +0.040      7.92    32        <0.0001
An idea Perhaps if we took two people of identical height, the lighter one might actually jump higher? Excess weight may reduce ability to jump high…
X variable   parameter   SS      F     p
Height       +2.133      9.956   803   <0.0001
Weight       -0.059      1.008   81    <0.0001
[Figure: axis running from lighter to heavier]
Height and weight in the model together:
X variable   parameter   SS      F     p
Height       +2.133      9.956   803   <0.0001
Weight       -0.059      1.008   81    <0.0001
Weight alone:
X variable          parameter   SS      F(1,13)   p
Weight of player    +0.040      7.92    32        <0.0001
• Why did the parameter estimates change?
• Why did the F tests change?
Tall people can jump higher. Heavy people are often tall (tall people are often heavy). People light for their height can jump a bit more.
[Diagram: Height to Jump (+); Height and Weight linked (+); Weight to Jump (-)]
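A small sketch of how this can flip the sign of a parameter. The data below are invented for illustration (they are not the players' data from the slides): weight rises with height, and the built-in effect of weight on jump height is negative. The `ols` helper is an assumed stdlib-only least-squares solver.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ columns; the intercept is returned last."""
    n = len(y)
    X = [[col[r] for col in columns] + [1.0] for r in range(n)]
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        rhs[c], rhs[p] = rhs[p], rhs[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                rhs[r] -= f * rhs[c]
    return [rhs[i] / A[i][i] for i in range(k)]

# Made-up data: 7 heights (feet) crossed with 3 weight deviations (lb).
heights, weights, jumps = [], [], []
for step in range(7):
    h = 5.5 + 0.25 * step
    for deviation in (-5.0, 0.0, 5.0):
        w = 30.0 * h + deviation                 # heavier people tend to be taller
        heights.append(h)
        weights.append(w)
        jumps.append(2.0 * h - 0.05 * w + 5.0)   # built-in weight effect is negative

weight_alone, _ = ols([weights], jumps)
height_coef, weight_coef, _ = ols([heights, weights], jumps)

print("weight alone:      ", weight_alone)
print("height, then weight:", height_coef, weight_coef)
```

On its own, weight stands in for height and gets a positive parameter; once height is in the model, weight only has to explain what height cannot, and its parameter turns negative.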
The problem: the parameter estimate and significance of an x-variable are affected by the x-variables already in the model! How do we know which variables are significant, and in which order to enter them into the model?
Solutions
1) Use a logical order. For example, it makes sense to test the interaction first.
2) Stepwise regression: “tries out” various orders of entering or removing variables.
Stepwise regression
Enters or removes variables in order of significance, and checks after each step whether the significance of other variables has changed.
• Enters one by one: forward stepwise
• Enters all, removes one by one: backwards stepwise
Forward stepwise regression
• Enter the variable with the highest correlation with the y-variable first (if p < p enter).
• Next enter the variable that explains the most residual variation (if p < p enter).
• Remove variables that become insignificant (p > p leave) due to other variables being added.
And so on…
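The entering steps above can be sketched roughly as follows. This is a simplified illustration, not a full stepwise routine: it uses a fixed F-to-enter threshold in place of a p-value (an assumption, so that only the standard library is needed), and it omits the removal step. The data, the `shoe` variable, and all helper names are made up for the example.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ columns; the intercept is returned last."""
    n = len(y)
    X = [[col[r] for col in columns] + [1.0] for r in range(n)]
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for c in range(k):                                    # Gauss-Jordan elimination
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        rhs[c], rhs[p] = rhs[p], rhs[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                rhs[r] -= f * rhs[c]
    return [rhs[i] / A[i][i] for i in range(k)]

def rss(columns, y):
    """Residual sum of squares after regressing y on columns plus an intercept."""
    *slopes, intercept = ols(columns, y)
    total = 0.0
    for r, yi in enumerate(y):
        fit = intercept + sum(m * col[r] for m, col in zip(slopes, columns))
        total += (yi - fit) ** 2
    return total

def forward_select(candidates, y, f_to_enter=4.0):
    """Greedy forward selection; candidates maps variable name -> list of values."""
    selected = []
    current_rss = rss([], y)                        # total SS about the mean
    while len(selected) < len(candidates):
        best = None
        for name in candidates:
            if name in selected:
                continue
            cols = [candidates[s] for s in selected] + [candidates[name]]
            new_rss = rss(cols, y)
            df_error = len(y) - len(cols) - 1       # 1 df per x-variable, 1 for intercept
            if new_rss <= 0:
                f = float("inf")                    # perfect fit
            else:
                f = (current_rss - new_rss) / (new_rss / df_error)
            if best is None or f > best[0]:
                best = (f, name, new_rss)
        if best[0] < f_to_enter:                    # nothing left that is worth its df
            break
        selected.append(best[1])
        current_rss = best[2]
    return selected

# Made-up data: jump depends on height and weight; "shoe" is unrelated by construction.
heights, weights, shoes, jumps = [], [], [], []
for i in range(21):
    h = 5.5 + 0.25 * (i // 3)
    w = 30.0 * h + (-5.0, 0.0, 5.0)[i % 3]
    noise = 0.03 * ((2 * i) % 5 - 2)                # small deterministic "noise"
    heights.append(h)
    weights.append(w)
    shoes.append(float((3 * i) % 7))
    jumps.append(2.0 * h - 0.05 * w + 5.0 + noise)

order = forward_select({"height": heights, "weight": weights, "shoe": shoes}, jumps)
print("order of entry:", order)
```

A fuller version would convert each F to a p-value and re-test the variables already in the model after every entry, removing any whose p rises above p leave.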
General words of caution! • Correlation does not equal causation!
General words of caution!
• Can interpolate between points, but don’t extrapolate (Mark Twain effect)
“In the space of 176 years the lower Mississippi has shortened itself 242 miles. That is an average of a trifle over 1 1/3 miles per year. Therefore, any calm person, who is not blind or idiotic, can see that in the old Oölithic Silurian Period, just a million years ago next November, the Lower Mississippi River was upwards of 1,300,000 miles long, and stuck out over the Gulf of Mexico like a fishing rod.”
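Twain's arithmetic is easy to reproduce, which is the point: the calculation is fine and the extrapolation is absurd. A sketch using only the figures in the quotation:

```python
# Figures quoted by Twain.
miles_lost = 242
years_observed = 176

rate = miles_lost / years_observed          # miles of shortening per year
extrapolated = rate * 1_000_000             # pushed a million years beyond the data

print("miles per year:", rate)
print("extra length a million years ago (miles):", extrapolated)
```

The fitted rate describes 176 years of observations; nothing in the data supports applying it a million years outside that range.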