
Lecture 10: Multiple Regression Hypothesis Testing (II)

This lecture explores multiple regression, dummy variables, multicollinearity, and hypothesis testing using two-sample t-tests in the context of pay equity. The lecture also discusses the use of categorical variables and creating dummy variables.

dratliff


Presentation Transcript


  1. Lecture 10 Multiple Regression Hypothesis Testing (II)

  2. Outline • Multiple Regression • Dummy Variable • Multi-collinearity • Hypothesis Testing (II) - Two sample t-tests • Independent samples • Matched pairs BSB123 2016 Tommy Tang

  3. Multiple regression More than one independent variable: Y = f(X1, X2, X3, …, e), where e is the error term.

  4. Case Study – Pay Equity Recently the Equal Opportunity Commission (EOC) received complaints of pay discrimination against female employees at ABC. The EOC investigated and conducted an independent-samples t-test, which found: Mean salary: F = $77,550; M = $97,650 (p-value reported as 0.000, i.e. p < 0.001). Conclusion?

  5. Case Study – Pay Equity [Figure: Trends in female/male average weekly earnings ratio (full-time), 1984–2008, Australia. Source: ABS]

  6. EOC collects more data* … Education = Total number of years of education. Experience = Total number of years of work experience. * The data are created for illustrative purposes only.

  7. Categorical Variable • A categorical variable is not measured on a quantitative scale, so we cannot use the original categories (e.g. male, female) directly in a regression. • Solution … use dummy variables. • A dummy variable takes only two values: 0 or 1. • “1” means the case is in a particular category and “0” means it is not. • We use one fewer dummy than the number of categories for any categorical variable. • The omitted category becomes the reference category.

  8. Creating dummy variables Example: • Gender [Table: gender dummy] • Season [Table: season dummies] In the above, which is the reference category?
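
The "one fewer dummy than categories" rule can be sketched in a few lines of Python. This is only an illustration: the function name `make_dummies`, the sample values and the choice of reference categories are assumptions, not the lecture's data.

```python
def make_dummies(values, reference):
    """Return (names, rows): one 0/1 dummy per category except the reference."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference {reference!r} not found in data")
    # The reference category gets no column of its own.
    names = [c for c in categories if c != reference]
    rows = [[1 if v == name else 0 for name in names] for v in values]
    return names, rows


# Hypothetical data: gender (2 categories) needs 1 dummy,
# season (4 categories) needs 3 dummies.
gender = ["male", "female", "female", "male"]
season = ["summer", "autumn", "winter", "spring", "summer"]

gender_names, gender_rows = make_dummies(gender, reference="female")
season_names, season_rows = make_dummies(season, reference="winter")

print(gender_names, gender_rows)
print(season_names, season_rows)
```

A case in the reference category is the one whose dummies are all 0, which is why it needs no column.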

  9. The Excel spreadsheet will look like this … We will now conduct a stepwise regression.

  10. Entering Gender only • The Gender (male) premium = 20095 • It is equivalent to a two sample t-test.
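
Why a gender-only regression matches a two-sample comparison can be checked numerically: with a single 0/1 dummy, the least-squares slope equals the difference between the two group means and the intercept equals the mean of the reference group. The sketch below uses made-up salaries (in $000), not the lecture data, and a hypothetical helper `simple_ols`. The equivalence of the significance test itself holds for the equal-variance form of the t-test.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical salaries in $000; gender dummy: 1 = male, 0 = female.
gender = [1, 1, 1, 0, 0, 0, 0]
salary = [95, 100, 99, 80, 75, 78, 79]

b0, b1 = simple_ols(gender, salary)
mean_m = sum(s for g, s in zip(gender, salary) if g == 1) / gender.count(1)
mean_f = sum(s for g, s in zip(gender, salary) if g == 0) / gender.count(0)

print(f"intercept = {b0:.2f}, female mean = {mean_f:.2f}")
print(f"slope = {b1:.2f}, male mean - female mean = {mean_m - mean_f:.2f}")
```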

  11. Entering Gender and Education

  12. Observation Predicted salary = b0 + b1*Gender + b2*Education = -9768 + 20621*Gender + 5735*Educ (p = 0.00 for both Gender and Education) • Each extra year of education adds $5735 to predicted salary, holding gender constant. • Adj R2 = 0.218 • After controlling for education, what can you say about the male advantage?

  13. Graphical representation Predicted salary = -9768 + 20621*Gender + 5735*Educ • Male: ŷ = (-9768 + 20621) + 5735*Educ = 10853 + 5735*Educ • Female: ŷ = -9768 + 5735*Educ • [Plot: Salary against Educ, two parallel lines with intercepts 10853 (male) and -9768 (female)] • Gender (male) premium = 20621
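
The two parallel lines can also be read as a small prediction function. The sketch below uses the coefficients quoted on the slide. The function name `predicted_salary` and the choice of 16 years of education are illustrative assumptions.

```python
# Coefficients quoted on the slide: intercept, gender dummy, education.
B0, B_GENDER, B_EDUC = -9768, 20621, 5735


def predicted_salary(gender, educ_years):
    """Predicted salary; gender is the dummy (1 = male, 0 = female)."""
    return B0 + B_GENDER * gender + B_EDUC * educ_years


# 16 years of education is an arbitrary illustrative value.
male_16 = predicted_salary(1, 16)
female_16 = predicted_salary(0, 16)
gap = male_16 - female_16

print(male_16, female_16, gap)
```

Because the model has no interaction term, the gap between the male and female predictions is the gender coefficient at every education level.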

  14. Gender, Education & Experience

  15. Observation ŷ = b0 + b1*Gender + b2*Education + b3*Experience = -16266 + 5455*Gender + 4594*Educ + 2329*Exp (p = 0.195 for Gender; p = 0.00 for Education and Experience) • Each extra year of education and of experience adds $4594 and $2329 respectively to predicted salary. • Adj R2 improves markedly, from 0.218 to 0.577. • After controlling for education & experience, the male advantage is no longer statistically significant (p-value = 0.195).

  16. Correlation Matrix (Salary, Education, Experience & Age) • What is the correlation between Age and Salary? • Do you expect Age to have significant influence on salary? • Do you expect adding Age will improve model fit?

  17. Gender, Education, Experience & Age

  18. Observation • Does Adj R2 improve? • Gender, Education and Experience remain significant, but not Age. • Why is Age non-significant?

  19. Multicollinearity • Multicollinearity occurs when independent variables are highly correlated with one another. • Consequence: It inflates the standard errors of the regression coefficients of the affected variables, making those coefficients less significant (or insignificant). • Solution: Exclude the offending variable.

  20. Detecting Multicollinearity Note: Detecting multicollinearity is more complex than this example may suggest. Which variable do you think should be excluded?
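
A first-pass screen along the lines of the correlation matrix is to compute the pairwise correlations among the independent variables and flag any pair that is very highly correlated. The sketch below is a simplification under stated assumptions: the data are made up, the helper names are hypothetical, and the 0.8 cut-off is a common rule of thumb rather than something taken from the lecture. As the slide notes, real detection is more involved than a pairwise check.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| above threshold."""
    flagged = []
    for (na, xa), (nb, xb) in combinations(predictors.items(), 2):
        r = pearson_r(xa, xb)
        if abs(r) > threshold:
            flagged.append((na, nb, round(r, 3)))
    return flagged


# Hypothetical data: age rises almost in step with experience.
predictors = {
    "education": [12, 16, 12, 18, 15, 12, 16],
    "experience": [2, 5, 8, 12, 15, 20, 25],
    "age": [25, 27, 31, 35, 36, 43, 47],
}

flagged = flag_collinear(predictors)
print(flagged)
```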

  21. The Final Model ŷ = b0 + b1*Gender + b2*Education + b3*Experience = -16266 + 5455*Gender + 4594*Educ + 2329*Exp

  22. Case Study: Conclusion • After controlling for … the Gender effect is non-significant. • The gender difference could be because females on average have less work experience (10 vs 17 years) and perhaps lower job grades (?). • But then at a deeper level: Why do females have less work experience? • Perhaps this is the real source of gender discrimination, due to factors such as the burden of child bearing. • Perhaps management is not advancing females as quickly as it should, which naturally results in lower job grades & salaries for females.

  23. Why Adjusted R Square? • If the sample size is small, you can improve the model fit (and R2) just by increasing the number of IVs. E.g. if the sample size (n) is 2, one IV (and one DV) will give a perfect fit. If n = 3, two IVs will give a perfect fit … (Notice that in these cases the residual degrees of freedom = 0.) • The additional IVs may add no meaningful information to the model, yet R2 still increases. • The adjusted R2 takes into account both the additional information and the reduced df that an extra IV brings.

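  24. Adjusted R2 • Adjust the sums of squares in R2 by their corresponding degrees of freedom: Adjusted R2 = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)], where n is the sample size and k is the number of IVs. (For reference only.)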
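
The degrees-of-freedom adjustment can be written equivalently as 1 − (1 − R2)(n − 1)/(n − k − 1). The sketch below uses that form to show the penalty for extra IVs. The function name `adjusted_r2` and the figures (R2 of 0.60, n = 30) are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    df_resid = n - k - 1
    if df_resid <= 0:
        raise ValueError("need n > k + 1 so the residual df is positive")
    return 1 - (1 - r2) * (n - 1) / df_resid


# Hypothetical: the same R-squared of 0.60 from 3 IVs versus 10 IVs, n = 30.
adj_3 = adjusted_r2(0.60, n=30, k=3)
adj_10 = adjusted_r2(0.60, n=30, k=10)

print(round(adj_3, 3), round(adj_10, 3))
```

For the same R2, the model that used more IVs gets the lower adjusted R2, which is the point of the adjustment.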

  25. Conducting Multiple Regression Steps: • Create dummy variables for categorical variables • Check for multicollinearity / outliers • Model building • Overall model adequacy • Interpret regression coefficients • (Prediction if required) • Test regression assumptions (Bonus) • Conclusion

  26. Two sample t-tests • Independent samples • Matched pairs

  27. Case Study - Lightbulbs Many people buy the more expensive brand of energy-saving lightbulbs in the belief that they last longer. The Consumer Council tested the lifetimes of a random sample (n = 10) of each of two brands (EXP & CHEEP). Is there any evidence, at the 5% level of significance, that the more expensive Brand E lasts longer on average than Brand C?

  28. One-sample test with known σ (recap) • In Lecture 8, we tested the population mean against a hypothesised value. E.g.: H0: μ = 200 gm, H1: μ ≠ 200 gm • The sample mean (x̄) is the best estimator of μ. • x̄ follows a normal distribution if n is sufficiently large, with mean μ and std error σ/√n. • Test statistic: z = (x̄ − μ) / (σ/√n)
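
As a quick numerical companion to the recap, the sketch below computes the z statistic and a two-sided p-value for a known σ. The function name `z_test` and the sample figures (36 packs averaging 198 gm, σ = 6 gm) are hypothetical. The two-sided alternative is an assumption about the example's H1.

```python
import math


def z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: mu = mu0 when sigma is known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF evaluated at |z| via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical sample: n = 36, sample mean 198 gm, known sigma 6 gm.
z, p = z_test(sample_mean=198, mu0=200, sigma=6, n=36)

print(f"z = {z:.3f}, p-value = {p:.4f}")
```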

  29. Useful rule for Variances • We can show that for any two RVs X & Y, if X & Y are independent: V(X + Y) = V(X) + V(Y) and V(X − Y) = V(X) + V(Y)
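
The less intuitive half of this rule, that the variance of a difference is still a sum, can be verified exactly on a small example. The sketch below enumerates two independent fair dice, an illustrative choice that is not from the lecture, and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)


def variance(outcomes):
    """Population variance of equally likely outcomes, as an exact fraction."""
    outcomes = list(outcomes)
    n = len(outcomes)
    mean = Fraction(sum(outcomes), n)
    return sum((Fraction(o) - mean) ** 2 for o in outcomes) / n


var_x = variance(faces)
# All 36 equally likely (x, y) pairs for two independent dice.
pairs = list(product(faces, faces))
var_sum = variance(x + y for x, y in pairs)
var_diff = variance(x - y for x, y in pairs)

print(var_x, var_sum, var_diff)
```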

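  30. Useful rules about Variances • Recall: Given X ~ N(μ, σ), the sample mean x̄ ~ N(μ, σ/√n). • Now given two independent normal RVs, X1 ~ N(μ1, σ1) and X2 ~ N(μ2, σ2): • (x̄1 − x̄2) ~ N, with mean (μ1 − μ2) and variance σ1²/n1 + σ2²/n2, i.e. std error = √(σ1²/n1 + σ2²/n2).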

  31. Lightbulb • In the lightbulb case study, we compare two population means: μ1 & μ2 (where 1 = EXP, 2 = CHEEP). • (x̄1 − x̄2) is the best estimator of (μ1 − μ2). • In the case study, (x̄1 − x̄2) = 119.3 hrs. • I.e. in the sample, Brand E has a longer lifetime than C by 119.3 hrs on average. • Is the difference of 119.3 strong enough evidence for us to reject H0? H0: μ1 = μ2, H1: μ1 > μ2, i.e. H0: (μ1 − μ2) = 0, H1: (μ1 − μ2) > 0

  32. Lightbulb (cont) • To test if the difference is strong enough evidence to reject H0, we need to know the probability distribution of the RV (x̄1 − x̄2). • If X1 and X2 are normal and σ1 and σ2 are known, then (x̄1 − x̄2) is also normally distributed, with mean (μ1 − μ2) & std error √(σ1²/n1 + σ2²/n2) (refer to slide 30). • Test statistic: z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)

  33. σ1 and σ2 unknown • The population std deviations, σ1 and σ2, are usually unknown. • Substitute σ1 and σ2 with s1 and s2. • The test statistic then follows a t-distribution: t = [(x̄1 − x̄2) − (μ1 − μ2)] / (std error of (x̄1 − x̄2)) • How to obtain the std error of (x̄1 − x̄2)?

  34. If σ1 = σ2 (Equal Variance) • We use the pooled variance method. • The pooled variance: sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) • The test-stat: t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(sp²(1/n1 + 1/n2)) • with df = (n1 + n2 − 2) Calculations not required.
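
Although the calculations are not required, the pooled-variance method is short enough to sketch from summary statistics. The function name `pooled_t` and the sample figures are hypothetical. The formulas are the standard equal-variance two-sample t-test.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2, hypothesised_diff=0.0):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: df-weighted average of the two sample variances.
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_error = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (mean1 - mean2 - hypothesised_diff) / std_error
    return t, df


# Hypothetical summary statistics for two samples of 10.
t, df = pooled_t(mean1=52.0, s1=4.0, n1=10, mean2=48.0, s2=4.0, n2=10)

print(f"t = {t:.3f}, df = {df}")
```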

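  35. If σ1 ≠ σ2 (Unequal Variance) • The std error of the RV (x̄1 − x̄2): √(s1²/n1 + s2²/n2) • The test-stat: t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2) with df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)] Calculations not required.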
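
The unequal-variance case can be sketched the same way. The code below uses the standard Welch approximation for the degrees of freedom. The function name `welch_t` and the summary statistics are hypothetical and are not the lightbulb data.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2, hypothesised_diff=0.0):
    """Unequal-variance two-sample t statistic and approximate df."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2  # variance of each sample mean
    std_error = math.sqrt(v1 + v2)
    t = (mean1 - mean2 - hypothesised_diff) / std_error
    # Welch-Satterthwaite approximation; generally not a whole number.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics with clearly different spreads.
t, df = welch_t(mean1=52.0, s1=3.0, n1=10, mean2=48.0, s2=6.0, n2=10)

print(f"t = {t:.3f}, df = {df:.2f}")
```

When the two sample sizes and the two sample standard deviations are equal, this df reduces to n1 + n2 − 2, the same as the pooled method.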

  36. Case Study: Brand E lasts longer than Brand C? 1 H0: (μ1 − μ2) = 0, H1: (μ1 − μ2) > 0 2 Level of significance, α = 5% 3 Test statistic: t 4 Decision Rule: Reject H0 if p-value < 5% 5 Calculate the p-value: (x̄1 − x̄2) = (2040.3 – 1921) = 119.3; p-value = P[(x̄1 − x̄2) > 119.3 given H0] = 0.234 (23.4%) … from the Excel output, assuming unequal variances. 6 Conclusion: Since p-value > 5%, we do not reject H0, i.e. there is insufficient evidence that brand E …

  37. An important remark H0: (μ1 − μ2) = 0, H1: (μ1 − μ2) > 0 • In the above hypotheses, the hypothesised difference under H0 is 0. • Though a hypothesised difference of zero is most common, we can assume other values as appropriate. • E.g.: If Brand E claims its lifetimes are on average 100 hours longer than Brand C's, then …

  38. Matched pairs

  39. Regular Unleaded or E10? Many people buy the more expensive regular unleaded petrol (even though their cars can run on E10) in the belief that they will get more mileage. A company with a large car fleet wants to test this belief on 10 of its Corollas. Each car is run on regular unleaded and its mileage (km/litre) for the full tank is recorded. The test is then repeated with E10 under identical test conditions. Does regular unleaded get more mileage than E10?

  40. Independent Samples (E10 vs Regular) Excel output: P-value > 0.1 (Do not reject H0). Can we do better than the independent samples t-test? Attempt SSQ8.

  41. Case Study (Free Fruit) QUT institutes a free fruit program for its employees to see if job satisfaction will increase. Employees are asked to respond to a questionnaire before and after the implementation of the program. The scores of a random sample of 10 are presented below: Test if the program improves job satisfaction at the 1% level of significance.

  42. Case Study (Free Fruit) 1 H0: μ = 0, H1: μ > 0 (μ is the population mean of the difference scores) 2 Level of significance, α = 1% 3 Test statistic: t 4 Decision Rule: Reject H0 if p-value < 1% 5 Calculation (RV = x̄) • The difference scores, X. • Sample mean x̄ and s • s of x̄ = s / √n • t-score = ? • P-value = P(t > ?)

  43. Case Study (Free Fruit) x̄ = 8.5 (calculator), s = 7.4722 1 H0: μ = 0, H1: μ > 0 2 Level of significance, α = 1% 3 Test statistic: t 4 Decision Rule: Reject H0 if p-value < 1% 5 Calculate the p-value: RV = x̄ = 8.5; t = 8.5 / (7.4722/√10) = 3.597; p-value = P[x̄ > 8.5 given H0] = P(t > 3.597) = 0.0029 6 Conclusion: Since p-value < 1%, we reject H0, i.e. there is strong evidence that …
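
The t-score and p-value in this case study can be reproduced from the summary figures on the slide (mean difference 8.5, s = 7.4722, n = 10). The sketch below avoids external libraries by integrating the t density numerically with Simpson's rule. The helper names `t_pdf` and `upper_tail_p` are hypothetical, and a statistics package or Excel would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the probability lies above 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3


# Summary figures quoted on the slide.
mean_diff, s, n = 8.5, 7.4722, 10
std_error = s / math.sqrt(n)
t_score = mean_diff / std_error
p_value = upper_tail_p(t_score, df=n - 1)

print(f"t = {t_score:.3f}, one-tailed p-value = {p_value:.4f}")
```

The result agrees with the slide: a t-score of about 3.597 and a one-tailed p-value of about 0.0029, which is below the 1% level of significance.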

  44. A matched pairs t-test is equivalent to a one-sample t-test of the mean difference. [Excel outputs compared: one-sample t-test of the difference scores; matched pairs t-test]

  45. Regular Unleaded or E10? • Conduct a paired t-test to determine if regular unleaded has better fuel economy. • If you reject H0, should you then always use regular unleaded?

  46. One Minute Question
