Lecture 8 Relationships between Scale variables: Regression Analysis



  1. Lecture 8: Relationships between Scale variables: Regression Analysis. Graduate School Quantitative Research Methods. Gwilym Pryce, g.pryce@socsci.gla.ac.uk

  2. Notices: • Register

  3. Plan: • 1. Linear & Non-linear Relationships • 2. Fitting a line using OLS • 3. Inference in Regression • 4. Omitted Variables & R² • 5. Types of Regression Analysis • 6. Properties of OLS Estimates • 7. Assumptions of OLS • 8. Doing Regression in SPSS

  4. 1. Linear & Non-linear relationships between variables • In social science we are often most interested in relationships between variables: • is social class related to political perspective? • is income related to education? • is worker alienation related to job monotony? • We are also interested in the direction of causation, but this is more difficult to establish empirically: • our empirical models are usually structured assuming a particular theory of causation

  5. Relationships between scale variables • The most straightforward way to look for evidence of a relationship is to examine scatter plots (a sketch follows below): • it is traditional to: • put the dependent variable (i.e. the “effect”) on the vertical axis • or “y axis” • put the explanatory variable (i.e. the “cause”) on the horizontal axis • or “x axis”
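
By way of illustration, a minimal sketch of this axis convention in Python with matplotlib (the lecture itself uses SPSS; the education and income figures below are invented, not the lecture's data):

```python
import matplotlib.pyplot as plt

# Hypothetical illustration: income treated as the "effect" (y axis),
# education as the "cause" (x axis); the numbers are made up.
education = [8, 10, 12, 12, 14, 16, 16, 18, 20]  # years of schooling
income = [18, 22, 27, 25, 31, 38, 35, 44, 52]    # income (thousands)

plt.scatter(education, income)
plt.xlabel("Education (explanatory variable, x axis)")
plt.ylabel("Income (dependent variable, y axis)")
plt.title("Income against education")
plt.show()
```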

  6. Scatter plot of IQ and Income:

  7. We would like to find the line of best fit:

  8. What does the output mean?

  9. Sometimes the relationship appears non-linear:

  10. … and so a straight line of best fit is not always very satisfactory:

  11. Could try a quadratic line of best fit:

  12. But we can simulate a non-linear relationship by first transforming one of the variables:

  13. … or a cubic line of best fit (overfitted?):

  14. Or could try two linear lines: a “structural break” (these alternatives are sketched below)
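
A rough sketch of these alternatives in Python with numpy, on invented stand-in data (the lecture's own plots are not reproduced here). The point is that a curve can be fitted either by raising the polynomial degree or by transforming x first:

```python
import numpy as np

# Invented data with a curved (logarithmic) relationship
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 3 * np.log(x) + rng.normal(0, 0.2, size=x.size)

line = np.polyfit(x, y, deg=1)   # straight line of best fit
quad = np.polyfit(x, y, deg=2)   # quadratic line of best fit
cubic = np.polyfit(x, y, deg=3)  # cubic: higher degree risks overfitting

# Transforming x first (here taking logs) lets a straight line capture
# the curve, simulating a non-linear relationship with a linear fit.
slope, intercept = np.polyfit(np.log(x), y, deg=1)
print(f"y = {intercept:.2f} + {slope:.2f} * ln(x)")
```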

  15. 2. Fitting a line using OLS • The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation: min Σ(yi − ŷi)² Where: yi = observed value of y; ŷi = predicted value of yi = the value on the line of best fit corresponding to xi

  16. Regression estimates of a, b: or Ordinary Least Squares (OLS): • This criterion yields estimates of the slope b and y-intercept a of the straight line: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², a = ȳ − b·x̄ (a worked check follows below)
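
As a check on these formulas, a minimal sketch in Python with numpy (the x and y values are invented, not the lecture's house-price sample):

```python
import numpy as np

x = np.array([50.0, 65, 80, 95, 110, 125])   # e.g. floor area
y = np.array([60.0, 75, 85, 105, 118, 130])  # e.g. price (thousands)

# b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2);  a = ybar - b * xbar
xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar

# These values minimise the sum of squared deviations from the line
ssr = np.sum((y - (a + b * x)) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, sum of squared residuals = {ssr:.3f}")
```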

  17. 3. Inference in Regression: Hypothesis tests on the slope coefficient: • Regressions are usually run on samples, so what can we say about the population relationship between x and y? • Repeated samples would yield a range of values for the estimate of b: b ~ N(β, sb) • i.e. b is normally distributed with mean = β = the population slope = the value of b if the regression were run on the population • If there is no relationship in the population between x and y, then β = 0, and this is our H0

  18. What does the standard error mean?

  19. Hypothesis test on b: • (1) H0: β = 0 (i.e. the slope coefficient, if the regression were run on the population, would = 0) H1: β ≠ 0 • (2) α = 0.05 or 0.01 etc. • (3) Reject H0 iff P < α • (N.B. Rule of thumb: P < 0.05 if |tc| ≥ 2, and P < 0.01 if |tc| ≥ 2.6) • (4) Calculate P and conclude.

  20. Example using SPSS output: • (1) H0: no relationship between house price and floor area. H1: there is a relationship • (2), (3), (4): • P = 1 − CDF.T(24.469, 554) = 0.000000, so reject H0
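
The same calculation sketched in Python with scipy (CDF.T is the SPSS function; t.cdf is its scipy counterpart). For the two-sided H1 the tail probability is doubled, though here both versions are effectively zero:

```python
from scipy.stats import t

t_stat, df = 24.469, 554  # t value and degrees of freedom from the output

p_upper = 1 - t.cdf(t_stat, df)          # SPSS: 1 - CDF.T(24.469, 554)
p_two_sided = 2 * t.sf(abs(t_stat), df)  # two-sided version of the test

print(p_upper, p_two_sided)              # both ~0.000000, so reject H0
```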

  21. 4. Omitted Variables & R² • Q: Is floor area the only factor? How much of the variation in Price does it explain?

  22. R-square • R² tells you how much of the variation in y is explained by the explanatory variable x • 0 ≤ R² ≤ 1 (NB: you want R² to be near 1) • If there is more than one explanatory variable, use Adjusted R² (both calculations are sketched below)
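
A minimal sketch of both calculations in Python with numpy (the fit reuses the invented data from the OLS sketch above; k is the number of explanatory variables):

```python
import numpy as np

def r_squared(y, y_hat, k=1):
    """Return R^2 and adjusted R^2 for a fit with k explanatory variables."""
    ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

x = np.array([50.0, 65, 80, 95, 110, 125])
y = np.array([60.0, 75, 85, 105, 118, 130])
b, a = np.polyfit(x, y, deg=1)
print(r_squared(y, a + b * x, k=1))
```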

  23. Example: 2 explanatory variables

  24. Scatter plot (with floor spikes)

  25. 3D Surface Plots: Construction, Price & Unemployment. Q = −246 + 27P − 0.2P² − 73U + 3U²

  26. Construction Equation in a Slump: Q = 315 + 4P − 73U + 5U²
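
Although these surfaces are curved in P and U, they are still linear in their parameters, so OLS can fit them. A sketch in Python with numpy, using invented data generated from the first equation above:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(10, 60, 200)  # price index (invented)
U = rng.uniform(2, 12, 200)   # unemployment rate (invented)
Q = -246 + 27*P - 0.2*P**2 - 73*U + 3*U**2 + rng.normal(0, 30, 200)

# Regress Q on P, P^2, U and U^2: the squared terms are just extra columns
X = np.column_stack([np.ones_like(P), P, P**2, U, U**2])
coefs, *_ = np.linalg.lstsq(X, Q, rcond=None)
print(coefs)  # should come out close to (-246, 27, -0.2, -73, 3)
```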

  27. 5. Types of regression analysis: • Univariate regression: one explanatory variable • what we’ve looked at so far in the above equations • Multivariate regression: >1 explanatory variable • more than one variable on the RHS • Log-linear regression & log-log regression: • taking logs of variables can deal with certain types of non-linearity & has useful properties (e.g. elasticities) • Categorical dependent variable regression: • dependent variable is dichotomous: the observation either has an attribute or not • e.g. MPPI take-up, unemployed or not etc.

  28. 6. Properties of OLS estimators • OLS estimates of the slope and intercept parameters have been shown to be BLUE (provided certain assumptions are met): • Best • Linear • Unbiased • Estimator

  29. “Best” in that they have the minimum variance compared with other linear unbiased estimators (i.e. given repeated samples, the OLS estimates for α and β vary less between samples than any other such sample estimates for α and β). • “Linear” in that a straight-line relationship is assumed. • “Unbiased” because, in repeated samples, the mean of all the estimates achieved will tend towards the population values for α and β. • “Estimator” in that the true values of α and β cannot be known, and so we are using statistical techniques to arrive at the best possible assessment of their values, given the information available.

  30. 7. Assumptions of OLS: For estimation of a and b to be BLUE and for regression inference to be correct: • 1. Equation is correctly specified: • Linear in parameters (can still transform variables) • Contains all relevant variables • Contains no irrelevant variables • Contains no variables with measurement errors • 2. Error Term has zero mean • 3. Error Term has constant variance

  31. 4. Error Term is not autocorrelated • i.e. not correlated with the error term from previous time periods • 5. Explanatory variables are fixed • observe the normal distribution of y for repeated fixed values of x • 6. No linear relationship between RHS variables • i.e. no “multicollinearity” (some diagnostic checks are sketched below)
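
A sketch of standard diagnostic checks for assumptions 2, 4 and 6, in Python with statsmodels (invented data; these are not the lecture's SPSS steps, just one way to check the assumptions):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1 + 2 * x1 - x2 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.resid.mean())          # assumption 2: residual mean should be ~0
print(durbin_watson(fit.resid))  # assumption 4: ~2 suggests no autocorrelation
print([variance_inflation_factor(X, i)  # assumption 6: large VIFs (>10, say)
       for i in range(1, X.shape[1])])  # flag multicollinearity
```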

  32. 8. Doing Regression analysis in SPSS • To run regression analysis in SPSS, click on Analyse, Regression, Linear:

  33. Select your dependent (i.e. ‘explained’) variable and independent (i.e. ‘explanatory’) variables:

  34. e.g. Floor area and bathrooms: Floor area = a + b·(Number of bathrooms) + e
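
The same regression sketched outside SPSS, in Python with statsmodels (the data frame below is an invented stand-in for the lecture's housing dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "floor_area": [55, 70, 68, 95, 120, 130, 150, 180],  # invented values
    "bathrooms": [1, 1, 1, 2, 2, 2, 3, 3],
})

# Floor area = a + b * (number of bathrooms) + e
fit = smf.ols("floor_area ~ bathrooms", data=df).fit()
print(fit.summary())  # reports a, b, standard errors, t, P and R^2
```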

  35. Confidence Intervals for regression coefficients • Population slope coefficient CI: b ± t* × SE(b), where t* is the critical t value for the chosen confidence level • Rule of thumb (95% CI): b ± 2 × SE(b)

  36. e.g. regression of floor area on number of bathrooms, CI on slope: b = 64.6 ± 2 × 3.8 = 64.6 ± 7.6, so the 95% CI = (57, 72)
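
The arithmetic sketched in Python with scipy; b and its standard error are the slide's figures, but the degrees of freedom (554) are assumed here for illustration, since the exact value depends on the sample size:

```python
from scipy.stats import t

b, se, df = 64.6, 3.8, 554     # df assumed for illustration

print(b - 2 * se, b + 2 * se)  # rule of thumb: roughly (57, 72)

t_crit = t.ppf(0.975, df)      # exact 95% critical value (~1.96 here)
print(b - t_crit * se, b + t_crit * se)
```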

  37. Confidence Intervals in SPSS: Analyse, Regression, Linear, click on Statistics and select Confidence intervals:

  38. Our rule of thumb said the 95% CI for the slope = (57, 72). How does this compare?

  39. Past Paper: (C2) Relationships (30%) • Suppose you have a theory that suggests that time watching TV is determined by gregariousness • the less gregarious, the more time spent watching TV • Use a random sample of 60 observations from the TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender. • Carefully interpret the output from this model and discuss the statistical robustness of the results.

  40. Reading: • Regression Analysis: • *Field, A., chapters on regression. • *Moore, D. and McCabe, G., chapters on regression. • Kennedy, P., A Guide to Econometrics. • Bryman, A. and Cramer, D. (1999) Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists, Chapters 9 and 10. • Achen, C. H. (1982) Interpreting and Using Regression, London: Sage.
