Regression Analysis

Regression Analysis • In dealing with problems in social sciences, business, or economics, we are often interested in determining whether a noticeable relationship exists between two or more variables. • Do SAT scores predict college performance? Does blood pressure level predict life expectancy? Does reading statistics texts make you a better person? (Absolutely!) Does advertising increase sales? ... • In each of these cases, the following questions arise: • Is there enough reason to suspect that a mathematical relationship exists between these variables? • If there is one, how strong is it? • If our intention is to use the results to make predictions, how reliable are such predictions? • In general, we will use Regression Analysis to estimate a function f(X) that describes the relationship between a continuous dependent variable and one or more independent (explanatory) variables: Y = f(X1, X2, X3, …, Xn) + e, where e is a random error term.

A “simple” example • Consider the first example (see the data file “reg-examples.xls”, sheet “Height-Weight”). • Whenever possible, plot and observe the data. • The scatter plot shows a linear relation between height and weight, so the data suggest the following regression model, which describes the true relationship across the entire population of height and weight values: Y = β0 + β1 X + e • The estimated regression function (based on our sample) will be represented as: Ŷ = b0 + b1 X

Method of least squares • Determining the best fit (line). • Minimize the “error sum of squares (ESS)”, a.k.a. the “sum of squared errors (SSE)”: the sum, over all observations, of (actual Y minus estimated Y) squared. • Select the values for b0 and b1 that minimize the SSE.
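A minimal sketch of the least squares idea for one independent variable, using the standard closed-form solution. The function names and the five data points are made up for illustration; they are not taken from the “reg-examples.xls” workbook.

```python
def least_squares_line(x, y):
    """Return (b0, b1) that minimize the SSE for the line b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared errors for a candidate line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not from the course workbook).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x, y)
best_sse = sse(x, y, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {best_sse:.3f}")
```

Any other choice of b0 and b1 gives a larger SSE for the same data, which is what “best fit” means here.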

Multiple regression • Most regression problems involve more than one independent variable. • If each independent variable varies in a linear manner with Y, the estimated regression function in this case is: Ŷ = b0 + b1 X1 + b2 X2 + … + bk Xk • The optimal values for the bi can again be found using the least squares method.
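The same least squares fit with several independent variables can be sketched with NumPy's `numpy.linalg.lstsq`. The data below are invented and noise-free (generated from Y = 1 + 2·X1 + 3·X2), so the recovered coefficients are known in advance; with real data the fit would only be approximate.

```python
import numpy as np

# Hypothetical data: two independent variables, five observations.
X = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
    [2.0, 2.0],
    [3.0, 1.0],
    [4.0, 3.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that b[0] plays the role of the intercept b0.
A = np.column_stack([np.ones(len(X)), X])

# Least squares solution of A @ b ≈ y.
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print("b0, b1, b2 =", np.round(b, 6))

# Predict Y for a new observation with X1 = 5, X2 = 2.
y_new = float(np.array([1.0, 5.0, 2.0]) @ b)
print("prediction:", round(y_new, 6))
```

The column of ones is a design choice: it lets the intercept be estimated together with the slopes in a single solve.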

Multiple Regression Example • The real estate example, adapted from the text and embellished by adding two more independent variables (data): • Yard size in acres and Metro access (light rail, etc.) • Refer to the “reg-example.xls” file, the 2nd sheet: “MultipleReg” • The charge: • Develop a statistically significant regression model to estimate “home prices” by following the “backwards stepwise” procedure on the next slide. • Use the model to predict home prices for two homes with these specs: • 2100 sq. ft., 2-car garage, 3 bedrooms, and 0.33 acre yard, not close to metro • 2000 sq. ft., 1-car garage, 3 bedrooms, and 0.25 acre yard, close to metro

Steps in regression analysis • Step 1: Hypothesize the form of the (linear) model. • Step 2: Estimate the unknown parameters β1, β2, β3, ..., βk. • Step 3: Check the usefulness of the model. The hypotheses for testing whether a general linear model is useful in predicting Y are H0: β1 = β2 = β3 = ... = βk = 0 versus HA: at least one of the β parameters in H0 is nonzero. The test statistic is the F-statistic = MSR / MSE. If the model is deemed adequate (it passes the F-test, i.e., H0 is rejected), go to step 4 (else have fun!). • Step 4: Conduct t-tests (significance tests) on the β parameters (slopes). • Step 5: Remove the most insignificant independent variable (parameter), run the regression again, and go to step 4. • Step 6: Repeat steps 4 and 5 until all remaining independent variable parameters (slopes) are significant, then go to step 7. • Step 7: If the intercept (β0) is insignificant, remove it and run the regression one more time.
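The backward stepwise loop can be sketched as follows. This is an illustration under stated assumptions, not the procedure as run in Excel: the helper names `fit_ols` and `backward_stepwise` are hypothetical, the data are synthetic, and significance is judged by comparing |t| with a rough cutoff of 2.0 instead of looking up exact p-values. The intercept check in the last step is left out.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares fit with intercept.

    Returns (coefficients, t statistics, F statistic); index 0 is the intercept.
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    msr = (sst - sse) / k            # mean square due to regression
    mse = sse / (n - k - 1)          # mean squared error
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    return b, b / se, msr / mse


def backward_stepwise(X, y, names, t_crit=2.0):
    """Drop the least significant slope until every remaining |t| >= t_crit."""
    names = list(names)
    while names:
        _, t, _ = fit_ols(X, y)
        slope_t = np.abs(t[1:])              # skip the intercept
        worst = int(np.argmin(slope_t))
        if slope_t[worst] >= t_crit:
            break                            # all remaining slopes look significant
        X = np.delete(X, worst, axis=1)      # remove that variable and refit
        names.pop(worst)
    return names


# Synthetic data: x1 matters, x2 has almost no effect, noise is small.
x1 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
x2 = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
noise = 0.1 * np.array([1, -1, -1, 1, -1, 1, 1, -1], dtype=float)
y = 3.0 + 2.0 * x1 + 0.01 * x2 + noise
X = np.column_stack([x1, x2])

coef, t_stats, f_stat = fit_ols(X, y)
kept = backward_stepwise(X, y, ["x1", "x2"])
print("F =", round(f_stat, 1), "| t =", np.round(t_stats, 2), "| kept:", kept)
```

In practice the F and t statistics would be compared with critical values (or p-values) for the chosen significance level and the actual degrees of freedom; the fixed cutoff here only keeps the sketch dependency-free.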

The adjusted R² statistic • The (adjusted) R² statistic indicates how well an estimated regression function fits the data. • 0 <= R² <= 1 • It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation. • The original “unadjusted” R² never decreases when an independent variable is added, so it can be artificially inflated by adding any independent variable to the model. • Adjusted R² penalizes extra variables, so it can be used to tell whether adding a variable really helps to improve the model.
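For reference, the usual formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. A small sketch, with made-up R² values, of how a tiny gain in R² can still lower the adjusted value:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical numbers: a 4th variable raises R-squared only from 0.900 to 0.901.
three_vars = adjusted_r2(0.900, n=20, k=3)
four_vars = adjusted_r2(0.901, n=20, k=4)
print(round(three_vars, 5), round(four_vars, 5))
```

Here the adjusted value drops when the fourth variable is added, which suggests the extra variable is not earning its place in the model.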

Categorical Variables in Regression Models: Dummy Coding • One way to model categorical variables in regression is to re-code them using dummy (binary) variables. • In dummy coding, one group (category) is treated as the reference group (also known as the control group), and new dummy (binary) variables are created to identify which category (group) each of the other observations is in. • For example, a gender variable (M, F) is typically coded as 0/1. • Other examples of two-category variables are: • House has a pool (Y, N), coded as 1 or 0. • Has an MBA degree (Y, N), coded as 1 or 0.

Categorical Variables in Regression Models: Dummy Coding when k > 2 • When the categorical variable has k possible values (categories), where k > 2, we need k-1 binary variables to model the possible outcomes. • For example, “marital status” may have 4 categories (Single, M, D, W) and hence requires 3 dummy (binary) variables to model it. • Another example is “school district”: 3 districts (A, B, C) can be coded using 2 binary variables. • Consider the expanded real estate example…Excel time • Other examples: see other sheets Alumni, AnoterEx, … • Develop regression models via the “backward stepwise” approach
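A minimal sketch of k-1 dummy coding for the school district example. The helper `dummy_code` and the five sample observations are hypothetical; district A is chosen arbitrarily as the reference group.

```python
def dummy_code(values, reference):
    """Return (column_names, rows) using k-1 binary columns.

    Observations in the reference category get 0 in every column.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference {reference!r} not found in the data")
    others = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in others] for v in values]
    return others, rows


# Hypothetical observations of "school district" with 3 categories.
districts = ["A", "B", "C", "A", "C"]
columns, coded = dummy_code(districts, reference="A")

print(columns)          # ['B', 'C']
for district, row in zip(districts, coded):
    print(district, row)
```

Only two columns are needed because a row of zeros already identifies district A; adding a third column would make the columns perfectly collinear with the intercept.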

Example 2 • A mental health agency measured the self-esteem score for randomly selected individuals with disabilities who were involved in some work activity within the past year. The sheet “Self Esteem” provides the data, including the individuals' marital status, length of work, type of support received (direct support includes job-related services such as job coaching and counseling), education, and age. Use multiple linear regression for predicting self-esteem as a function of the other variables.

Assumptions of Regression Analysis • Key assumptions are linearity and error terms that are independent and normally distributed around zero with constant variance. • Checking these is relatively easy using: • Residual plots: residuals should be randomly scattered above and below zero. • When a pattern (e.g., curvilinear) is observed, the form of the equation is likely to be wrong. Solution: try a nonlinear version of that independent variable (e.g., X²). • Similarly, if the error terms show a pattern of unequal variances (increasing/decreasing), transforming the dependent variable using the square root or log base 10 often resolves the issue. • Normal probability plot: should be approximately linear. • When this assumption is clearly violated, transforming the dependent variable using the square root or log base 10 again often resolves the issue.
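The residual check can be illustrated numerically. In this sketch the data are made up so that Y is exactly X²: a straight-line fit leaves a U-shaped residual pattern, and regressing on X² instead removes it. The helper names are hypothetical.

```python
def fit_line(x, y):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y):
    """Actual minus fitted values for the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical curvilinear data: Y = X squared.
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]

# Straight-line fit: residuals are positive, then negative, then positive.
linear_resid = residuals(x, y)
print([round(r, 6) for r in linear_resid])   # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]

# Use X squared as the independent variable instead: the pattern disappears.
x_squared = [xi ** 2 for xi in x]
squared_resid = residuals(x_squared, y)
print(max(abs(r) for r in squared_resid) < 1e-9)   # True
```

With real data the residuals would not vanish after the transformation; the goal is only that they scatter randomly above and below zero.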

Data types • Nominal data has no order, and the assignment of numbers to categories is purely arbitrary (e.g., 1=Male, 2=Female, etc.). Because of lack of order or equal intervals, one cannot perform arithmetic (+, -, /, *) or logical operations (>, <, =) on nominal data. • Ordinal data has order, but the intervals between scale points may be uneven. Rank data are usually (see below) ordinal, as in students' rank in class. The interval distance from the top student to the second-highest student may be great, but the interval from the second-ranked student to the third-ranked may be very close. Because of lack of equal distances, arithmetic operations are impossible with ordinal data, which are restricted to logical operations (more than, less than, equal to). For instance, given a person of rank 50 and a person of rank 25 in a school class of 100, where rank 100 is highest achievement, one cannot divide 25 into 50 to conclude that the first person has twice the achievement of the second. However, one can say the first person represents more achievement than the second person. • Interval data has order and equal intervals. Counts are interval, such as counts of income, years of education, or number of Democratic votes. Ratio data are interval data which also have a true zero point. Temperature is not ratio because zero degrees is not "no temperature," but income is ratio because zero dollars is truly "no income." For most statistical procedures the distinction between interval and ratio does not matter and it is common to use the term "interval" to refer to ratio data as well. Occasionally, however, the distinction between interval and ratio becomes important. With interval data, one can perform logical operations, add, and subtract, but one cannot multiply or divide. • Source: NCSU, Dr. Dave Garson
