MATH 1107 Elementary Statistics

MATH 1107 Elementary Statistics Lecture 7 Regression Analysis

MATH 1107 – Regression Analysis

Regression analysis is arguably the most heavily used tool in statistical modeling, because it enables you to predict or explain a dependent variable based on one or more independent variables. Regression analysis is used in almost every industry.

For example:
• If you were a sports agent, how would you propose a “reasonable” contract salary for your client?
• If you are interested in selling your house, how can you determine an appropriate market price?
• If you are the head of the admissions department in a university, how do you decide who gets accepted?
• If you are an investment banker, how do you decide which funds to hold in your portfolio?

The quantity being decided in each example above (the contract salary, the market price, who gets accepted, which funds to hold) is the dependent variable. What independent variables might we use to predict or explain each of these dependent variables?

The first step in predicting or explaining a dependent variable using an independent variable* is to evaluate the correlation between the two variables using a scatterplot. Let's return to Median Household Income and Death Rate…

* Although many independent variables can be used in regression analysis, in these notes we will use only one.


• Using the =CORREL(array1, array2) function in Excel, we can determine that the correlation between Median Income and Death Rate is -0.61.
• This indicates three things:
  • The relationship is fairly strong: the value of -0.61 is closer to -1 than it is to 0.
  • The direction is negative (inverse), meaning that as one variable goes up, the other tends to go down.
  • The R² value of a predictive regression equation using these two variables is 0.37, because with a single independent variable R² is the square of the correlation: (-0.61)² ≈ 0.37.
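For readers working outside Excel, here is a minimal Python sketch of the same calculation. The function name `correl` is our own choice, and the five data points are hypothetical illustrative values, not the state data used in these notes.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL illustrative values, NOT the lecture's state dataset.
median_income = [30000, 35000, 40000, 45000, 50000]
death_rate = [9.5, 8.0, 9.0, 7.5, 7.0]

r = correl(median_income, death_rate)
r_squared = r ** 2  # with one independent variable, R^2 is r squared
print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")

# The same squaring step applied to the correlation reported in the notes:
print(round((-0.61) ** 2, 2))  # 0.37
```

A negative `r` signals the inverse relationship described above, and squaring it gives the share of variation in the dependent variable that the line accounts for.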

Since the correlation is reasonably strong, we can use these two variables to create a linear model. This linear model:
• will have an equation of the form y = mx + b;
• will be the “best fit” of the data;
• will minimize the sum of the squared vertical distances between the “actual” data points and the “predicted” points (each such distance is called a “residual”);
• will enable us to predict the death rates in other states that were NOT included in the original dataset.
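The “best fit” idea can be sketched in a few lines of Python using the standard least-squares formulas for m and b. This is only an illustration: the function name `fit_line` is our own, and the data are hypothetical values, not the state dataset from these notes.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # intercept: the line passes through the means
    return m, b

# HYPOTHETICAL illustrative values, NOT the lecture's state dataset.
incomes = [30000, 35000, 40000, 45000, 50000]
death_rates = [9.5, 8.0, 9.0, 7.5, 7.0]

m, b = fit_line(incomes, death_rates)

# Residual = actual value minus the value the line predicts.
residuals = [y - (m * x + b) for x, y in zip(incomes, death_rates)]
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimizes

print(f"y = {m:.5f}x + {b:.3f}")
print(f"sum of squared residuals = {sse:.4f}")
```

Any other choice of slope or intercept would give a larger sum of squared residuals on the same data, which is what “best fit” means here.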

From this analysis, the best fit line is:

y = -0.0002x + 13.255

This equation was provided by Excel (tick the “Display Equation on Chart” option under the “Add Trendline” function). A better way to represent this equation is:

State Death Rate = (-0.0002 * Median State Income) + 13.255

Let's interpret these values directly: -0.0002 is the slope of the line. It can be read as “For every one dollar of additional median income, the predicted death rate decreases by 0.0002.” For example, a $10,000 increase in median income corresponds to a predicted decrease of 0.0002 * 10,000 = 2 in the death rate. In general, the slope tells you how the dependent variable changes with a one-unit change in the independent variable.

Next, 13.255 is the y-intercept. Algebraically, this is the point at which the line crosses the y-axis, where the x-value is 0. Since it is not reasonable to have a state with a median income of 0, the intercept is not interpreted directly.

Now, using the model we developed, predict the death rates for the states below:
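As a sketch of how such predictions can be computed, the Python snippet below plugs income values into the fitted equation y = -0.0002x + 13.255. The state names and incomes are hypothetical placeholders, and the function name `predict_death_rate` is our own.

```python
SLOPE = -0.0002      # from the fitted equation in these notes
INTERCEPT = 13.255   # from the fitted equation in these notes

def predict_death_rate(median_income):
    """Predicted death rate for a given median income, using y = mx + b."""
    return SLOPE * median_income + INTERCEPT

# HYPOTHETICAL states and incomes, for illustration only.
hypothetical_incomes = {
    "State A": 35000,
    "State B": 42500,
    "State C": 50000,
}

for state, income in hypothetical_incomes.items():
    print(f"{state}: predicted death rate = {predict_death_rate(income):.3f}")
```

For instance, a median income of 35,000 gives 13.255 - (0.0002 * 35,000) = 13.255 - 7 = 6.255.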

Now, let's determine our “residuals”, or how far off we were for each prediction.
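A residual is conventionally computed as the actual value minus the predicted value. The Python sketch below follows that convention with the fitted equation from these notes. The observations are hypothetical placeholders, not real state figures, and the function name `residual` is our own.

```python
SLOPE = -0.0002      # from the fitted equation in these notes
INTERCEPT = 13.255   # from the fitted equation in these notes

def residual(median_income, actual_death_rate):
    """Actual minus predicted death rate (positive means we under-predicted)."""
    predicted = SLOPE * median_income + INTERCEPT
    return actual_death_rate - predicted

# HYPOTHETICAL (state, median income, actual death rate) values.
observations = [
    ("State A", 35000, 7.1),
    ("State B", 42500, 4.2),
    ("State C", 50000, 3.9),
]

residuals = {name: residual(income, actual) for name, income, actual in observations}

for name, value in residuals.items():
    print(f"{name}: residual = {value:+.3f}")
```

A positive residual means the actual death rate was higher than the model predicted, and a negative residual means it was lower.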
