Regression

Learning Objectives
By the end of this lecture, you should be able to:
• Describe what is meant by regression, describe correlation, and explain the relationship between the two.
• Generate a regression model both by using a calculator to calculate b0 and b1, and by using statistical software such as SPSS.
• Describe why extrapolation can lead to misleading conclusions.
• Work with categorical variables, selecting the best tool for the job when comparing categories.

Linear Regression - Overview
• Once we have convinced ourselves that a linear relationship does in fact exist between two variables, and that the relationship is causal (more on this later), we have a terrific tool for making predictions.
• For example, we can predict the blood alcohol level based on the number of beers consumed.
• We can predict by using the graph line itself, but it can be difficult to estimate exactly where on the axis a line falls. An even more useful way is to take that line and turn it into a formula. This technique is called ‘regression’.
• For example: once we have a formula, we can simply plug in the value for the number of beers and the formula will tell us the predicted BAC level.

Developing a more precise model
• If you were asked for the blood alcohol level (BAC) for 2.5 beers, you would have to estimate both the location of 2.5 on the x-axis and the quantity of BAC on the y-axis. Your prediction would be imprecise.
• Answer = 0.028 ???? (estimated by eye from the graph)
• So the next step is to generate a formula from our model which will give us more precise predictions. Ultimately, we will come up with the model: BAC = -0.013 + 0.018 * num_beers

Is regression the appropriate tool for the job?
• Reminder: A key point of this course is to recognize when you can and can NOT use a statistical tool.
• This is one of those times: It is VERY important that you recognize when a regression model is NOT an appropriate tool.
• Before doing a regression analysis, the following must ALL be true:
  • The relationship is linear
  • ‘r’ is not very weak (i.e., is not too close to 0)
  • The relationship is ‘causal’ (important, but more on this later…)

Summary on using correlation to build a model
• The purpose of all of this (taking data, graphing it, looking for correlation, and generating a regression line) is to generate a model (a formula) that allows us to infer information about the population and/or to make predictions.
• E.g.: If we give someone 6.5 beers, what do we think their BAC is likely to be?
Steps
• Obtain data
  • E.g.: Do a study where you take a group of people, determine how many beers they drank, and then measure their BAC.
• Graph that data on a scatterplot.
• If you believe there is a correlation, draw a regression line (we’ll use software for this step).
• From that regression line, generate a formula (a model).

The Regression Model
When dealing with “simple linear regression” (the only regression model we will deal with in this course), the formula generated from the model will be in the form ‘y = mx + b’ that many of you probably encountered in high school. The only difference is that we will use more “statistically appropriate” letters and symbols. You will need to know these (sorry)! Different people use different symbols. In this course, we will use:
• b0 to refer to the intercept (what you probably called ‘b’).
• b1 to refer to the slope (what you probably called ‘m’).

The Regression Model
• b0 refers to the intercept (what you probably called ‘b’). The intercept is where the regression line crosses the y-axis.
• On this graph it is about -0.013
• b1 refers to the slope (what you probably called ‘m’). The slope refers to the ‘angle’ of the line: how much the predicted y changes when x increases by 1.

# Beers vs BAC – The regression model
Let’s take the generic model and apply it to our # beers vs. BAC study:
BAC = b0 + b1 * # of beers
The trick is to find out what the values for ‘b0’ and ‘b1’ are.

How to calculate b0 and b1 (Good news: It’s actually pretty easy!)
First we calculate the slope of the line, b1:
b1 = r * (sy / sx)
r is the correlation. sy is the standard deviation of the response variable y. sx is the standard deviation of the explanatory variable x.
Once we know the slope (b1), we can easily calculate the y-intercept (b0):
b0 = ȳ - b1 * x̄
where x̄ and ȳ are the sample means of the x and y variables.
Important: You WILL be asked to do these calculations. And I hope you agree the calculations themselves are quite easy. In addition, I will give you the formulas on a cheat-sheet during your exams. HOWEVER: The key is for you to recognize when they can (and can NOT!) be used.
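If you want to check your calculator work, the slope formula b1 = r * (sy / sx) and the intercept formula b0 = ȳ - b1 * x̄ can be sketched in a few lines of Python. This is only an illustration: the function name and the summary statistics passed in are made up for the example and are not taken from any course dataset.

```python
def slope_and_intercept(r, s_x, s_y, mean_x, mean_y):
    """Return (b0, b1) for the regression line y^ = b0 + b1 * x.

    r       correlation between x and y
    s_x     standard deviation of the explanatory variable x
    s_y     standard deviation of the response variable y
    mean_x  sample mean of x
    mean_y  sample mean of y
    """
    if s_x <= 0:
        raise ValueError("s_x must be positive")
    b1 = r * s_y / s_x          # slope first
    b0 = mean_y - b1 * mean_x   # then the intercept, using the slope
    return b0, b1


# Hypothetical summary statistics, chosen only so the arithmetic is easy to follow.
b0, b1 = slope_and_intercept(r=0.5, s_x=2.0, s_y=4.0, mean_x=3.0, mean_y=10.0)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # b0 = 7.000, b1 = 1.000
```

Notice the order: the slope has to be calculated first, because the intercept formula uses it.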

Variable names FYI, not all calculators and software use the same variables! For example, some use: And some use: Make sure you know the variables used by YOUR software/calculator before you answer homework or exam questions!

What’s up with the hat??
Gas Consumption^ = b0 + b1 * Heating
The hat (^) is a symbol that tells us that this result is a predicted value as calculated using the regression line model, as opposed to a value that comes from the original data (observed data). For example, look at the (tiny) purple dot for x=24. This dot was one of our original datapoints that says that on a 24 degree day, the average gas consumption was about 6.4. So 6.4 is the observed result from our data. However, the value predicted by the regression model is somewhere around 5.6. Similarly, for x=26, y=5.3 but y^=6.0. Again, these are symbols I want you to be comfortable with.

What’s up with the hat??
Gas Consumption^ = b0 + b1 * Heating
So if for a heating value of 24, I say: ‘Gas Consumption’ = 6.4, then I am saying that this particular result came from the observed data. (That is, data that was collected somewhere along the way.)
If, however, for a heating value of 24, I say: ‘Gas Consumption^’ = 5.6, then I am saying that this particular result is predicted from a regression model. (I.e., as opposed to an observed value that was collected somewhere.)

Example using SPSS
• Let’s use the software to generate a regression model for the beers vs. blood alcohol level example discussed earlier.
• In SPSS, open beer_bac.sav (you can find this file in the datasets on the class webpage).
• To generate the graph: Graphs >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter. Click ‘Define’.
• Remember to always place your explanatory variable (in this case, the number of beers variable) on the x-axis and your response variable (in this case, the bac variable) on the y-axis. You can click on the variable and click the arrow to move it into the appropriate field. Click ‘OK’.
• Also remember that it is very important that you do not confuse the explanatory and response variables!

Example using SPSS contd
• A new window will open showing your scatterplot and some additional information.
• Generate the regression line: open the Chart Editor (double-click on the plot), then choose the icon for ‘Add Fit Line’.
• You will see a ‘Properties’ window open up. Choose ‘Linear’. Then close the Properties window.
• To calculate the parameters: SPSS will also calculate b0 and b1 for you.
• Close the ‘Chart Editor’ window and return to the output window.
• Click: Analyze >> Regression >> Linear. “Dependent” refers to the response variable. “Independent” refers to the explanatory variable. (Recall that these terms are a bit flawed, but as I mentioned earlier, they are still sometimes used.) Click ‘OK’.
• Under ‘Coefficients’: the first value under the ‘B’ column is b0 (the intercept). The value below b0 is b1 (the slope).
• We will talk about the ‘Model Summary’ table later.
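Behind the menus, the software is doing the same calculation you would do by hand. As a rough sketch of that calculation (in Python rather than SPSS), the code below goes from raw data to r, b1 and b0. The function name fit_line is just for illustration, and the numbers are made up: they are NOT the contents of beer_bac.sav.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Return (b0, b1, r) for the regression line through the points (xs, ys)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    if s_x == 0 or s_y == 0:
        raise ValueError("x and y must each vary")
    # Correlation: sum of products of standardized values, divided by n - 1.
    r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x         # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1, r


# Made-up illustration data (explanatory variable first, response second).
beers = [1, 2, 3, 4, 5]
bac = [0.01, 0.03, 0.04, 0.06, 0.07]

b0, b1, r = fit_line(beers, bac)
print(f"BAC^ = {b0:.3f} + {b1:.3f} * num_beers")  # BAC^ = -0.003 + 0.015 * num_beers
```

Swapping the two lists would fit a different line, which is one more reason not to confuse the explanatory and response variables.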

The graph generated by SPSS:
Note the R² value of 0.8 that SPSS provides with the graph. If you take the square root of this value, you get the size of ‘r’ (the sign of r is the same as the sign of the slope). This gives us an r of about 0.89. From this we can say that we have a pretty strong, positive correlation between # of beers and BAC level.
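As a small sketch of that last step, here is one way to recover r from R² in Python. The square root only gives the size of r, so the sketch takes the sign from the slope of the line; the function name is hypothetical, and the slope value 0.018 is the one from our beer / BAC model.

```python
from math import sqrt


def r_from_r_squared(r_squared, slope):
    """Recover the correlation r from R^2, using the slope to decide the sign."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must be between 0 and 1")
    magnitude = sqrt(r_squared)
    # A line that slopes upward means a positive correlation, and vice versa.
    return magnitude if slope >= 0 else -magnitude


r = r_from_r_squared(0.8, slope=0.018)
print(round(r, 2))  # 0.89
```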

SPSS’ Coefficients table tells us b0 and b1
In the table, the first value is b0 (y-intercept) and the value below it is b1 (slope).
Regression Model: BAC^ = -0.013 + 0.018 * num_beers

Example of Minitab output (labels on the figure: intercept, slope, R²)

Example of Excel output (labels on the figure: r, R², intercept, slope)

Correlation and Regression
Be sure that you are clear on the definition of each. I will probably have a question on your exam(s) that asks you to define correlation and regression. Or I may ask you to explain the difference between the two. The definitions are below:
Correlation quantifies the strength and direction of a relationship between two (quantitative) variables.
Regression describes the variation in the response variable (y) given a change in the explanatory variable (x).

Correlation vs. Regression restated
• Correlation is a single number that tells you about the strength and direction of the relationship.
• It in no way helps you predict a specific value for ‘y’ given an ‘x’.
• Regression is the process of generating a model to allow you to make predictions.

Making predictions
The equation that we have derived using our regression formula allows us to predict y for a given value of x.
Nobody in the study drank 6.5 beers, but by finding the value of BAC from the regression line for x = 6.5 we would predict a blood alcohol content of 0.104 mg/ml.
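To see the "plug in x" step spelled out, here is a minimal Python sketch that uses the model from the Coefficients table, BAC^ = -0.013 + 0.018 * num_beers. The function name predict_bac is just an illustration.

```python
def predict_bac(num_beers, b0=-0.013, b1=0.018):
    """Predicted BAC (the 'hat' value) from the regression model."""
    return b0 + b1 * num_beers


print(f"{predict_bac(6.5):.3f}")  # 0.104
print(f"{predict_bac(2.5):.3f}")  # 0.032
```

The second line is the 2.5 beers question from earlier: the formula gives a definite answer where reading the graph by eye only gave a rough guess.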

There is a positive linear relationship between the number of powerboats registered (in 1000’s) and the number of manatee deaths.
From the regression line, we generate the equation:
ŷ = 0.125x - 41.4
Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths? Here x = 500 (remember, x is in 1000’s), so ŷ = 0.125 * 500 - 41.4 = 21.1. Roughly 21 manatees.
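The easy mistake in this example is the units: the model expects powerboat registrations in thousands, so 500,000 registrations goes in as x = 500. A short Python sketch of the calculation for ŷ = 0.125x - 41.4, with a hypothetical function name, makes the conversion explicit:

```python
def predict_manatee_deaths(registrations):
    """Predicted manatee deaths from the model y^ = 0.125x - 41.4.

    'registrations' is the raw number of powerboats; the model's x is in 1000's.
    """
    x = registrations / 1000  # convert to thousands before plugging in
    return 0.125 * x - 41.4


print(round(predict_manatee_deaths(500_000), 1))  # 21.1
```

Plugging in 500,000 directly, without the conversion, would predict tens of thousands of deaths, which is a good sign that something went wrong with the units.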

Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be an extremely misleading thing to do, as seen here.
[Figure axis label: Height in Inches]

Another example of extrapolation
• In this example, there is a strong linear relationship between the time and the temperature. As the time progresses, the temperature keeps dropping. The danger in extrapolating, of course, comes from the fact that while the observations IN THIS RANGE are linear, the graph does level off at a later point and then begins sloping upwards. So, you could do a regression analysis on this particular period, but you could not extrapolate the results to a date before 11/21 or beyond 1/23.
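One way to protect yourself from accidental extrapolation is to keep track of the range of x values the model was built from and refuse to predict outside it. The sketch below does this for the beer / BAC model; the function name is hypothetical, and the observed range of 1 to 9 beers is an assumption made for the example, not a value stated in these slides.

```python
def predict_within_range(x, b0, b1, x_min, x_max):
    """Predict y^ = b0 + b1 * x, but only for x inside the observed range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]: "
            "predicting here would be extrapolation"
        )
    return b0 + b1 * x


# ASSUMED observed range for illustration only: 1 to 9 beers.
print(f"{predict_within_range(6.5, -0.013, 0.018, x_min=1, x_max=9):.3f}")  # 0.104

try:
    predict_within_range(20, -0.013, 0.018, x_min=1, x_max=9)
except ValueError as err:
    print("Refused:", err)
```

Note that 6.5 beers is fine even though nobody in the study drank exactly 6.5: it falls inside the range of the data, so it is not extrapolation.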

The y intercept
Taken by itself, the y-intercept is often meaningless. In fact, it is sometimes not even a possible value. For example, the y-intercept in our beer / BAC model (-0.013) tells us that at 0 beers, we have a negative blood alcohol content, which makes no sense… But the intercept is necessary for determining the regression model.

Categorical variables in scatterplots
Sometimes, even data that is purely quantitative is best divided up into multiple categories. If we neglect to do so, we risk drawing entirely false conclusions.
What may look like a positive linear relationship is in fact a series of unrelated negative linear associations. Plotting different habitats in different colors allows us to make that important distinction. Had we neglected to do so, we would have likely drawn the straight line (shown) and incorrectly concluded that there is a positive linear relationship.

Key Point
• If one of your variables can be divided into categories, plot each datapoint using a different symbol or color depending on its category.
• Another option is simply to use a separate graph for each category. Still, it is often more helpful to keep the two plots on the same chart if doing so allows you to observe differences between the categories.
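To see how ignoring categories can flip a conclusion, here is a small Python sketch that fits a slope to all of the points together and then to each category separately. The data and the category names "A" and "B" are made up purely for illustration, and the helper names are hypothetical.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope b1 of the regression line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def slopes_by_category(points):
    """Fit a separate slope for each category in a list of (x, y, category)."""
    groups = {}
    for x, y, category in points:
        xs, ys = groups.setdefault(category, ([], []))
        xs.append(x)
        ys.append(y)
    return {category: slope(xs, ys) for category, (xs, ys) in groups.items()}


# Made-up data: within each category y falls as x rises,
# but category B sits higher and further right than category A.
points = [(1, 3, "A"), (2, 2, "A"), (3, 1, "A"),
          (4, 6, "B"), (5, 5, "B"), (6, 4, "B")]

pooled = slope([p[0] for p in points], [p[1] for p in points])
print(pooled > 0)                   # True
print(slopes_by_category(points))   # {'A': -1.0, 'B': -1.0}
```

The pooled line slopes upward even though every category slopes downward, which is exactly the trap a color-coded scatterplot helps you spot.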

Comparison of men and women racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization.
Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.

Categorical explanatory variables
So far, we’ve drawn our scatterplots using quantitative variables (even when we broke them up into different categories). When the explanatory variable is categorical, a scatterplot might not be your best choice. However, there are ways of comparing different categories side by side.
Level of Education (categorical) vs Income (quantitative response). Comparing 5 different categories on a single graph. Boxplots are a great choice for this kind of comparison.
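A boxplot is a picture of the five-number summary (minimum, first quartile, median, third quartile, maximum), so one way to compare categories side by side without drawing anything is to compute that summary for each category. The Python sketch below does this; the category names and income figures are invented for the example, and software packages may use slightly different quartile rules than the "inclusive" method assumed here.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the numbers a boxplot draws."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Made-up incomes (in 1000's) for two hypothetical education categories.
income_by_education = {
    "High school": [20, 25, 30, 35, 40],
    "Bachelor's": [30, 40, 50, 60, 70],
}

for category, incomes in income_by_education.items():
    print(category, five_number_summary(incomes))
```

Lining up the summaries for each category shows the same thing as side-by-side boxplots: where each group is centered and how spread out it is.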
