Chapter 16

Chapter 16 Understanding Relationships – Numerical Data Part 2 Created by Kathy Fritz

The Simple Linear Regression Model

You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using $y = 32 + \frac{9}{5}x$. Suppose you want to convert 20˚C into Fahrenheit: $y = 32 + \frac{9}{5}(20) = 68$, so 20˚C = 68˚F. This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature). (Graph: temperature in Fahrenheit versus temperature in centigrade.)
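A deterministic relationship can be written as an ordinary function: the same input always gives the same output. The short sketch below uses the standard centigrade-to-Fahrenheit conversion; the function name is only illustrative.

```python
def centigrade_to_fahrenheit(x):
    """Deterministic relationship: y = 32 + (9/5) x, with no random deviation."""
    return 32 + 9 * x / 5

# Converting 20 degrees centigrade, as in the example above.
print(centigrade_to_fahrenheit(20))  # 68.0
```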

Now suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? Explain. The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship. A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model. The equation for a probabilistic model is $y = f(x) + e$, that is, y = (deterministic function of x) + (random deviation), where e is an “error” variable.

The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line. (Figure: population regression line with slope $\beta$ and y-intercept $\alpha$, showing the deviations $e_1$ and $e_2$ of observations made at $x_1$ and $x_2$.)

Basic Assumptions of the Simple Linear Regression Model

Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (the random deviation from the population regression line). It could be positive, negative, or even 0. The simple linear regression model makes four assumptions about the distribution of e at any particular x value in the population:
• The distribution of e at any particular value of x is normal.
• The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$. Because the values of e can be negative or positive, the deviations at any particular x value balance out, so their mean is 0.
• The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
• The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.
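One way to get a feel for these assumptions is to simulate data that satisfy them. The sketch below is only an illustration: the parameter values (intercept 10, slope 2, standard deviation 1.5) and the function name are hypothetical and do not come from any data set in this chapter.

```python
import random

def simulate_observations(alpha, beta, sigma_e, xs, seed=0):
    """Generate y = alpha + beta*x + e, where each e is an independent
    normal deviation with mean 0 and the same standard deviation sigma_e."""
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0, sigma_e) for x in xs]

# Hypothetical population line: alpha = 10, beta = 2, sigma_e = 1.5.
xs = [1, 2, 3] * 2000
ys = simulate_observations(alpha=10.0, beta=2.0, sigma_e=1.5, xs=xs)

# Deviations from the population regression line should average close to 0.
deviations = [y - (10.0 + 2.0 * x) for x, y in zip(xs, ys)]
mean_dev = sum(deviations) / len(deviations)
print(round(mean_dev, 3))
```

Setting `sigma_e=0` removes the random deviation, and every simulated point then falls exactly on the line.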

Just as there is variability in the values of e at any particular value of x, there is also variability in the y values. The mean of the y values at a fixed value x* is $\alpha + \beta x^*$, so the population regression line passes through the means of the y values. Thus the slope $\beta$ is the mean or expected change in y associated with a 1-unit increase in x. Because $\sigma_e$ is the same for any particular x value, the standard deviation of y for any fixed value x* is also $\sigma_e$. (Figure: normal curves for y at $x_1$, $x_2$, and $x_3$, centered at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$.)

Another look at $\sigma_e$

The smaller $\sigma_e$ is, the closer the points are to the regression line. The larger $\sigma_e$ is, the farther the points are from the regression line.

The estimates of the slope $\beta$ and the y-intercept $\alpha$ of the population regression line are the slope b and y-intercept a, respectively, of the least-squares line, $\hat{y} = a + bx$. The values of a and b are usually obtained using statistical software or a graphing calculator. Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
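For readers without statistical software at hand, the least-squares estimates can be computed directly from the usual formulas $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$. The sketch below assumes those standard formulas; the four data points are made up purely for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b), the intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def predict(a, b, x_star):
    """a + b*x_star: point estimate of the mean y value at x_star,
    and also the point prediction of a single y value at x_star."""
    return a + b * x_star

# Made-up illustrative data (not from the chapter).
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
a, b = least_squares_line(xs, ys)
print(round(a, 4), round(b, 4))  # 0.5 0.8
```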

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams). Sketch a scatterplot of these data. (Scatterplot: baby’s weight (g) versus mother’s age (yrs).) The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams). The slope of the least-squares line is b = 245.15: the mean weight of babies increases by approximately 245.15 grams for each increase of 1 year in the mother’s age. What is the point estimate for the mean weight of babies born to 18-year-old mothers? Substituting x = 18 into the estimated regression equation gives the point estimate for the mean weight of all babies born to 18-year-old mothers. This is also the prediction of the weight of a single baby born to a mother 18 years of age. Beware of the danger of extrapolation. That is, be careful when trying to make an estimate or prediction for any x value much outside the range of the observed x values in the data. (Scatterplot: baby’s weight (g) versus mother’s age (yrs).)

The statistic for estimating the variance $\sigma_e^2$ is $s_e^2 = \frac{\text{SSResid}}{n - 2}$, where $\text{SSResid} = \sum (y - \hat{y})^2$. The subscript “e” is a reminder that you are estimating the variance of the “errors” or residuals. The estimate of $\sigma_e$ is the estimated standard deviation $s_e = \sqrt{s_e^2}$. The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line. Note that the degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is df = n – 2.
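The calculation of $s_e$ is short enough to sketch in code. The example below assumes the residual sum of squares is computed as the sum of squared differences between observed and fitted y values; the data and the fitted line (a = 0.5, b = 0.8) are made up for illustration.

```python
import math

def estimated_std_dev(xs, ys, a, b):
    """s_e = sqrt(SSResid / (n - 2)) for the fitted line y-hat = a + b*x."""
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    df = len(xs) - 2  # two estimated quantities (a and b) cost two degrees of freedom
    return math.sqrt(ss_resid / df)

# Made-up data whose least-squares line is y-hat = 0.5 + 0.8x.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
s_e = estimated_std_dev(xs, ys, a=0.5, b=0.8)
print(round(s_e, 4))  # 0.9487
```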

How do we know if the estimated regression equation will be a useful model for predicting y values from x? The residual plot and the values of $s_e$ and $r^2$ can be used to determine the estimated regression equation’s usefulness. Recall that the coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.
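As a reminder of how $r^2$ is obtained, the sketch below uses the standard identity $r^2 = 1 - \text{SSResid}/\text{SSTo}$, where SSTo is the total sum of squares of the y values about their mean. The data and fitted line are made up for illustration.

```python
def coefficient_of_determination(xs, ys, a, b):
    """r^2 = 1 - SSResid/SSTo for the fitted line y-hat = a + b*x."""
    y_bar = sum(ys) / len(ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_resid / ss_total

# Made-up data whose least-squares line is y-hat = 0.5 + 0.8x.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
r_squared = coefficient_of_determination(xs, ys, a=0.5, b=0.8)
print(round(r_squared, 2))  # 0.64
```

Here about 64% of the variability in the made-up y values is explained by the linear relationship with x.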

Wildlife biologists monitor the ecological health of the Rocky Mountain elk. Directly measuring elk weights is difficult and expensive in terms of equipment, manpower, and time. Biologists found that they could reliably estimate the weight of an elk by measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk. There appears to be a strong positive linear relationship between the chest girth and weight of elk.

Elk Weight Problem Continued . . .

Partial Minitab regression output is shown below. The output gives the estimated regression equation. Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth ($r^2 = 0.865$). The magnitude of a typical deviation from the least-squares line is about $s_e = 23.6626$ kg, which is relatively small in comparison to the y values (shown in the scatterplot).

Inferences Concerning the Slope of the Population Regression Line

Properties of the Sampling Distribution of b

Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for $\beta$. When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
• The mean value of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is centered at the value of $\beta$.
• The standard deviation of the statistic b is $\sigma_b = \frac{\sigma_e}{\sqrt{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$.
• The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since $\sigma_b$ is usually unknown, the estimated standard deviation of the statistic b is $s_b = \frac{s_e}{\sqrt{S_{xx}}}$. When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable $t = \frac{b - \beta}{s_b}$ is the t distribution with df = n – 2.
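The estimated standard deviation of b is normally read from software output, but it can be sketched by hand. The code below assumes the standard formula $s_b = s_e/\sqrt{S_{xx}}$ with $S_{xx} = \sum (x - \bar{x})^2$; the data are made up for illustration.

```python
import math

def slope_and_standard_error(xs, ys):
    """Return (b, s_b): the least-squares slope and its estimated
    standard deviation s_b = s_e / sqrt(S_xx)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(ss_resid / (n - 2))
    return b, s_e / math.sqrt(s_xx)

# Made-up illustrative data (not from the chapter).
b, s_b = slope_and_standard_error([1, 2, 3, 4], [1, 3, 2, 4])
print(round(b, 4), round(s_b, 4))  # 0.8 0.4243
```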

Confidence Interval for $\beta$

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form $b \pm (t \text{ critical value}) \cdot s_b$, where the t critical value is based on df = n – 2.

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3000 animals. It is important to monitor and manage the size of the bison population. Researchers have studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison. Data from 1981-1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown on page 750. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

Bison Population Problem Continued . . .

Step 1 (Estimate): The value of $\beta$, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.
Step 2 (Method): Because the answers to the four key questions are estimation, sample data, two numerical variables, and one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. A 95% confidence level will be used.

Bison Population Problem Continued . . .

Step 3 (Check):
• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year’s reproduction and snowfall are independent of previous years.
• A scatterplot of the data looks linear, and the spread does not seem different for different values of x.
• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.

Bison Population Problem Continued . . .

Step 4 (Calculate): JMP regression output is shown here. From the output, the slope is b = -0.0137 and $s_b$ = 0.005989. df = 17 – 2 = 15. The t critical value for a 95% confidence level and df = 15 is 2.13.
$b \pm (t \text{ critical value}) \cdot s_b = -0.0137 \pm (2.13)(0.005989) = (-0.0265, -0.0009)$
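The arithmetic in Step 4 can be checked with a few lines of code. The sketch below simply evaluates b ± (t critical value)(s_b) using the values quoted from the JMP output and the t critical value of 2.13; the function name is illustrative.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) = b -/+ (t critical value) * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin

# Values from the bison example: b = -0.0137, s_b = 0.005989, df = 15.
lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.005989, t_critical=2.13)
print(round(lower, 4), round(upper, 4))  # -0.0265 -0.0009
```

Both endpoints are negative, so 0 is not a plausible value for the slope at the 95% confidence level.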

Bison Population Problem Continued . . .

Step 5 (Communicate Results):
Confidence interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between -0.0265 and -0.0009.
Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Summary of Hypothesis Tests Concerning $\beta$

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:
• The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
• The standard deviation of e is $\sigma_e$, which does not depend on x.
• The distribution of e at any particular x value is normal.
• The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

Summary of Hypothesis Tests Concerning $\beta$ Continued . . .

When these conditions are met, the following test statistic can be used: $t = \frac{b - \beta_0}{s_b}$, where $\beta_0$ is the hypothesized value from the null hypothesis. Form of the null hypothesis: $H_0: \beta = \beta_0$. When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n – 2.

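Summary of Hypothesis Tests Concerning $\beta$ Continued . . .

Associated P-value for each form of the alternative hypothesis:
• $H_a: \beta > \beta_0$: area to the right of t under the appropriate t curve
• $H_a: \beta < \beta_0$: area to the left of t under the appropriate t curve
• $H_a: \beta \neq \beta_0$: 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative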
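The test statistic and its P-value would normally come from statistical software. As a rough sketch of what that software does, the code below computes $t = (b - \beta_0)/s_b$ and approximates the tail areas of the t distribution by numerically integrating its density with Simpson's rule. The function names are illustrative, and the bison numbers are reused only as a worked input; the source does not carry out this test for the bison data.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def area_to_right(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule on [0, |t|])."""
    h = abs(t) / steps
    total = t_pdf(0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    right_of_abs = 0.5 - total * h / 3  # the curve is symmetric about 0
    return right_of_abs if t >= 0 else 1 - right_of_abs

def slope_t_test(b, s_b, n, beta_0=0.0, alternative="two-sided"):
    """Return (t, P-value) for H0: beta = beta_0, with df = n - 2."""
    t = (b - beta_0) / s_b
    right = area_to_right(t, n - 2)
    if alternative == "greater":
        p_value = right
    elif alternative == "less":
        p_value = 1 - right
    else:  # two-sided: twice the tail area beyond t
        p_value = 2 * min(right, 1 - right)
    return t, p_value

# Illustration only: slope and s_b from the bison example, n = 17 years.
t_stat, p_value = slope_t_test(b=-0.0137, s_b=0.005989, n=17)
print(round(t_stat, 2))  # -2.29
```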

Inference for a population slope generally focuses on two questions:
(1) What are plausible values for the population slope? This question can be addressed by calculating a confidence interval.
(2) Is the population slope different from zero? This question can be answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$.
When the null hypothesis $H_0: \beta = 0$ is true, the population regression line is a horizontal line. If $\beta$ is in fact equal to 0, knowledge of x will be of no use – it will have no “utility” for predicting y. This test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression.

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. The test statistic is the t ratio: $t = \frac{b - 0}{s_b} = \frac{b}{s_b}$. If $H_0$ is rejected, you can conclude that the simple linear regression model is useful for predicting y.

When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs, plus some recent songs familiar to college students. Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week. After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released. The accompanying data show the actual release year and the average of the release years given by the students. Is there a relationship between the judged and actual release year for these songs? Let’s perform a model utility test to answer this question.

Song Recognition Problem Continued . . .

Step 1 (Hypotheses): $H_0: \beta = 0$ versus $H_a: \beta \neq 0$, where $\beta$ is the slope of the population regression line relating the judged release year to the actual release year.
Step 2 (Method): Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.

Song Recognition Problem Continued . . .

Step 3 (Check): For this example you can assume that the assumptions are reasonable and proceed with the model utility test. (We will see how to check if the four assumptions of the simple linear regression model are reasonable in the next section.)

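Song Recognition Problem Continued . . .

Step 4 (Calculate): JMP regression output is shown here. The output reports the slope b and its estimated standard deviation $s_b$, which give the test statistic t = 13.48. P-value = 2P(t > 13.48) ≈ 0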

Song Recognition Problem Continued . . .

Step 5 (Communicate Results):
Decision: Reject $H_0$. Because the P-value is less than the selected significance level, the null hypothesis is rejected.
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Checking Model Adequacy

The simple linear regression model is $y = \alpha + \beta x + e$, where e represents the random deviation of a y value from the population regression line $\alpha + \beta x$. The inference methods for the slope (the confidence interval and the model utility test) require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid. These assumptions include:
• At any particular x value, the distribution of e is normal.
• At any particular x value, the standard deviation of e is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).

Residual Analysis

If the deviations $e_1, e_2, \ldots, e_n$ from the population regression line were available, they could be examined for any inconsistencies with model assumptions. However, these deviations are $e_1 = y_1 - (\alpha + \beta x_1), \ldots, e_n = y_n - (\alpha + \beta x_n)$. These values of e can ONLY be calculated if $\alpha$ and $\beta$ are known, which is almost never the case. Instead, diagnostic checks MUST be based on the residuals $y_i - \hat{y}_i = y_i - (a + b x_i)$, which are the deviations from the estimated regression line. Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.
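Residuals are straightforward to compute once the least-squares line has been fitted. The sketch below fits a line to a small made-up data set (not from the chapter) and then subtracts each fitted value from the observed y value.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

def residuals(xs, ys, a, b):
    """Deviations of the observed y values from the estimated line a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up illustrative data.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)
print([round(r, 2) for r in res])  # [-0.3, 0.9, -0.9, 0.3]
```

Residuals from a least-squares line with an intercept always sum to 0 (up to rounding), which is a quick sanity check on the calculation.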

Residual Analysis

Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals: $\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$. Recall that $\mu_e = 0$, so the numerator is really residual – 0. Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
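To show why the calculation is tedious by hand, here is a sketch of what software typically does. It assumes the commonly used formula for the estimated standard deviation of the i-th residual, $s_e\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}$; the slides do not state this formula, so treat it as an assumption, and note that particular packages may differ in detail. The data are made up.

```python
import math

def standardized_residuals(xs, ys):
    """Residual divided by its estimated standard deviation, which
    depends on how far each x value is from the mean of the x values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(r * r for r in resid) / (n - 2))
    result = []
    for x, r in zip(xs, resid):
        sd = s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx)
        result.append(r / sd)
    return result

# Made-up illustrative data.
std_res = standardized_residuals([1, 2, 3, 4], [1, 3, 2, 4])
print([round(r, 3) for r in std_res])  # [-0.577, 1.134, -1.134, 0.577]
```

In this small example none of the standardized residuals lies outside the band from -2 to 2.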

Revisiting the Elk

Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg) for a sample of 19 Rocky Mountain elk. Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.

Revisiting the Elk Continued . . .

Let’s examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761. The largest residual = 38.1397 and the associated standardized residual = 1.81294. The smallest residual = -38.2661 and the associated standardized residual = -1.92313. Neither one of these is surprisingly large. Notice that the boxplots of the residuals and standardized residuals are nearly identical. The boxplots are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.

Revisiting the Elk Continued . . .

Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.) The pattern in the normal probability plots is reasonably straight, confirming that the assumption of normality of the error distribution is reasonable. The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.

A Look at Residual Plots

• A desirable plot exhibits no pattern and has no point that lies far away from the other points.
• Two of the plots contain points far away from the others. These points can have substantial effects on estimates of $\alpha$ and $\beta$ as well as other quantities.
• One plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
• In one plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!

Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate tracheal tube insertion depth and other variables such as height, weight, and age. Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm). Residual plots like the one shown here are desirable.
• There are no unusually large residuals, since no point lies much outside the horizontal band between -2 and 2.
• There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

Newborns and Infants Problem Continued . . .

But consider what happens when the relationship between insertion depth and weight is examined. While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights. The linear regression model is not appropriate.
