
  1. Chapter 14 Regression and Forecasting Models

  2. Introduction • Many decision-making applications depend on a forecast of some quantity. • Here are some examples: • When a company plans its ordering or production schedule for a product, it must forecast the customer demand for this product so that it can stock appropriate quantities—neither too much nor too little. • When an organization plans to invest in stocks, bonds, or other financial instruments, it typically attempts to forecast movements in stock prices and interest rates.

  3. Introduction continued • Many forecasting methods are available, and all practitioners have their favorites. • To say the least, there is little agreement among practitioners or theoreticians as to the best forecasting method. • The methods can generally be divided into three groups: • Judgmental methods, • Regression methods, and • Extrapolation methods.

  4. Introduction continued • Regression models, also called causal models, forecast a variable by estimating its relationship with other variables. • The technique of regression is extremely popular, due to its flexibility and power. • Regression can estimate relationships between time series variables or cross-sectional variables (those that are observed at a single point in time), and it can estimate linear or nonlinear relationships.

  5. Introduction continued • Extrapolation methods, also called time series methods, use past data of a time series variable - and nothing else - to forecast future values of the variable. • Many extrapolation methods are available, including moving averages and exponential smoothing. • All extrapolation methods search for patterns in the historical series and then attempt to extrapolate these patterns into the future.

  6. Overview of regression models • Regression analysis is the study of relationships between variables. • It is one of the most useful tools for a business analyst because it applies to so many situations. • Some potential uses of regression analysis in business address the following questions: • How do wages of employees depend on years of experience, years of education, and gender? • How does the current price of a stock depend on its own past values, as well as the current and past values of a market index?

  7. Overview of regression models continued • Each of these questions asks how a single variable, such as selling price or employee wages, depends on other relevant variables. • If you can estimate this relationship, you can better understand how the world operates and also do a better job of predicting the variable in question.

  8. Overview of regression models continued • Regression analysis can be categorized in several ways. • One categorization is based on the type of data being analyzed. • There are two basic types: cross-sectional data and time series data. • Cross-sectional data are usually data gathered from approximately the same period of time from a cross section of a population. • In contrast, time series studies involve one or more variables that are observed at several, usually equally spaced, points in time.

  9. Overview of regression models continued • In every regression study, the goal is to explain or predict a particular variable. This is called the dependent variable (or the response variable) and is often denoted generically as Y. • To help explain or predict the dependent variable, one or more explanatory variables are used. • These variables are also called independent variables or predictor variables, and they are often denoted generically as Xs.

  10. Overview of regression models continued • A second categorization of regression analysis involves the number of explanatory variables in the analysis. • If there is a single explanatory variable, the analysis is called simple regression. • If there are several explanatory variables, it is called multiple regression. • There are important differences between simple and multiple regression. • The primary difference, as the name implies, is that simple regression is simpler, from calculations to interpretation.

  11. The least-squares line • The basis for regression is a fairly simple idea. If you create a scatterplot of one variable Y versus another variable X, you obtain a swarm of points that indicates any possible relationship between these two variables. • The terms scatterplot, scatter chart, and XY chart are all used to describe the same thing. • To quantify this relationship, you try to find the best-fitting line (or curve) through the points in the graph.

  12. The least-squares line continued • Consider the scatterplot below. • The line shown is one possible fit. It appears to be a reasonably good fit, but a numerical measure of goodness-of-fit is needed so that this fit can be compared with the fits of other possible lines.

  13. The least-squares line continued • The measure commonly used is the sum of squared residuals. • Here, a residual is defined as the vertical distance from a point to the line, as illustrated for points A and B. • Put differently, a residual is a prediction error. It is the difference between an observed Y and the predicted Y from the regression line. • The least-squares regression line minimizes the sum of squared residuals.

  14. Prediction and fitted values • After you find the least-squares line, you can use it for prediction. • Geometrically, this is easy. Given any value of X, you predict the corresponding value of Y to be the height of the line above this X. • The predicted Y value is called the fitted value. • In contrast, the height of any point is the actual value of Y for this point. • This implies that: • Residual = Actual value - Fitted value
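
As a concrete illustration, here is a minimal Python sketch (the data and the candidate line coefficients are made up for illustration) of fitted values, residuals, and the sum of squared residuals that the least-squares line minimizes:

    import numpy as np

    # Hypothetical data: X values and observed Y values
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # A candidate line Y = a + b*X (coefficients chosen only for illustration)
    a, b = 0.2, 1.95

    fitted = a + b * x          # fitted Y values: the height of the line at each X
    residuals = y - fitted      # Residual = Actual value - Fitted value
    sse = np.sum(residuals**2)  # sum of squared residuals; least squares chooses a, b to minimize this

    print(fitted, residuals, sse)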

  15. Measures of goodness-of-fit • Besides the sum of squared residuals, other measures of goodness-of-fit typically are quoted in regression analyses. • The standard error of estimate is obtained by averaging the squared residuals and then taking the square root, as shown in the following formula. • The standard error of estimate is useful because it provides an estimate of the magnitude of the prediction errors you are likely to make.
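
One common form of this formula, with n observations and k explanatory variables (so k = 1 and a denominator of n − 2 in simple regression), includes a degrees-of-freedom adjustment:

    \[ s_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - k - 1}} \]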

  16. Measures of goodness-of-fit continued • Another goodness-of-fit measure is called the multiple R, defined as the correlation between the actual Y values and the fitted Y values. • In general, a correlation is a number between -1 and +1 that measures the goodness-of-fit of the linear relationship between two variables. • A correlation close to -1 or +1 indicates a tight linear fit, whereas a correlation close to 0 tends to indicate no linear fit—usually a shapeless swarm of points.

  17. Measures of goodness-of-fit continued • In regression, you want the fitted Y values to be close to the actual Y values, so you want a scatterplot of the actual values versus the fitted values to be close to a 45° line, with the multiple R close to +1.

  18. Measures of goodness-of-fit continued • If you square the multiple R, you get a measure that has a more direct interpretation. • This measure is known simply as R-square. It represents the percentage of the variation of the Y values explained by the Xs included in the regression equation.

  19. Simple regression models • In this section, we discuss how to estimate the regression equation for a dependent variable Y based on a single explanatory variable X. • The common terminology is that “Y is regressed on X.” • This is the equation of the least-squares line passing through the scatterplot of Y versus X. • Because we are estimating a straight line, the regression equation is of the form Y = a + bX, where, as in basic algebra, a is called the intercept and b is called the slope.
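
The chapter's examples use Excel; as a rough sketch of the same calculation, the following Python code (with hypothetical data) estimates the intercept a and slope b by least squares:

    import numpy as np

    # Hypothetical data: explanatory variable X and dependent variable Y
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.2, 4.9, 7.1, 8.8, 11.2, 12.9])

    # Least-squares formulas: b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()

    # NumPy's built-in least-squares polynomial fit gives the same result
    b_check, a_check = np.polyfit(x, y, deg=1)

    print(f"Y = {a:.3f} + {b:.3f} X")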

  20. Regression-based trend models • A special case of simple regression is when the only explanatory variable is time, usually labeled t (rather than X). • In this case, the dependent variable Y is a time series variable, such as a company’s monthly sales, and the purpose of the regression is to see whether this dependent variable follows a trend through time. • With a linear trend line, the variable changes by a constant amount each period. • With an exponential trend line, the variable changes by a constant percentage each period. • Example 14.1 demonstrates how easily trends can be estimated with Excel.
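
Example 14.1 is worked in Excel; a rough Python equivalent (hypothetical monthly data) fits a linear trend by regressing Y on the time index t, and an exponential trend by regressing ln(Y) on t:

    import numpy as np

    # Hypothetical monthly sales and a time index t = 1, 2, ..., n
    sales = np.array([112.0, 118.0, 127.0, 133.0, 142.0, 151.0, 160.0, 171.0])
    t = np.arange(1, len(sales) + 1)

    # Linear trend: Y = a + b*t  (constant amount of change per period)
    b_lin, a_lin = np.polyfit(t, sales, deg=1)

    # Exponential trend: Y = a * exp(b*t)  (constant percentage change per period),
    # estimated by fitting a straight line to ln(Y)
    b_exp, log_a = np.polyfit(t, np.log(sales), deg=1)
    a_exp = np.exp(log_a)
    pct_growth = np.exp(b_exp) - 1  # approximate growth rate per period

    print(f"linear: Y = {a_lin:.1f} + {b_lin:.1f} t")
    print(f"exponential: Y = {a_exp:.1f} * exp({b_exp:.3f} t), about {pct_growth:.1%} per period")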

  21. Caution about exponential trend lines • Exponential trendlines are often used in predicting sales and other economic quantities. • However, we urge caution with such predictions. It is difficult for any company to sustain a given percentage increase year after year.

  22. Using an explanatory variable other than time • You are not restricted to using time as the explanatory variable in simple regression. • Any variable X that is related to the dependent variable Y is a candidate. • Example 14.2 illustrates one such possibility. • It shows how you can still take advantage of Excel’s Add Trendline option, even though the resulting trend line is not what you usually think of with trend—a trend through time.

  23. Multiple regression models • When you try to explain a dependent variable Y with regression, there are often a multitude of explanatory variables to choose from. • In this section, we explore multiple regression, where the regression equation for Y includes a number of explanatory variables, the Xs. • The general form of this equation is Y = a + b1X1 + b2X2 + … + bkXk.

  24. Multiple regression models continued • In the previous equation, a is again the Y-intercept, and b1 through bk are the slopes. • Collectively, a and the bs are called the regression coefficients. • Each slope coefficient is the expected change in Y when that particular X increases by one unit and the other Xs in the equation remain constant. • For example, b1 is the expected change in Y when X1 increases by one unit and the other Xs in the equation, X2 through Xk, remain constant. • We illustrate these ideas in the following example.
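
A minimal multiple-regression sketch in Python (hypothetical data, not the chapter's example), estimating a, b1, and b2 by ordinary least squares:

    import numpy as np

    # Hypothetical data: two explanatory variables and a dependent variable
    x1 = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 24.0])
    x2 = np.array([ 3.0,  4.0,  4.0,  5.0,  6.0,  7.0])
    y  = np.array([55.0, 62.0, 70.0, 81.0, 88.0, 101.0])

    # Design matrix with a column of ones for the intercept a
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Ordinary least squares: coefficients that minimize the sum of squared residuals
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b1, b2 = coef

    print(f"Y = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")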

  25. A note about adjusted R-square • You are probably wondering what the adjusted R-square value means in the multiple regression output. • Although it has no simple interpretation like R-square (percentage of variation explained), it is useful for comparing regression equations. • The problem with R-square is that it can never decrease when extra explanatory variables are added to a regression equation. However, there ought to be some penalty for adding variables that don’t really belong. • This is the purpose of adjusted R-square, which acts as a monitor. If you add one or more extra explanatory variables to an already existing equation, adjusted R-square can decrease.
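
One standard formula, with n observations and k explanatory variables, makes the penalty explicit:

    \[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \]

Because the ratio (n − 1)/(n − k − 1) grows as k grows, an added variable must raise R-square enough to offset the penalty, or adjusted R-square falls.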

  26. Incorporating categorical variables • The goal of regression analysis is to find good explanatory variables that explain some dependent variable Y. • Often these explanatory variables are quantitative, such as the Units Produced variables in the two previous examples. • However, there are often useful qualitative categorical variables that help explain Y, such as gender (male or female), region of country (east, south, west, or north), quarter of year (Q1, Q2, Q3, or Q4), and so on.

  27. Incorporating categorical variables continued • Because regression works entirely with numbers, categorical variables must typically be transformed into numeric variables that can be used in a regression equation. • This is usually done by creating dummy variables, also called 0–1 variables or indicator variables. • For any categorical variable, you create a dummy variable for each possible category. Its value is 1 for each observation in that category, and it is 0 otherwise.

  28. Incorporating categorical variables continued • There is one technical rule you must follow when using dummy variables in regression. • If a categorical variable has m categories, you should use only m – 1 of the m possible dummy variables in the regression equation. • You can omit any one of the dummies, which becomes the reference (or base) category. • Example 14.4, another extension of Example 14.2, illustrates the use of dummy variables.
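
A sketch of creating the m − 1 dummy variables in Python with pandas (hypothetical quarter-of-year data; drop_first omits one category, which becomes the reference category):

    import pandas as pd

    # Hypothetical data with a categorical variable (quarter of year)
    df = pd.DataFrame({
        "Quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
        "Sales":   [210,  250,  240,  300,  220,  260],
    })

    # Create 0-1 dummy variables; drop_first=True omits Q1, making it the reference category
    dummies = pd.get_dummies(df["Quarter"], drop_first=True).astype(int)
    df_reg = pd.concat([df, dummies], axis=1)  # columns Q2, Q3, Q4 can now enter the regression

    print(df_reg)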

  29. A caution about regression assumptions • In this brief introduction to regression, we have discussed only the basic elements of regression analysis, and we have omitted many of the technical details that can be found in more complete statistics books. • In particular, we have not discussed what can go wrong if various statistical assumptions behind regression analysis are violated. • Although there is not room here for a complete discussion of these assumptions and their ramifications, we briefly state a few cautions you should be aware of.

  30. Multicollinearity • In the best of worlds, the explanatory variables, the Xs, should provide nonoverlapping information about the dependent variable Y. They should not provide redundant information. • However, sometimes redundancy is difficult to avoid. • When you do include Xs that are highly correlated with one another, you introduce a problem called multicollinearity.

  31. Multicollinearity continued • The problem is that when Xs are highly correlated with one another, it is virtually impossible to sort out their separate influences on Y. • This inability to sort out separate effects can even lead to “wrong” signs on the regression coefficients. • Therefore, the presence of multicollinearity makes regression equations difficult to interpret. • Fortunately, however, multicollinearity is not a problem if you are concerned only with prediction of new Ys.
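
A quick diagnostic sketch (with simulated, hypothetical data): inspecting the correlation matrix of the Xs flags pairs of explanatory variables that carry largely redundant information:

    import numpy as np
    import pandas as pd

    # Hypothetical explanatory variables; x2 is nearly a multiple of x1 (redundant information)
    rng = np.random.default_rng(0)
    x1 = rng.normal(50, 10, size=100)
    x2 = 2 * x1 + rng.normal(0, 1, size=100)   # highly correlated with x1
    x3 = rng.normal(20, 5, size=100)           # unrelated to x1 and x2

    X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
    print(X.corr().round(2))  # correlations near +1 or -1 between Xs signal multicollinearity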

  32. Nonlinear relationships • If scatterplots of Y versus the various Xs indicate any nonlinear relationships, a linear relationship will almost certainly lead to a poor fit and poor predictions. • Fortunately, as with the exponential trend line, there are often nonlinear transformations of Y and/or the Xs that “straighten out” the scatterplots and allow you to use linear regression. • We will not discuss such transformations here. We simply warn you that if the scatterplots of the original variables do not appear to be linear, you should not blindly proceed to estimate a linear relationship.

  33. Nonconstant error variance • One assumption of regression is that the variation of the Y values above any values of the Xs is the same, regardless of the particular values of the Xs chosen. • Sometimes this assumption is clearly violated. • Typically, nonconstant error variance appears in a scatterplot as a fan-shaped swarm of points. • We simply alert you to this possibility and suggest that you obtain expert help if you spot an obvious fan shape.

  34. Autocorrelation of residuals • Autocorrelation means that a variable’s values are correlated with its own previous values. • This typically occurs in time series variables. • It is not difficult to detect autocorrelation of residuals (although we will not discuss the measures for doing so), but it is much more difficult to deal with autocorrelation appropriately. • Again, you should consult an expert if you believe your time series analysis is subject to autocorrelation.

  35. Overview of time series models • To this point, we have discussed regression as a method of forecasting. • Because of its flexibility, regression can be used equally well for time series variables and for cross-sectional variables. • From here on, however, we focus exclusively on time series variables, and we discuss nonregression approaches to forecasting. • All of these approaches fall under the general umbrella of extrapolation methods.

  36. Overview of time series models continued • With an extrapolation method, you form a time series plot of the variable Y that you want to forecast, analyze any patterns inherent in this time series plot, and extrapolate these patterns into the future. • You do not use any other variables—the Xs from the previous section—to forecast Y; you use only past values of Y to forecast future values of Y. • The idea is that history tends to repeat itself. Therefore, if you can discover the patterns in the historical data, you ought to obtain reasonably good forecasts by projecting these historical patterns into the future.

  37. Components of time series • A time series variable Y typically contains one or more components. • These components are called the trend component, the seasonal component, the cyclic component, and the random (or noise) component.

  38. Trend component • If the observations increase or decrease regularly over time, we say that the time series has a trend. • The graphs below illustrate several possible trends.

  39. Seasonal component • Many time series have a seasonal component. • An important aspect of the seasonal component is that it tends to be predictable from one year to the next. That is, the same seasonal pattern tends to repeat itself every year.

  40. Cyclic component • The third component of a time series is the cyclic component. • By studying past movements of many business and economic variables, it becomes apparent that business cycles affect many variables in similar ways.

  41. Random (noise) component • The final component in a time series is called the random component, or simply noise. • This unpredictable component gives most time series graphs their irregular, zigzag appearance. • Usually, a time series can be determined only to a certain extent by its trend, seasonal, and cyclic components. Then other factors determine the rest.

  42. Measures of forecast error • When you use any extrapolation method, you build a model to track the observed historical data, and then you use this model to forecast future values of the data. • The only way you can judge whether the future forecasts are likely to be any good is to measure how well the model tracks the historical data. • Time series analysts typically use several measures.

  43. Measures of forecast error continued • The three measures of forecasting accuracy typically used are MAE (mean absolute error), RMSE (root mean square error), and MAPE (mean absolute percentage error). • These are given by the following formulas, where N is the number of historical periods for which the model provides forecasts.
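
In standard notation, writing E_t = Y_t − F_t for the error between the actual value Y_t and the forecast F_t in period t, the usual definitions are:

    \[ \text{MAE} = \frac{1}{N}\sum_{t=1}^{N} |E_t|, \qquad \text{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N} E_t^2}, \qquad \text{MAPE} = \frac{100\%}{N}\sum_{t=1}^{N} \left|\frac{E_t}{Y_t}\right| \]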

  44. Measures of forecast error continued • RMSE is similar to a standard deviation in that the errors are squared; because of the square root, its units are the same as those of the original variable. • MAE is similar to RMSE except that absolute values of errors are used instead of squared errors. • MAPE is probably the easiest measure to understand because it does not depend on the units of the original variable; it is always stated as a percentage.

  45. Measures of forecast error continued • Depending on the forecasting software used, one or more of these measures will typically be reported. • Fortunately, models that make any one of these measures small tend to make the others small as well, so that you can choose whichever measure you want to focus on.

  46. Measures of forecast error continued • One caution is in order, however. The measures MAE, RMSE, and MAPE are used to see how well the forecasting model tracks historical data. • But even if these measures are small, there is no guarantee that future forecasts will be accurate.

  47. Moving averages models • Perhaps the simplest and one of the most frequently used extrapolation methods is the method of moving averages. • Very simply, the forecast for any period with this method is the average of the observations from the past few periods. • To implement the moving averages method, you must first choose a span, the number of terms in each moving average.

  48. Moving averages models continued • The role of the span is important. If the span is large - say, 12 months - then many observations go into each average, and extreme values have relatively little effect on the averages. • The resulting series of forecasts will be much smoother than the original series. (For this reason, the moving average method is called a smoothing method.) • If the span is small—say, three months—then extreme observations have a larger effect on the averages, and the forecast series will be much less smooth.
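
A minimal Python sketch of moving-average forecasts (hypothetical weekly sales; the forecast for each period is the average of the previous span observations):

    import numpy as np

    # Hypothetical weekly sales
    sales = np.array([23.0, 25.0, 22.0, 27.0, 30.0, 28.0, 26.0, 31.0, 33.0, 29.0])
    span = 3  # number of past observations averaged into each forecast

    # Forecast for period t is the average of the previous `span` observations
    forecasts = np.full(len(sales), np.nan)
    for t in range(span, len(sales)):
        forecasts[t] = sales[t - span:t].mean()

    errors = sales[span:] - forecasts[span:]
    mae = np.mean(np.abs(errors))  # one way to judge how well the model tracks the history
    print(forecasts, f"MAE = {mae:.2f}")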

  49. Moving averages models continued • What span should you use? This requires some judgment. • The following example illustrates the use of moving averages on a series of weekly sales. • We continue to take advantage of the StatTools add-in, which includes procedures for creating time series graphs and implementing moving averages and exponential smoothing methods.

  50. Exponential smoothing models • The main criticism of the moving averages method is that it puts equal weight on each value in a typical moving average. • Exponential smoothing is a method that addresses this criticism. • It bases its forecasts on a weighted average of past observations, with more weight put on the more recent observations.
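
The chapter implements this method with the StatTools add-in; as a rough sketch of the idea, the following Python code (hypothetical data) applies simple exponential smoothing, where the smoothing constant alpha controls how much weight recent observations receive:

    import numpy as np

    # Hypothetical weekly sales and a smoothing constant alpha between 0 and 1
    sales = np.array([23.0, 25.0, 22.0, 27.0, 30.0, 28.0, 26.0, 31.0])
    alpha = 0.3

    # Smoothed level: new level = alpha * latest observation + (1 - alpha) * previous level
    level = sales[0]             # initialize with the first observation
    forecasts = [np.nan]         # no forecast for the first period
    for y in sales[1:]:
        forecasts.append(level)  # forecast for this period is the level through the prior period
        level = alpha * y + (1 - alpha) * level

    print(forecasts, f"next-period forecast = {level:.2f}")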
