
DSCI 5340: Predictive Modeling and Business Forecasting, Spring 2013 – Dr. Nick Evangelopoulos


Presentation Transcript


  1. DSCI 5340: Predictive Modeling and Business Forecasting, Spring 2013 – Dr. Nick Evangelopoulos. Lecture 1: Introduction to Business Forecasting. Review of Simple Regression (Ch. 1-3). Some material taken from: Michael Hand (Willamette University), Biz/Ed

  2. Forecasting • “It is far better to foresee even without certainty than not to foresee at all.” – Henri Poincaré (1854-1912), polymath and chaos theory pioneer, The Foundations of Science. http://www.dilbert.com/2010-07-02/

  3. Why Forecast? • The effectiveness of almost every human endeavor, every public initiative, depends in part upon unknown and uncertain future outcomes – the demand for services, the revenues to fund them. • The quality of decisions about whether or not to engage and at what level improves with the reliability of supporting forecasts.

  4. The Two Types of Forecasting • Qualitative – seeking opinions on which to base decision making: consumer panels, focus groups, etc. • Quantitative – using statistical data to help inform decision making: identifying trends; moving averages (seasonal, cyclical, random components); simple extrapolation (a minimal sketch of these techniques follows below)
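A minimal sketch of the quantitative techniques named above, using hypothetical quarterly demand figures; the series, the window length, and the one-step extrapolation are illustrative assumptions, not an example from the lecture:

```python
# Minimal sketch of two quantitative techniques from the list above:
# a centered moving average and a naive linear extrapolation.
# The quarterly demand figures are made up for illustration.
import numpy as np

demand = np.array([112, 130, 95, 160, 120, 140, 100, 170], dtype=float)

# A 4-quarter moving average smooths out the seasonal pattern.
window = 4
trend = np.convolve(demand, np.ones(window) / window, mode="valid")

# Simple extrapolation: fit a straight line to the series and
# project it one period ahead.
t = np.arange(len(demand))
slope, intercept = np.polyfit(t, demand, deg=1)
next_period_forecast = intercept + slope * len(demand)

print("Smoothed trend:", trend.round(1))
print("Naive extrapolation for next quarter:", round(next_period_forecast, 1))
```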

  5. Costs and Benefits of Forecasting • Benefits: • Aids decision making • Informs planning and resource allocation decisions • If the data are of high quality, forecasts can be accurate

  6. Costs and Benefits of Forecasting • Costs: • Data not always reliable or accurate • Data may be out of date • The past is not always a guide to the future • Qualitative data may be influenced by peer pressure • Difficulty of coping with changes to external factors out of the business’s control – e.g. economic policy, political developments (9/11?), natural disasters – hurricanes, earthquakes, etc.

  7. Example: Oregon Personal Income Taxes, 1996 – 2001 (see Class Tools > Sitewide > Hand Outs > Public Finance > MultDecompPIT.xls)

  8. Example: Classical Multiplicative Decomposition • Conceptual decomposition: • Trend (T): long-term growth/decline • Cycle (C): long-term slow, irregular oscillation • Seasonal (S): regular, periodic variation within the calendar year • Irregular (I): short-term, erratic variation • Forecasting model: Yt = Tt × Ct × St × It
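As a concrete illustration of the decomposition, here is a minimal sketch of the classical procedure (trend by centered moving average, seasonal indexes from averaged detrended ratios) on a hypothetical quarterly series; the numbers are not the Oregon data from the example:

```python
# Sketch of classical multiplicative decomposition Y_t = T_t * S_t * I_t
# (cycle folded into the trend here), on a hypothetical quarterly series.
import numpy as np

y = np.array([732, 810, 655, 905, 806, 884, 723, 995], dtype=float)
period = 4

# Step 1: estimate the trend with a 2x4 centered moving average
# (the usual choice for an even-length season).
ma4 = np.convolve(y, np.ones(period) / period, mode="valid")
trend = (ma4[:-1] + ma4[1:]) / 2
trend_idx = np.arange(period // 2, period // 2 + len(trend))

# Step 2: detrend by division to isolate seasonal-times-irregular ratios.
ratios = y[trend_idx] / trend

# Step 3: average the ratios by quarter to get seasonal indexes,
# then normalize them so they average to 1 over the year.
seasonal = np.array([ratios[(trend_idx % period) == q].mean()
                     for q in range(period)])
seasonal *= period / seasonal.sum()

print("Seasonal indexes (Q1-Q4):", seasonal.round(3))
```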

  9. Example: Classical Multiplicative Decomposition Conceptual Decomposition:

  10. Example: Classical Multiplicative Decomposition

  11. Example: Classical Multiplicative Decomposition: Model Interpretation • Initial, time-zero (1995:Q4) level is $731.92 million • Increasing at $18.5 million per quarter • Seasonal pattern: peak in Q4, 21% over trend; trough in Q3, 11% below trend
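Putting the interpreted components together, the slide's numbers imply a fitted model of the following form; the Q4 and Q3 indexes follow from the stated 21%-over-trend and 11%-below-trend figures (indexes for Q1 and Q2 are not given on the slide):

```latex
\hat{Y}_t = \underbrace{(731.92 + 18.5\,t)}_{\text{trend, \$ million}} \times S_q ,
\qquad S_{Q4} \approx 1.21 , \quad S_{Q3} \approx 0.89
```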

  12. Example: Classical Multiplicative Decomposition: Forecasts

  13. Overview of Simple Regression Analysis

  14. Regression Analysis • Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will study. • Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). • Dependent variable: denoted Y • Independent variables: denoted X1, X2, …, Xk

  15. Correlation Analysis… • If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier. • This chapter will examine the relationship between two variables, sometimes called simple linear regression. • Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.

  16. A linear regression equation, illustrating the geometric interpretations of a and b: y = a + bx, where b = the amount of increase in y for a unit increase in x (x2 − x1 = 1) and a = the y-intercept

  17. Simple Linear Regression: correlation = r • Positive r: y increases as x increases • r = 1: a perfect positive relationship between y and x

  18. Simple Linear Regression r = -1: a perfect negative relationship between y and x Negative r: y decreases as x increases

  19. Simple Linear Regression • r near zero: little or no linear relationship between y and x

  20. A Model… • To create a probabilistic model, we start with a deterministic model that approximates the relationship we want to model, and add a random term that measures the error of the deterministic component. • Deterministic model: The cost of building a new house is about $75 per square foot, and most lots sell for about $25,000. Hence the approximate selling price (y) would be: y = $25,000 + ($75/ft²)(x), where x is the size of the house in square feet.

  21. A Model… • A model of the relationship between house size (independent variable) and house price (dependent variable) would be: House Price = 25,000 + 75(Size). Building a house costs about $75 per square foot; most lots sell for $25,000. In this model, the price of the house is completely determined by its size.

  22. A Model… • In real life, however, the house cost will vary even among houses of the same size: House Price = 25,000 + 75(Size) + ε. Same square footage, but different price points (e.g. décor options, cabinet upgrades, lot location…), giving lower vs. higher variability around the line.
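A small simulation of this probabilistic model; the error standard deviation of $15,000 is an assumed value chosen purely for illustration:

```python
# Simulating the probabilistic house-price model on this slide:
# Price = 25,000 + 75*Size + epsilon, where epsilon captures variation
# among same-size houses (decor options, cabinet upgrades, lot location...).
import numpy as np

rng = np.random.default_rng(1)

# Five houses with identical square footage...
size = np.full(5, 2000.0)
# ...differ in price only through the random error term
# (sd of $15,000 is an assumption made for this sketch).
epsilon = rng.normal(0, 15_000, size=5)
price = 25_000 + 75 * size + epsilon

print(price.round(0))  # five different prices for the same size
```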

  23. Simple Linear Regression Model… • A straight-line model with one independent variable is called a first-order linear model or a simple linear regression model. It is written as: y = β0 + β1x + ε, where y is the dependent variable, x is the independent variable, β0 is the y-intercept, β1 is the slope of the line, and ε is the error variable.

  24. Simple Linear Regression Model… • Note that both β0 and β1 are population parameters, which are usually unknown and hence estimated from the data. In the graph of the line, β1 = slope (= rise/run) and β0 = y-intercept.

  25. Which line has the best “fit” to the data?

  26. Estimating the Coefficients… • In much the same way we base estimates of μ on x̄, we estimate β0 with b0 and β1 with b1, the y-intercept and slope (respectively) of the least squares or regression line given by: ŷ = b0 + b1x. • (Recall: this is an application of the least squares method, and it produces a straight line that minimizes the sum of the squared differences between the points and the line.)

  27. Least Squares Line… • The differences between the points and the line are called residuals. The least squares line minimizes the sum of the squared residuals… but where did the line equation come from? How did we get .934 for the y-intercept and 2.114 for the slope?

  28. Slope and Correlation • Y = β0 + β1X + ε • H0: β1 = 0 versus HA: β1 ≠ 0 • H0: ρ = 0 versus HA: ρ ≠ 0, where ρ is the population correlation between X and Y.

  29. Slope and Correlation Warning: High correlation does not imply causality. If a large positive or negative value of the sample correlation coefficient r is observed, it is incorrect to conclude that a change in x causes a change in y. The only valid conclusion is that a linear trend may exist between x and y.

  30. Formulas to Use in Regression: Notation • SSxy = ∑xiyi − (∑xi)(∑yi)/n = ∑xiyi − n·x̄·ȳ (SCP is sometimes used for SSxy.) • SSxx = ∑xi² − (∑xi)²/n = ∑xi² − n·x̄² • SSyy = ∑yi² − (∑yi)²/n = ∑yi² − n·ȳ² (These numbers will always be given to you on the test – you will not have to calculate them.) • SCPxy = SSxy • SSx = SSxx • SSy = SSyy • SSyy is also called SST (total sum of squares)

  31. Formulas for the Least Squares Estimate • The values of b0 and b1 that minimize the SSE are given by the following formulas: • Slope: b1 = SSxy / SSxx • y-intercept: b0 = ȳ − b1x̄, where x̄ and ȳ are the sample means of x and y

  32. Least Squares Line… • The coefficients b1 and b0 for the least squares line ŷ = b0 + b1x are calculated as: b1 = SSxy / SSxx and b0 = ȳ − b1x̄.
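A short sketch of these coefficient formulas on hypothetical (x, y) data (the numbers are illustrative, not the textbook example), with numpy's own least squares fit as a cross-check:

```python
# Least squares slope and intercept via the SS formulas above:
# b1 = SSxy / SSxx,  b0 = ybar - b1 * xbar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([3.2, 5.1, 7.4, 9.0, 11.3])

n = len(x)
SSxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SSxx = np.sum(x**2) - np.sum(x)**2 / n

b1 = SSxy / SSxx                 # slope
b0 = y.mean() - b1 * x.mean()    # y-intercept

print(f"yhat = {b0:.3f} + {b1:.3f} x")
# Cross-check against numpy's built-in least squares fit:
assert np.allclose([b1, b0], np.polyfit(x, y, 1))
```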

  33. From Data to Information • Statistics turn data into information. For the example data points, the least squares regression line is ŷ = .934 + 2.114x.

  34. Sum of Squares for Error (SSE)… • The sum of squares for error is calculated as SSE = ∑(yi − ŷi)², and is used in the calculation of the standard error of estimate: sε = √(SSE / (n − 2)). • If sε is zero, all the points fall on the regression line.

  35. Standard Error… • If sε is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor… But what is small and what is large?

  36. Analysis of Variance • Explained sum of squares = regression sum of squares = SSR • Unexplained sum of squares = error sum of squares = SSE • Total sum of squares = SSyy = SST

  37. Analysis of Variance for Regression • An omnibus or global test of the overall contribution of the set of driver variables to the prediction of the response variable is carried out via the analysis of variance (ANOVA). A summary table for the ANOVA of regression follows:

  Source of Variation (SV) | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | Fcalc   | Fcrit
  Regression               | p                       | SSR                 | MSR              | MSR/MSE | Fα, p, n−(p+1)
  Residual                 | n − (p + 1)             | SSE                 | MSE              |         |
  Total                    | n − 1                   | SST                 |                  |         |

  where Fα, p, n−(p+1) is the value of F with p numerator df and n − (p + 1) denominator df that places α in the upper tail of the distribution.

  38. Analysis of Variance for Regression • In the ANOVA table we have the following: • SSR = Sum of Squares due to Regression • SSE = Sum of Squares due to Error or Residual • SST = Sum of Squares Total • MSR = SSR/p = Mean Square Regression • MSE = SSE/[n − (p + 1)] = Mean Square Error or Residual • The sums of squares are derived from the algebraic identity: ∑(Yi − Ȳ)² = ∑(Ŷi − Ȳ)² + ∑(Yi − Ŷi)². That is: SST = SSR + SSE, so that R² = SSR/SST represents the proportion of variation in Y that is explained by the behavior of the driver variables. R² is the coefficient of determination.
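Continuing the hypothetical data from the earlier sketch, the identity SST = SSR + SSE and the definition R² = SSR/SST can be verified numerically:

```python
# Numerical check of the identity SST = SSR + SSE and of R^2 = SSR/SST,
# using the hypothetical data and fit from the earlier sketch.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 5.1, 7.4, 9.0, 11.3])
b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)     # total sum of squares
SSR = np.sum((yhat - y.mean())**2)  # explained by the regression
SSE = np.sum((y - yhat)**2)         # residual / unexplained

assert np.isclose(SST, SSR + SSE)
print(f"R^2 = {SSR / SST:.4f}")
```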

  39. Coefficient of Determination… • Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R². • The coefficient of determination is the square of the coefficient of correlation (r), hence R² = (r)²

  40. Coefficient of Determination… • The coefficient of determination is r² = (SSyy − SSE)/SSyy = 1 − SSE/SSyy. • It represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relationship between y and x. (In simple linear regression, it may also be computed as the square of the coefficient of correlation r.)

  41. Practical Interpretation of the Coefficient of Determination, R² • About 100(R²)% of the sample variation in y (measured by the total sum of squares of deviations of the sample y values about their mean ȳ) can be explained by (or attributed to) using x to predict y in the straight-line model.

  42. Example: Data & Calculations • ∑x = 268, x̄ = 26.8, ∑x² = 7668 • ∑y = 27.73, ȳ = 2.773, ∑y² = 83.8733 (n = 10)

  43. Example: Data & Calculations • We need to calculate SSxy, SSxx, and SSyy as follows (an equivalent formula using R-square appears on page 127): • SSxy = 57.456 • SSxx = 485.6 • SSyy = 6.97801 • Then, the coefficient of correlation is:
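The final value did not survive the transcript; substituting the three sums of squares above into the standard formula for r gives:

```latex
r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}
  = \frac{57.456}{\sqrt{(485.6)(6.97801)}}
  = \frac{57.456}{58.211} \approx .987
```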

  44. Although (a) has a large slope and (b) has a small slope, both are scatter diagrams for r = 0.9

  45. Interpretation of Beta Coefficients • One thing to keep in mind is that statistical significance does not always imply practical significance. In other words, rejection of H0: β1 = 0 (statistical significance) does not mean that precise prediction (practical significance) follows. It does demonstrate to the researcher that, within the sample data at least, this particular independent variable has an association with the dependent variable.

  46. Confidence Interval for β1 • A (1 − α) × 100% confidence interval for β1 runs from b1 − tα/2, n−2 · sb1 to b1 + tα/2, n−2 · sb1. • For b1 = .354 and sb1 = .0797, and using t.05,8 = 1.860, the resulting confidence interval is .354 − (1.860)(.0797) to .354 + (1.860)(.0797) = .354 − .148 to .354 + .148 = .206 to .502

  47. Confidence Interval for β1 • So we are 90% confident that the actual slope, β1, is within .148 of the estimated slope (b1 = .354). • The large width of this interval is due in part to the lack of information (small sample size) used to derive the estimates; a larger sample would decrease the width of this confidence interval.
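A sketch reproducing the slide's interval arithmetic; scipy's t distribution supplies the critical value t.05,8 ≈ 1.860 (n = 10 follows from the n − 2 = 8 degrees of freedom):

```python
# Reproducing the slide's 90% confidence interval for the slope:
# b1 +/- t_{alpha/2, n-2} * s_b1, with b1 = .354, s_b1 = .0797, n = 10.
from scipy import stats

b1, s_b1, n = 0.354, 0.0797, 10
alpha = 0.10                                   # 90% confidence
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # approx 1.860

half_width = t_crit * s_b1                     # approx .148
print(f"({b1 - half_width:.3f}, {b1 + half_width:.3f})")  # approx (.206, .502)
```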

  48. Analysis of Residuals

  49. Geometric representation of residuals • Scatter diagram and least-squares line example

  50. Geometric representation of residuals • Unexplained deviations (from the observed points to the line): (yi − ŷi)
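A minimal sketch of the residual computation just defined, reusing the hypothetical data from the earlier sketches; plotting residuals against fitted values is the usual first visual check that the linear model is adequate:

```python
# Residuals e_i = y_i - yhat_i from the hypothetical fit used earlier;
# a pattern-free scatter around zero supports the linear model.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 5.1, 7.4, 9.0, 11.3])
b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
residuals = y - yhat

plt.scatter(yhat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```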
