
Some further problems with regression models



  1. Some further problems with regression models Regression analysis has two fundamental tasks: 1. Estimation: computing from sample data reliable estimates of the numerical values of the regression coefficients βj (j = 0, 1, …, k), and hence of the population regression function. 2. Inference: using sample estimates of the regression coefficients βj (j = 0, 1, …, k) to test hypotheses about the population values of the unknown regression coefficients -- i.e., to infer from sample estimates the true population values of the regression coefficients within specified margins of statistical error.

  2. CONFIDENCE INTERVALS FOR PARAMETERS Let the population regression model be Y = β0 + β1X1 + β2X2 + … + βkXk + ε. Let b0, b1, …, bk be the least squares estimates of the population parameters and sb0, sb1, …, sbk be the estimated standard deviations of the least squares estimators (the square roots of the diagonal elements of the variance-covariance matrix). Then, if the standard regression assumptions hold and the error terms εi are normally distributed, the random variables tbj = (bj – βj)/sbj are distributed as Student's t with (n – k – 1) degrees of freedom.

  3. Did you hear about the statistician who was thrown in jail? - He now has zero degrees of freedom. Did you hear about the statistician who got married? - He now has fewer degrees of freedom.

  4. CONFIDENCE INTERVALS FOR PARAMETERS If the regression errors εi are normally distributed and the standard regression assumptions hold, the 100(1 – α)% confidence intervals for the partial regression coefficients βj are given by bj – t(n – k – 1), α/2 · sbj < βj < bj + t(n – k – 1), α/2 · sbj, where the random variable t(n – k – 1) follows a Student's t distribution with (n – k – 1) degrees of freedom.
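A minimal Python sketch of this computation, assuming the regressor matrix X (without an intercept column) and the response y are NumPy arrays supplied by the analyst; the function and variable names are illustrative, not taken from the slides.

```python
# Hedged sketch: 100(1 - alpha)% confidence intervals for the regression
# coefficients, computed exactly as on the slide: b_j +/- t * s_bj.
import numpy as np
from scipy import stats

def coef_confidence_intervals(X, y, alpha=0.05):
    n, k = X.shape                             # n observations, k regressors
    Xd = np.column_stack([np.ones(n), X])      # add the intercept column
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ y                     # least squares estimates b0..bk
    resid = y - Xd @ b
    s2 = resid @ resid / (n - k - 1)           # residual variance Se^2
    s_b = np.sqrt(s2 * np.diag(XtX_inv))       # standard errors s_b0..s_bk
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
    return b - t_crit * s_b, b + t_crit * s_b  # lower and upper limits
```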

  5. STANDARD ERROR FOR REGRESSION PARAMETERS Example: Y – weekly salary ($); X1 – length of employment (months); X2 – age (years). Se² denotes the estimated residual variance of the model.

  6. STANDARD ERROR FOR REGRESSION PARAMETERS The standard errors are obtained from the variance-covariance matrix: on average, b0 differs from β0 by 57 units, b1 differs from β1 by 0.15 units, and b2 differs from β2 by 1.77 units.

  7. CONFIDENCE INTERVALS FOR PARAMETERS The Student's t value for the level of significance 0.05 and df = 13 is 2.160. Confidence interval for β0: the interval with lower limit $338.3508 and upper limit $585.3495 covers the unknown population value of the parameter β0 with 95% confidence.

  8. CONFIDENCE INTERVALS FOR PARAMETERS The Student's t value for the level of significance 0.05 and df = 13 is 2.160. Confidence interval for β1: the interval with lower limit $0.3526 and upper limit $0.9898 covers the unknown population value of the parameter β1 with 95% confidence.

  9. CONFIDENCE INTERVALS FOR PARAMETERS The Student's t value for the level of significance 0.05 and df = 13 is 2.160. Confidence interval for β2: the interval with lower limit –$5.2124 and upper limit $2.4456 covers the unknown population value of the parameter β2 with 95% confidence.

  10. Question: How many statisticians does it take to change a lightbulb? Answer: 1-3, alpha = 0.05

  11. TESTING ALL THE PARAMETERS OF A MODEL Consider the multiple regression model Y = β0 + β1X1 + … + βkXk + ε. To test the null hypothesis H0: β1 = β2 = … = βk = 0 against the alternative hypothesis H1: at least one βj ≠ 0 (j = 1, …, k) at a significance level α, we can use the decision rule: reject H0 if the computed Fcomp > Fα, k, n–k–1, where Fα, k, n–k–1 is the critical value of F and the computed Fk, n–k–1 follows an F distribution with numerator degrees of freedom k and denominator degrees of freedom (n – k – 1).

  12. TESTING ALL THE PARAMETERS OF A MODEL F-value computed from the sample: Fcomp = 28.448. F-value from the F tables: Fα, k, n–k–1 = F0.05, 2, 13 = 3.81. Since Fcomp > Fα, k, n–k–1 (28.448 > 3.81), we reject the null hypothesis: at least one parameter is statistically significant.
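A short Python sketch of this decision rule, assuming y and the fitted values y_hat are NumPy arrays and k is the number of regressors (all names are placeholders):

```python
# Hedged sketch of the overall F test: H0: beta_1 = ... = beta_k = 0.
import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, k, alpha=0.05):
    n = len(y)
    ss_reg = np.sum((y_hat - y.mean()) ** 2)       # regression sum of squares
    ss_err = np.sum((y - y_hat) ** 2)              # residual sum of squares
    f_comp = (ss_reg / k) / (ss_err / (n - k - 1))
    f_crit = stats.f.ppf(1 - alpha, k, n - k - 1)  # F(alpha, k, n-k-1)
    return f_comp, f_crit, f_comp > f_crit         # True -> reject H0
```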

  13. TESTING SINGLE PARAMETER OF A MODEL If the regression errors εi are normally distributed and the standard least squares assumptions hold, the following tests have significance level α. To test either null hypothesis H0: βj = βj* or H0: βj ≤ βj* against the alternative H1: βj > βj*, the decision rule is: reject H0 if (bj – βj*)/sbj > t(n – k – 1), α. To test either null hypothesis H0: βj = βj* or H0: βj ≥ βj* against the alternative H1: βj < βj*, the decision rule is: reject H0 if (bj – βj*)/sbj < –t(n – k – 1), α.

  14. TESTING SINGLE PARAMETER OF A MODEL (TESTING PARAMETERS OF A MODEL INDIVIDUALLY) To test the null hypothesis H0: βj = βj* against the two-sided alternative H1: βj ≠ βj*, the decision rule is: reject H0 if |(bj – βj*)/sbj| > t(n – k – 1), α/2.
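A sketch of the two-sided rule in Python, assuming the estimate b_j, its standard error s_bj and the sample dimensions are already available (names are illustrative):

```python
# Hedged sketch: two-sided t test of H0: beta_j = beta_j0 vs H1: beta_j != beta_j0.
from scipy import stats

def t_test_coefficient(b_j, s_bj, n, k, beta_j0=0.0, alpha=0.05):
    t_comp = (b_j - beta_j0) / s_bj                    # computed t statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)  # critical value
    return t_comp, t_crit, abs(t_comp) > t_crit        # True -> reject H0
```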

  15. TESTING SINGLE PARAMETER OF A MODEL (TESTING PARAMETERS OF A MODEL INDIVIDUALLY) Example: Y – weekly salary ($); X1 – length of employment (months); X2 – age (years). The null hypothesis can better be stated as: the independent variable Xj does not contribute to the prediction of Y, given that the other independent variables have already been included in the model. The alternative hypothesis can better be stated as: the independent variable Xj does contribute to the prediction of Y, given that the other independent variables have already been included in the model.

  16. TESTING SINGLE PARAMETER OF A MODEL (TESTING PARAMETERS OF A MODEL INDIVIDUALLY) H0: The length of employment (X1) does not contribute to the prediction of weekly salary (Y), given that age (X2) has already been included in the model. H1: The length of employment (X1) does contribute to the prediction of weekly salary (Y), given that age (X2) has already been included in the model. The t-value computed from the sample exceeds, in absolute value, the critical t-value from the Student's t tables, 2.16, so we reject the null hypothesis: the length of employment (X1) does contribute to the prediction of weekly salary (Y), given that age (X2) has already been included in the model.

  17. TESTING SINGLE PARAMETER OF A MODEL (TESTING PARAMETERS OF A MODEL INDIVIDUALLY) H0: Age (X2) does not contribute to the prediction of weekly salary (Y), given that the length of employment (X1) has already been included in the model. H1: Age (X2) does contribute to the prediction of weekly salary (Y), given that the length of employment (X1) has already been included in the model. The t-value computed from the sample does not exceed, in absolute value, the critical t-value from the Student's t tables, 2.16, so we do not reject the null hypothesis: age (X2) does not contribute to the prediction of weekly salary (Y), given that the length of employment (X1) has already been included in the model.

  18. LINEARITY Linearity is essential for the calculation of multivariate statistics because they are based on the general linear model, and the assumption of multivariate normality implies linearity between all pairs of variables; significance tests rest on that assumption. Non-linearity may be diagnosed from bivariate scatterplots between pairs of variables or from a residual plot of the predicted values of the dependent variable against the residuals. Residual plots may reveal failure of normality, non-linearity, and heteroscedasticity. Linearity between two variables may be assessed by inspecting bivariate scatterplots: when both variables are normally distributed and linearly related, the scatterplot is oval shaped; if one of the variables is non-normal, the scatterplot is not oval.

  19. LINEARITY Linearity is the assumption that there is a straight line relationship between variables. Expected distribution of residuals for a linear model with normal distribution of residuals (errors).
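A minimal sketch of the residual plot described above, assuming the fitted values and observed values come from an already estimated model (names are illustrative):

```python
# Hedged sketch: residuals against fitted values; a patternless horizontal
# band around zero is what we expect from a correctly specified linear model.
import matplotlib.pyplot as plt

def residual_plot(y, y_hat):
    plt.scatter(y_hat, y - y_hat, s=15)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs fitted values")
    plt.show()
```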

  20. LINEARITY Examples of non-linear distributions:

  21. LINEARITY Runs test for the form of the model. Label the positive residuals "a" and the negative residuals "b", and count the number of positive (n1) and negative (n2) residuals. A run is a group of consecutive residuals of the same sign ("a" or "b"). Count the number of runs in the sample; call it S. From the runs test tables, read S1 and S2 for the chosen level of significance and for n1 and n2. Decision rule: reject H0 if S ≤ S1 or S ≥ S2 (the linear form of the model is not appropriate); do not reject H0 if S1 < S < S2 (the form of the model is appropriate).
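A sketch of the sign-counting part of this test in Python. The tabulated bounds S1 and S2 are not reproduced on the slides, so the example falls back on the large-sample normal approximation to the number of runs; that substitution is my assumption, not the tables used in the course.

```python
# Hedged sketch of the runs test on residual signs (normal approximation).
import numpy as np
from scipy import stats

def runs_test(residuals, alpha=0.05):
    signs = np.sign(residuals)
    signs = signs[signs != 0]                      # drop exactly-zero residuals
    n1 = int(np.sum(signs > 0))                    # number of positive residuals
    n2 = int(np.sum(signs < 0))                    # number of negative residuals
    s = 1 + int(np.sum(signs[1:] != signs[:-1]))   # number of runs S
    # Normal approximation used in place of the S1/S2 tables:
    mean_s = 2 * n1 * n2 / (n1 + n2) + 1
    var_s = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (s - mean_s) / np.sqrt(var_s)
    reject = abs(z) > stats.norm.ppf(1 - alpha / 2)  # True -> linear form not proper
    return s, z, reject
```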

  22. LINEARITY

  23. NORMALITY

  24. NORMALITY The underlying assumption of most multivariate analyses and statistical tests is multivariate normality: the assumption that all variables, and all combinations of the variables, are normally distributed. When the assumption is met, the residuals are normally distributed and independent, the differences between predicted and obtained scores (the errors) are symmetrically distributed around a mean of zero, and there is no pattern to the errors. Screening for normality may be undertaken by either statistical or graphical methods.


  26. NORMALITY

  27. NORMALITY If non-normality is found in the residuals or in the variables themselves, a transformation may be considered. Transformations are recommended as a remedy for outliers, breaches of normality, non-linearity, and lack of homoscedasticity. Although recommended, be aware that they change the data, and this change must be taken into account when interpreting the results. Remember to check the transformed variable for normality after applying the transformation.

  28. NORMALITY How to assess and deal with problems, statistically: • examine skewness and kurtosis. When a distribution is normal, both skewness and kurtosis are zero. Kurtosis describes the peakedness of a distribution (either too peaked or too flat); skewness describes the symmetry of the distribution and the location of its mean: a skewed variable is one whose mean is not in the center of the distribution. Tests of significance for skewness and kurtosis test the obtained value against a null hypothesis of zero. Although normality of all linear combinations is desirable to ensure multivariate normality, it is often not testable; therefore, normality assessed through the skewness and kurtosis of individual variables may indicate which variables require transformation.
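A small Python sketch of this statistical screen, using SciPy's skewness and kurtosis tests against the null value of zero; the residuals array is a placeholder for the model residuals.

```python
# Hedged sketch: skewness and excess kurtosis of the residuals, each tested
# against the null hypothesis that its population value is zero.
from scipy import stats

def normality_screen(residuals):
    skew = stats.skew(residuals)
    kurt = stats.kurtosis(residuals)          # excess kurtosis (0 for a normal)
    _, p_skew = stats.skewtest(residuals)     # H0: skewness = 0
    _, p_kurt = stats.kurtosistest(residuals) # H0: excess kurtosis = 0
    return {"skewness": skew, "p_skew": p_skew,
            "kurtosis": kurt, "p_kurt": p_kurt}
```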

  29. NORMALITY How to assess and deal with problems: Graphically: • view distributions of the data. • examine the residuals, plot expected  values vs. obtained scores (predicted vs. actual).

  30. NORMALITY • view probability plots, in which scores are ranked and sorted and an expected normal value is compared with the actual value for each case. If a distribution is normal, the points for all cases fall along the line running diagonally from lower left to upper right; deviations from normality shift the points away from the diagonal.
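A one-function way to draw such a plot with SciPy, as a sketch; the residuals array is assumed to come from the fitted model.

```python
# Hedged sketch: normal probability (Q-Q) plot of the residuals.
import matplotlib.pyplot as plt
from scipy import stats

def normal_probability_plot(residuals):
    stats.probplot(residuals, dist="norm", plot=plt)  # points should hug the diagonal
    plt.title("Normal probability plot of residuals")
    plt.show()
```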

  31. NORMALITY Jarque-Bera test – for normality of residuals. Hypotheses: H0: the residuals are normally distributed; H1: the residuals are not normally distributed. The statistic computed from the sample is JB = n·[S²/6 + (K – 3)²/24], where n is the sample size, S is the skewness and K the kurtosis of the residuals. Reject H0 if JB > χ²α with 2 degrees of freedom.
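A direct implementation of the statistic as a sketch; residuals is a placeholder for the model residuals.

```python
# Hedged sketch of the Jarque-Bera test for normality of residuals.
import numpy as np
from scipy import stats

def jarque_bera(residuals, alpha=0.05):
    e = np.asarray(residuals, dtype=float)
    n = len(e)
    e = e - e.mean()
    s2 = np.mean(e ** 2)
    skew = np.mean(e ** 3) / s2 ** 1.5
    kurt = np.mean(e ** 4) / s2 ** 2                   # raw kurtosis (3 for a normal)
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    chi2_crit = stats.chi2.ppf(1 - alpha, df=2)        # chi-squared, 2 df
    return jb, chi2_crit, jb > chi2_crit               # True -> reject normality
```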

  32. NORMALITY

  33. HETEROSCEDASTICITY Another long word (it means "different variabilities"). Regression assumes that the scatter of the points about the regression line is the same for all values of each independent variable. Quite often, the spread will increase steadily as one of the independent variables increases, so we get a fan-like scattergram if we plot the dependent variable against that independent variable. Another way of detecting heteroscedasticity (and also outlier problems) is to plot the residuals against the fitted values of the dependent variable. We may be able to deal with this by transforming one or more variables.

  34. HETEROSCEDASTICITY Our residuals need to be homoscedastic (of equal variability)!

  35. HETEROSCEDASTICITY The appearance of heteroscedastic errors can also result if a linear regression model is estimated in circumstances where a non-linear model is appropriate. When the process is such that a non-linear model is appropriate, we should transform the data and estimate a non-linear model. Taking logarithms will dampen the influence of large observations, especially if the large observations result from percentage growth from previous states – an exponential growth pattern. The resulting model will often appear to be free from heteroscedasticity. Non-linear models are often appropriate when the data under study are time series of economic variables, such as consumption, income, and money, that tend to grow exponentially over time.
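A minimal sketch of the log-transformation idea, assuming y grows roughly exponentially and is strictly positive; the variable names are placeholders and the fit is ordinary least squares on log(y).

```python
# Hedged sketch: estimate the model on log(y) instead of y, which often
# removes the fan-shaped (heteroscedastic) residual pattern of growth data.
import numpy as np

def fit_log_model(X, y):
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])               # intercept + regressors
    b, *_ = np.linalg.lstsq(Xd, np.log(y), rcond=None)  # OLS on the log scale
    return b                                             # coefficients of the log model
```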

  36. HETEROSCEDASTICITY and NORMALITY

  37. HOMOSCEDASTICITY Harrison-McCabe test – for models with normally distributed residuals. Hypotheses: H0: the residuals are homoscedastic (have equal variances); H1: the residuals are heteroscedastic. The statistic computed from the sample is b = (Σ ei² for i = 1, …, m) / (Σ ei² for i = 1, …, n), where: b – the Harrison-McCabe statistic; ei – the residuals; n – the sample size; m – the number of the splitting observation (1 < m < n).

  38. HOMOSCEDASTICITY The observation numbered m should be chosen in the following way: • m = n/2 if n is an even number, or m = (n – 1)/2 if n is an odd number, when there is no visible tendency in the residual variability; • if the absolute residual values are increasing (or decreasing) and then decreasing (or increasing), choose for m the observation with the highest (or lowest) absolute residual value. We must also remember that • m > k + 1 and • n – m > k + 1, where k is the number of estimators.

  39. HOMOSCEDASTICITY The next step is to find the F values at the chosen level of significance α and the following numbers of degrees of freedom: • F1 with r1 = n – m and r2 = m – (k + 1) df; • F2 with r1 = n – m – (k + 1) and r2 = m df. The critical values of the Harrison-McCabe test are then b1* = 1 / [1 + ((n – m)/(m – (k + 1)))·F1] and b2* = 1 / [1 + ((n – m – (k + 1))/m)·F2].

  40. HOMOSCEDASTICITY Decision rule: we reject the null hypothesis if b ≤ b1* (the residuals are heteroscedastic); we do not reject the null hypothesis if b ≥ b2* (the residuals are homoscedastic); we cannot make any decision (the test is inconclusive) if b1* < b < b2*.
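A sketch putting slides 37-40 together. Note that the bound formulas b1* and b2* below follow the reconstruction given above, which is based on the standard bounded form of the Harrison-McCabe test rather than on equations quoted from the slides; all names are illustrative.

```python
# Hedged sketch of the Harrison-McCabe test (bounds reconstructed, see lead-in).
import numpy as np
from scipy import stats

def harrison_mccabe(residuals, k, m=None, alpha=0.05):
    e = np.asarray(residuals, dtype=float)
    n = len(e)
    if m is None:                                     # default split point
        m = n // 2 if n % 2 == 0 else (n - 1) // 2
    b = np.sum(e[:m] ** 2) / np.sum(e ** 2)           # Harrison-McCabe statistic
    f1 = stats.f.ppf(1 - alpha, n - m, m - (k + 1))
    f2 = stats.f.ppf(1 - alpha, n - m - (k + 1), m)
    b1 = 1.0 / (1.0 + (n - m) / (m - (k + 1)) * f1)   # reconstructed bound b1*
    b2 = 1.0 / (1.0 + (n - m - (k + 1)) / m * f2)     # reconstructed bound b2*
    if b <= b1:
        decision = "reject H0: heteroscedastic"
    elif b >= b2:
        decision = "do not reject H0: homoscedastic"
    else:
        decision = "inconclusive"
    return b, b1, b2, decision
```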

  41. AUTOCORRELATION In this section we examine the effects on the regression model when the error terms are correlated with one another. Up to this point we have assumed that the random errors of our model are independent. However, many business and economic problems use time series data. When time series data are analyzed, the error term represents the effect of all factors, other than the independent variables, that influence the dependent variable. In time series data the behavior of many of these factors may be quite similar over several time periods, and the result is correlation between error terms that are close together in time.

  42. AUTOCORRELATION

  43. AUTOCORRELATION The classical regression model includes an assumption about the independence of the disturbances from observation to observation: E(eiej) = 0 for i ≠ j [the variance-covariance matrix is diagonal]. If this assumption is violated, the errors in one time period are correlated with their own values in other periods, and we have the problem of autocorrelation (also sometimes referred to as serial correlation; strictly speaking, autocorrelated errors or disturbances). All time series variables can exhibit autocorrelation, with the values in a given period depending on values of the same series in previous periods, but the problem of autocorrelation concerns such dependence in the disturbances.

  44. AUTOCORRELATION First order autocorrelation. In its simplest form, the errors or disturbances in one period are related to those in the previous period by a simple first-order autoregressive process: et = ρ·et−1 + νt, with −1 < ρ < 1. If ρ > 0 we have positive autocorrelation, with each error arising as a proportion of last period's error plus a random shock (or innovation). If ρ < 0 we have negative autocorrelation.

  45. AUTOCORRELATION We shall begin by assuming that any autocorrelation in the errors follows such a first-order process; we would say that e is AR(1). Later, however, we must consider the possibility that the errors follow an AR(p) process with p > 1, i.e. et = ρ1·et−1 + ρ2·et−2 + … + ρp·et−p + νt.
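A quick sketch of estimating the first-order coefficient ρ from the residuals; it assumes the residuals array is ordered in time, and the estimator is the simple no-intercept regression of each residual on the previous one.

```python
# Hedged sketch: estimate rho in e_t = rho * e_{t-1} + v_t.
import numpy as np

def estimate_rho(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)  # estimated rho
```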

  46. AUTOCORRELATION The sources of autocorrelation Each of the following types of misspecification can result in autocorrelated disturbances: • incorrect functional form • inappropriate time periods • inappropriately "filtered" data (seasonal adjustment)

  47. AUTOCORRELATION The consequences of autocorrelation: the variances of the parameter estimates will be affected, and consequently the standard errors of the parameter estimators and the t-values will also be affected. The variance of the error term (Se²) will be underestimated, so that R squared will be exaggerated. The F-test formulae will also be incorrect.

  48. AUTOCORRELATION Detecting autocorrelation - graphical method
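A sketch of one such graphical check, plotting each residual against the previous one; the residuals array is assumed to be ordered in time and the names are illustrative.

```python
# Hedged sketch: residuals against their own lagged values; an upward trend
# suggests positive autocorrelation, a downward trend negative autocorrelation.
import numpy as np
import matplotlib.pyplot as plt

def plot_lagged_residuals(residuals):
    e = np.asarray(residuals, dtype=float)
    plt.scatter(e[:-1], e[1:], s=15)
    plt.xlabel("e(t-1)")
    plt.ylabel("e(t)")
    plt.title("Residuals vs lagged residuals")
    plt.show()
```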
