Linear Models


  1. Linear Models Alan Lee Sample presentation for STATS 760

  2. Contents • The problem • Typical data • Exploratory Analysis • The Model • Estimation and testing • Diagnostics • Software • A Worked Example

  3. The Problem • To model the relationship between a continuous variable Y and several explanatory variables x1,… xk. • Given values of x1,… xk , predict the value of Y.

  4. Typical Data • Data on 5000 motor vehicle insurance policies having at least one claim • Variables are • Y: log(amount of claim) • x1: sex of policy holder • x2: age of policy holder • x3: age of car • x4: car type (1-20 score, 1=Toyota Corolla, 20 = Porsche)

  5. Exploratory Analysis • Plot Y against other variables • Scatterplot matrix • Smooth as necessary
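
The exploratory steps on this slide could be sketched in R roughly as follows. The data are simulated stand-ins (the variable names `log_claim`, `car_age` and `holder_age` are hypothetical, not the insurance data set from the slides), and `lowess` is just one possible smoother.

```r
# Simulated stand-in data (hypothetical; not the insurance data from the slides)
set.seed(1)
n <- 200
car_age    <- runif(n, 0, 20)
holder_age <- runif(n, 18, 80)
log_claim  <- 6 + 0.1 * car_age - 0.008 * car_age^2 +
  0.004 * holder_age + rnorm(n)
dat <- data.frame(log_claim, car_age, holder_age)

# Scatterplot matrix of the response and the covariates
pairs(dat)

# Response against one covariate, with a lowess smooth overlaid
plot(dat$car_age, dat$log_claim, xlab = "car age", ylab = "log claim")
sm <- lowess(dat$car_age, dat$log_claim)
lines(sm, col = "red")
```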

  6. Log claims vs car age

  7. The Model • Relationship is modelled using the conditional distribution of Y given the covariates x1,…, xk. • Assume the conditional distribution of Y is N(μ, σ²), where the mean μ depends on the covariates.

  8. The Model (2) • If all covariates are “continuous”, then μ = β0 + β1x1 + ... + βkxk, i.e. Y = β0 + β1x1 + ... + βkxk + ε with ε ~ N(0, σ²) • In addition, all Y’s are assumed independent.

  9. Estimation and Testing • Estimate the β’s • Estimate the error variance σ² • Test whether the β’s = 0 • Check goodness-of-fit

  10. Least Squares • Estimate the β’s by the values that minimize the sum of squares Σi (yi − β0 − β1xi1 − ... − βkxik)² (least squares estimates, LSE’s) • The minimizing values are the solution of the Normal Equations • The minimum value is the residual sum of squares (RSS) • σ² is estimated by RSS/(n−k−1)
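
As a minimal sketch of this slide, the normal equations (X'X)b = X'y can be solved directly and compared with `lm`. The data below are simulated for illustration only, and `lm` itself uses a QR decomposition rather than the normal equations, so the agreement is numerical, not a statement about how `lm` is implemented.

```r
# Simulated data (hypothetical), with k = 2 covariates
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)      # design matrix, first column is the intercept
k <- ncol(X) - 1

# Solve the normal equations (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residual sum of squares and the estimate of the error variance
rss        <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- rss / (n - k - 1)

# Compare with lm(); both comparisons are expected to agree
# up to numerical tolerance
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
all.equal(sigma2_hat, summary(fit)$sigma^2)
```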

  11. Goodness of Fit • Goodness of fit is measured by R²: 0 ≤ R² ≤ 1 (why?) • R² = 1 iff perfect fit (data all on a plane)
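
One way to answer the "why?" on this slide, assuming the usual definition of R² for a model that includes an intercept (TSS, the total sum of squares about the mean, is notation introduced here and does not appear in the slides):

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 .

% The intercept-only fit (\beta_1 = \dots = \beta_k = 0, \beta_0 = \bar{y})
% is one candidate in the least squares minimization, so the minimum
% cannot exceed its sum of squares:
0 \le \mathrm{RSS} \le \mathrm{TSS}
\quad\Longrightarrow\quad
0 \le R^2 \le 1 .

% Perfect fit:
R^2 = 1 \iff \mathrm{RSS} = 0 \iff y_i = \hat{y}_i \ \text{for all } i .
```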

  12. Prediction • Y is predicted by Ŷ = β̂0 + β̂1x1 + ... + β̂kxk, where the hat indicates the LSE • Standard errors: 2 kinds, one for the mean value of Y for a set of x’s, the other for an individual y for a particular set of x’s
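
The two kinds of standard error correspond to the two interval types offered by `predict` for `lm` fits. A short sketch on simulated data (the data and the names `new`, `mean_int` and `indiv_int` are illustrative assumptions):

```r
# Simulated data (hypothetical)
set.seed(3)
n  <- 100
x1 <- runif(n, 0, 10)
y  <- 2 + 0.5 * x1 + rnorm(n)
fit <- lm(y ~ x1)

# Covariate values at which to predict
new <- data.frame(x1 = c(2, 5, 8))

# Interval for the mean value of Y at these x's
mean_int  <- predict(fit, newdata = new, interval = "confidence")

# Interval for an individual y at these x's (always wider,
# because it also includes the error variance)
indiv_int <- predict(fit, newdata = new, interval = "prediction")

mean_int
indiv_int
```

Both calls return the same point predictions in the `fit` column; only the `lwr` and `upr` columns differ.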

  13. Interpretation of Coefficients • The LSE of the coefficient of variable xj is the amount we expect y to increase if xj is increased by a unit amount, assuming all the other x’s are held fixed • The test of βj = 0 tests whether variable j makes no contribution to the fit, given that all the other variables are in the model

  14. Checking Assumptions (1) • Tools are residuals, fitted values and hat matrix diagonals • Fitted values • Residuals • Hat matrix diagonals (Measure the effect of an observation on its fitted value)
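
The three tools can be computed by hand and checked against R's extractor functions. This sketch assumes the standard definitions (hat matrix H = X(X'X)⁻¹X', fitted values Hy, residuals y minus fitted values) and uses simulated data.

```r
# Simulated data (hypothetical)
set.seed(4)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                    # design matrix used by lm
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix

fitted_by_hand <- as.numeric(H %*% y)     # fitted values
resid_by_hand  <- y - fitted_by_hand      # residuals
h              <- diag(H)                 # hat matrix diagonals

# Each comparison is expected to agree up to numerical tolerance
all.equal(fitted_by_hand, as.numeric(fitted(fit)))
all.equal(resid_by_hand, as.numeric(residuals(fit)))
all.equal(as.numeric(h), as.numeric(hatvalues(fit)))

# The hat matrix diagonals sum to the number of coefficients (k + 1)
sum(h)
```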

  15. Checking Assumptions (2) Assumptions are • Mean linear in the x’s (plot residuals vs fitted values, partial residual plot, CERES plots) • Constant variance (plot squared residuals vs fitted values) • Independence (time series plot, residuals vs preceding residuals) • Normality/outliers (normal plot)
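
Most of the plots listed on this slide can be drawn with base R. The sketch below uses simulated data and treats the row order as the time order for the independence check, which is an assumption; CERES plots are not in base R and are left out.

```r
# Simulated data (hypothetical); row order is treated as time order
set.seed(5)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

r <- residuals(fit)
f <- fitted(fit)

# Linearity: residuals vs fitted values
plot(f, r, xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)

# Linearity: partial residual plot for x1
pr <- residuals(fit, type = "partial")   # one column per model term
plot(x1, pr[, "x1"], ylab = "partial residual for x1")

# Constant variance: squared residuals vs fitted values, with a smooth
plot(f, r^2, xlab = "fitted", ylab = "squared residual")
lines(lowess(f, r^2), col = "red")

# Independence: each residual against the preceding one
plot(r[-n], r[-1], xlab = "preceding residual", ylab = "residual")

# Normality / outliers: normal plot
qqnorm(r)
qqline(r)
```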

  16. Remedial Action • Transform variables • Delete outliers • Weighted least squares
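
As an illustration of one of these remedies, weighted least squares can be fitted through the `weights` argument of `lm`. The sketch assumes the error standard deviation is proportional to x, so that weights 1/x² are appropriate; in practice the form of the variance has to be chosen from the diagnostics.

```r
# Simulated data (hypothetical) with error sd proportional to x
set.seed(6)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

# Ordinary least squares ignores the non-constant variance
ols <- lm(y ~ x)

# Weighted least squares: weights inversely proportional to the
# (assumed) error variance
wls <- lm(y ~ x, weights = 1 / x^2)

coef(summary(ols))
coef(summary(wls))
```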

  17. Software • SAS: PROC REG, PROC GLM • S-Plus, R: lm • Usage: lm(model formula, dataframe, weights,…)

  18. Model Formula • Assume k=3 • If x1, x2, x3 are all continuous, fit a plane: Y ~ x1 + x2 + x3 • If x1 is categorical (e.g. gender) and x2, x3 are continuous, fit a different plane/curve in x2, x3 for each level of x1: Y ~ x1 + x2 + x3 (planes parallel) or Y ~ x1 + x2 + x3 + x1:x2 + x1:x3 (planes different)
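
The two formulas for a categorical x1 can be compared directly. In this sketch the data are simulated with a slope in x3 that really does differ between the two levels of x1; the model names `parallel` and `different` are chosen here to match the slide's wording.

```r
# Simulated data (hypothetical): x1 categorical, x2 and x3 continuous
set.seed(7)
n  <- 150
x1 <- factor(sample(c("F", "M"), n, replace = TRUE))
x2 <- rnorm(n)
x3 <- rnorm(n)
Y  <- 1 + 0.5 * (x1 == "M") + x2 +
  ifelse(x1 == "M", 2, 0.5) * x3 + rnorm(n)

# Parallel planes: common slopes, separate intercept for each level of x1
parallel  <- lm(Y ~ x1 + x2 + x3)

# Different planes: slopes in x2 and x3 may also vary with x1
different <- lm(Y ~ x1 + x2 + x3 + x1:x2 + x1:x3)

# F test of the parallel model against the different-planes model
anova(parallel, different)
```

The second formula could also be written as `Y ~ x1 * (x2 + x3)`, which expands to the same terms.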

  19. Insurance Example (1)
      > cars.lm <- lm(logad ~ poly(CARAGE, 2) + PRIMAGEN + gender)
      > summary(cars.lm)

      Call:
      lm(formula = logad ~ poly(CARAGE, 2) + PRIMAGEN + gender)

      Residuals:
          Min      1Q  Median      3Q     Max
      -3.9713 -0.4610  0.2376  0.8092  3.9767

      Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
      (Intercept)        5.986329   0.077533  77.210  < 2e-16 ***
      poly(CARAGE, 2)1  -7.308946   1.229095  -5.947 2.92e-09 ***
      poly(CARAGE, 2)2  -8.038865   1.232416  -6.523 7.58e-11 ***
      PRIMAGEN           0.004014   0.001339   2.999  0.00272 **
      gender             0.015633   0.041474   0.377  0.70624
      ---
      Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

      Residual standard error: 1.226 on 4995 degrees of freedom
      Multiple R-Squared: 0.01611, Adjusted R-squared: 0.01532
      F-statistic: 20.45 on 4 and 4995 DF, p-value: < 2.2e-16

  20. Insurance Example (2) > plot(cars.lm)
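
Calling `plot` on an `lm` object draws a sequence of diagnostic plots. A common idiom, sketched here on simulated data because the insurance data are not available, is to place them on a single page:

```r
# Simulated data (hypothetical stand-in for the insurance data)
set.seed(8)
n <- 200
x <- runif(n, 0, 20)
y <- 6 + 0.1 * x - 0.008 * x^2 + rnorm(n)
fit <- lm(y ~ poly(x, 2))

op <- par(mfrow = c(2, 2))   # 2 x 2 grid so all default plots fit on one page
plot(fit)
par(op)                      # restore the previous graphics settings
```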
