Session 3: Some Data Operations



  1. Session 3: Some Data Operations
  The Swiss Credit Card Data: Bagging
  The Stormer Viscometer Data: Non-linear Regression, Bootstrap Confidence Intervals

  2. Swiss Credit Card Data
  • Main response: does this customer own a credit card?
  • Predictors: various personal details (age, sex, language, ...) and account-use behaviour
  • Problem: build a predictor for credit card ownership
  • Strategies:
    • Logistic regression with stepwise variable selection, penalized by BIC
    • Trees
    • Trees with discrete bagging
    • Trees with smooth bagging

  3. Split data into "train" and "test" sets

  SwissCC <- read.csv("SwissCC.csv")
  ### set seed
  set.seed(300424)
  k <- nrow(SwissCC)
  isp <- sample(k, floor(0.4*k))
  CCTrain <- SwissCC[isp, ]
  CCTest <- SwissCC[-isp, ]
  require(ASOP)
  Save(SwissCC, CCTrain, CCTest, isp)

  4. Stepwise Logistic Regression

  ### try a logistic regression
  CCLR <- glm(credit_card_owner ~ 1, binomial, CCTrain)
  ### construct the upper scope as a formula
  m <- match("credit_card_owner", names(CCTrain))
  form <- as.formula(paste("~", paste(names(CCTrain)[-m], collapse = "+")))
  require(MASS)
  sCCLR <- stepAIC(CCLR, scope = list(lower = ~1, upper = form),
                   trace = FALSE, k = log(nrow(CCTrain)))  # k = log(n) gives the BIC penalty

  5. Trees

  ##### trees
  require(rpart)
  CCTM <- rpart(credit_card_owner ~ ., CCTrain)
  plotcp(CCTM)   # other checks possibly...

  ### constructing the formula explicitly
  m <- match("credit_card_owner", names(SwissCC))
  others <- names(SwissCC)[-m]
  plusGroup <- function(nam)
      if(length(nam) == 1) as.name(nam) else
          call("+", Recall(nam[-length(nam)]), as.name(nam[length(nam)]))
  form <- call("~", as.name("credit_card_owner"), plusGroup(others))

  6. Bagging
  • Two versions, 'discrete' and 'bayesian' ('smooth')
  • Discrete bagging: construct a forest of trees got from ordinary bootstrap samples of the training data
  • Bayesian bootstrap bagging: construct a forest of trees got by randomly weighting the entries of the training data, with Gamma(1,1) (i.e. exponential) weights
  • Predict the class for each tree in the forest
  • Declare the simple majority vote the predictor

  7. Bagged trees

  #### bagged trees
  baggedRPart <- function(object, style = c("discrete", "bayesian"),
                          data = eval.parent(object$call$data),
                          nbags = 100, seed = 130244) {
      style <- match.arg(style)
      set.seed(seed)
      k <- nrow(data)
      bag <- structure(list(), randomSeed = seed)
      switch(style,
             discrete = for(i in 1:nbags)
                 bag[[i]] <- update(object, data = data[sample(k, rep = TRUE), ]),
             bayesian = for(i in 1:nbags)
                 bag[[i]] <- update(object, weights = rexp(k)))
      oldClass(bag) <- "baggedRPart"
      bag
  }

  8. Notes
  • Would work with any model object that allows updating with new data sets or new weights
  • Examples: nls(), tree(), glm(), ...
  • A more general function would return an object with some information on the class of the primary object in the inheritance train, e.g.
    oldClass(val) <- c("bagged", paste("bagged", class(object), sep = "."))
  • The package 'ipred' has a more general bagging() function that constructs the initial object itself (see the sketch below)
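
  For comparison, a minimal sketch of the same fit via 'ipred' (bagging(), nbagg and coob are that package's interface; the call assumes credit_card_owner is a factor in CCTrain):

  require(ipred)
  ## 100 bagged trees built directly from the formula;
  ## coob = TRUE asks for an out-of-bag error estimate
  CCbag <- bagging(credit_card_owner ~ ., data = CCTrain,
                   nbagg = 100, coob = TRUE)
  pIpred <- predict(CCbag, CCTest)   # majority-vote class predictions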

  9. Prediction from the forest of bags

  predict.baggedRPart <- function(object, newdata, ...) {
      preds <- lapply(object, predict, newdata, type = "class")
      preds <- do.call("cbind", lapply(preds, as.character))
      preds <- apply(preds, 1, function(x) {
          tx <- table(x)
          names(tx)[which(tx == max(tx))[1]]   # majority vote; ties go to the first level
      })
      names(preds) <- rownames(newdata)   # label the votes by observation
      preds
  }

  10. Notes
  • For a more general function, the predict method would be dispatched on the inherited class, e.g. bagged.rpart, bagged.tree, bagged.glm, &c (see the sketch below)
  • With glm, there may be no advantage in bagging if the same model is used every time
  • May need to bag the variable selection as well, and the computation could become very expensive!
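
  A minimal sketch of what such a method might look like, assuming the constructor tags its result with class "bagged.rpart" as suggested on slide 8 (the method name and body here are hypothetical, just re-doing the majority vote):

  ## hypothetical method for objects of class "bagged.rpart"
  predict.bagged.rpart <- function(object, newdata, ...) {
      ## one column of class predictions per tree in the forest
      votes <- sapply(object, function(fit)
          as.character(predict(fit, newdata, type = "class")))
      ## simple majority vote, row by row
      apply(votes, 1, function(x) names(which.max(table(x))))
  }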

  11. Recording the results and checking

  ############# some tests
  bdCCTM <- baggedRPart(CCTM)
  bbCCTM <- baggedRPart(CCTM, style = "b")   # "b" partially matches "bayesian"
  CCTest <- transform(CCTest,
      psCCLR = ifelse(predict(sCCLR, CCTest, type = "response") > 0.5, "yes", "no"),
      pCCTM = predict(CCTM, CCTest, type = "class"),
      pbdCCTM = predict(bdCCTM, CCTest),
      pbbCCTM = predict(bbCCTM, CCTest))

  12. Confusion, confusion, confusion ...

  ### look at the confusion matrices
  confusion <- function(f1, f2) {
      tab <- table(f1, f2)
      names(dimnames(tab)) <- c(deparse(substitute(f1)), deparse(substitute(f2)))
      oldClass(tab) <- "confusion"
      tab
  }
  errorRate <- function(confusion)
      1 - sum(diag(confusion))/sum(confusion)
  print.confusion <- function(x, ...) {
      x <- unclass(x)
      NextMethod("print", x)
      cat("Error rate: ", format(100*round(errorRate(x), 4)), "%\n")
      invisible(x)
  }

  13. The final showdown

  with(CCTest, {
      print(confusion(credit_card_owner, psCCLR))
      print(confusion(credit_card_owner, pCCTM))
      print(confusion(credit_card_owner, pbdCCTM))
      print(confusion(credit_card_owner, pbbCCTM))
      invisible()
  })

  14.
                     psCCLR
  credit_card_owner   no  yes
                No   484   96
                Yes   67  604
  Error rate: 13.03 %

                     pCCTM
  credit_card_owner   No  Yes
                No   455  125
                Yes   45  626
  Error rate: 13.59 %

                     pbdCCTM
  credit_card_owner   No  Yes
                No   478  102
                Yes   47  624
  Error rate: 11.91 %

                     pbbCCTM
  credit_card_owner   No  Yes
                No   478  102
                Yes   51  620
  Error rate: 12.23 %

  15. Comments
  • Not a very convincing demonstration, but bagging usually does show some improvement
  • With a continuous response, the predictions are averaged over the forest rather than put to a vote
  • Bagging has been shown to benefit mainly models with substantial non-linearity, so it will do little for near-linear models such as glm
  • randomForest is a package that develops the bagging idea further (see the sketch below)
  • gbm is a gradient-boosting package that uses a more systematic approach
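
  As a pointer, a minimal randomForest sketch on the same train/test split (assuming credit_card_owner is a factor; by default the package adds random predictor selection on top of bagging):

  require(randomForest)
  CCRF <- randomForest(credit_card_owner ~ ., data = CCTrain)
  CCRF                           # prints the out-of-bag confusion matrix
  pRF <- predict(CCRF, CCTest)   # class predictions on the test set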

  16. Non-linear regression: self-starting models
  • Stormer data model: Time = b * Viscosity / (Wt - g)
  • 'b' and 'g' are unknown parameters
  • Initial values: rearranging gives Wt * Time = b * Viscosity + g * Time, a linear model that can be fitted by least squares
  • Statistical nonsense, but worth using as a starting procedure

  17. An initial look at the model

  library(MASS)
  b <- coef(lm(Wt*Time ~ Viscosity + Time - 1, stormer))
  names(b) <- c("beta", "theta")
  b
       beta     theta
  28.875541  2.843728
  storm.1 <- nls(Time ~ beta*Viscosity/(Wt - theta), stormer,
                 start = b, trace = T)
  885.3645 : 28.875541 2.843728
  825.1098 : 29.393464 2.233276
  825.0514 : 29.401327 2.218226
  825.0514 : 29.401257 2.218274
  summary(storm.1)

  18. Encapsulating it in a self-starter

  storm.init <- function(mCall, data, LHS) {
      v <- eval(mCall[["V"]], data)
      w <- eval(mCall[["W"]], data)
      t <- eval(LHS, data)
      b <- lsfit(cbind(v, t), t * w, int = FALSE)$coef
      names(b) <- mCall[c("b", "t")]
      b
  }
  NLSstormer <- selfStart( ~ b*V/(W - t), storm.init, c("b", "t"))

  19. Checking the procedure

  > args(NLSstormer)
  function (V, W, b, t)
  NULL
  > tst <- nls(Time ~ NLSstormer(Viscosity, Wt, beta, theta),
               stormer, trace = T)
  885.3645 : 28.875541 2.843728
  825.1098 : 29.393464 2.233276
  825.0514 : 29.401327 2.218226
  825.0514 : 29.401257 2.218274
  >

  20. Confidence intervals
  • Three strategies:
    • Normal theory: beta-hat +/- 2 standard errors (read off the fit; see the sketch below)
    • Bootstrap re-sampling of the centred residuals
    • Bayesian bootstrap weighting of the observations
  • From the summary:

  Parameters:
        Estimate Std. Error
  beta   29.4013     0.9155
  theta   2.2183     0.6655
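
  The normal-theory intervals can be computed directly from the fit; a minimal sketch using storm.1 from slide 17:

  est <- summary(storm.1)$coefficients
  ## crude normal-theory intervals: estimate +/- 2 standard errors
  cbind(lower = est[, "Estimate"] - 2*est[, "Std. Error"],
        upper = est[, "Estimate"] + 2*est[, "Std. Error"])
  confint(storm.1)   # profile-likelihood intervals, for comparison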

  21. Bootstrapping the residuals

  > tst$call$trace <- NULL
  > B <- matrix(NA, 500, 2)
  > r <- resid(tst)
  > r <- r - mean(r)   # mean-correct the residuals
  > f <- stormer$Time - r
  > for(i in 1:500) {
        v <- f + sample(r, rep = TRUE)
        B[i, ] <- try(coef(update(tst, v ~ .)))
    }
  > cbind(Coef = colMeans(B), SD = sd(B))   # in current R use apply(B, 2, sd)
            Coef        SD
  [1,] 30.050890 0.8115827
  [2,]  1.981724 0.5903946
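
  The replicates in B also yield percentile intervals directly, not just standard errors; a one-line sketch (assuming all 500 refits succeeded):

  ## 95% percentile intervals from the 500 bootstrap replicates
  apply(B, 2, quantile, c(0.025, 0.975))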

  22. Smooth bootstrap re-sampling

  > n <- nrow(stormer)
  > B <- matrix(NA, 1000, 2)
  > for(i in 1:1000)
        B[i, ] <- try(coef(update(tst, weights = rexp(n))))
  > cbind(Coef = colMeans(B), SD = sd(B))   # in current R use apply(B, 2, sd)
            Coef        SD
  [1,] 29.367936 0.5709821
  [2,]  2.252755 0.5811193

  23. Final remarks
  • Bootstrap methods appear to give reasonable standard errors for the non-linear parameter, but not for the linear one
  • The model may not in fact 'fit' very well
  • An alternative to for-loops would be to use unrolled loops in a batch run: details on request...! (a simpler in-session alternative is sketched below)
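
  The batch-run idea is not shown here, but within a session replicate() is one compact substitute for the explicit loop; a sketch for the smooth bootstrap of slide 22, using tst and n from above:

  ## 1000 smooth-bootstrap refits without writing the loop out
  B <- t(replicate(1000, coef(update(tst, weights = rexp(n)))))
  cbind(Coef = colMeans(B), SD = apply(B, 2, sd))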
