This session explores data operations on Swiss credit card data to build a predictor for credit card ownership. The main response variable is whether a customer owns a credit card, with various personal details and account-usage behaviour as predictors. Techniques used include logistic regression with stepwise variable selection, decision trees, and bootstrap bagging methods. The data are split into training and test sets, and several models are compared on test-set accuracy, with the bagged trees giving a modest improvement over a single tree.
Session 3: Some Data Operations
The Swiss Credit Card Data: Bagging
The Stormer Viscometer Data: Non-linear Regression, Bootstrap Confidence Intervals
Swiss Credit Card Data
• Main response: Does this customer own a credit card?
• Predictors: various personal details (age, sex, language, ...) and account use behaviour
• Problem: Build a predictor for credit card ownership
• Strategies:
  • Logistic regression with stepwise variable selection, penalized by BIC
  • Trees
  • Trees with discrete bagging
  • Trees with smooth bagging
Split data into "train" and "test" sets

SwissCC <- read.csv("SwissCC.csv")
### set the random seed for reproducibility
set.seed(300424)
### take 40% of the rows for training, the rest for testing
k <- nrow(SwissCC)
isp <- sample(k, floor(0.4*k))
CCTrain <- SwissCC[isp, ]
CCTest <- SwissCC[-isp, ]
require(ASOP)
Save(SwissCC, CCTrain, CCTest, isp)
Stepwise Logistic Regression

### try a logistic regression, starting from the null model
CCLR <- glm(credit_card_owner ~ 1, binomial, CCTrain)
### construct the upper scope as a formula containing all other variables
m <- match("credit_card_owner", names(CCTrain))
form <- as.formula(paste("~", paste(names(CCTrain)[-m], collapse = "+")))
require(MASS)
### k = log(n) makes stepAIC use the BIC penalty
sCCLR <- stepAIC(CCLR, scope = list(lower = ~1, upper = form),
                 trace = FALSE, k = log(nrow(CCTrain)))
Trees

##### trees
require(rpart)
CCTM <- rpart(credit_card_owner ~ ., CCTrain)
plotcp(CCTM)   # other checks possibly...

### Constructing the formula explicitly (equivalent to using '.')
m <- match("credit_card_owner", names(SwissCC))
others <- names(SwissCC)[-m]
plusGroup <- function(nam)
    if(length(nam) == 1) as.name(nam) else
        call("+", Recall(nam[-length(nam)]), as.name(nam[length(nam)]))
form <- call("~", as.name("credit_card_owner"), plusGroup(others))
Bagging
• Two versions, 'discrete' and 'bayesian' ('smooth')
• Discrete bagging:
  • Construct a forest of trees got from ordinary bootstrap samples of the training data
• Bayesian bootstrap bagging:
  • Construct a forest of trees got by randomly weighting the entries of the training data, with Gamma(1,1) (i.e. exponential) weights
• Predict the class for each tree in the forest
• Declare the simple majority vote the predictor
#### bagged trees

baggedRPart <- function(object, style = c("discrete", "bayesian"),
                        data = eval.parent(object$call$data),
                        nbags = 100, seed = 130244) {
    style <- match.arg(style)
    set.seed(seed)
    k <- nrow(data)
    bag <- structure(list(), randomSeed = seed)
    switch(style,
           discrete = ## refit the tree on ordinary bootstrap resamples
               for(i in 1:nbags)
                   bag[[i]] <- update(object, data = data[sample(k, rep = TRUE), ]),
           bayesian = ## refit the tree with exponential case weights
               for(i in 1:nbags)
                   bag[[i]] <- update(object, weights = rexp(k)))
    oldClass(bag) <- "baggedRPart"
    bag
}
Notes
• Would work with any model object that allows updating with new data sets or new weights
• Examples: nls(), tree(), glm(), ...
• A more general function would return an object with some information on the class of the primary object in the inheritance train, e.g.
  oldClass(val) <- c("bagged", paste("bagged", class(object), sep = "."))
• The package 'ipred' has a more general bagging() function that constructs the initial object itself (see the sketch below)
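A hedged sketch of the 'ipred' route mentioned above; the argument names (nbagg, coob) and the predict() type are quoted from memory of that package and should be checked against its help pages:

library(ipred)
CCbag <- bagging(credit_card_owner ~ ., data = CCTrain, nbagg = 100, coob = TRUE)
pIpred <- predict(CCbag, CCTest, type = "class")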
Prediction from the forest of bags

predict.baggedRPart <- function(object, newdata, ...) {
    preds <- lapply(object, predict, newdata, type = "class")
    preds <- do.call("cbind", lapply(preds, as.character))
    preds <- apply(preds, 1, function(x) {
        tx <- table(x)
        names(tx)[which(tx == max(tx))[1]]   # simple majority vote
    })
    names(preds) <- names(object)
    preds
}
Notes
• For a more general function, the predict method would be dispatched on the inherited class, e.g. bagged.rpart, bagged.tree, bagged.glm, &c (see the sketch below)
• With glm, there may be no advantage in bagging if the same model is used every time
• May need to bag the variable selection as well, and the computation could become very expensive!
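A minimal sketch of such a method for a hypothetical bagged.glm class, assuming the class convention suggested two slides back; for a binomial glm the natural aggregate is the average fitted probability rather than a vote (the threshold argument is illustrative only):

predict.bagged.glm <- function(object, newdata, threshold = 0.5, ...) {
    probs <- sapply(object, predict, newdata = newdata, type = "response")
    p <- rowMeans(probs)                  # average probability over the forest
    ifelse(p > threshold, "yes", "no")    # labels as used for psCCLR later
}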
Recording the results and checking

############# some tests
bdCCTM <- baggedRPart(CCTM)
bbCCTM <- baggedRPart(CCTM, style = "b")
CCTest <- transform(CCTest,
    psCCLR  = ifelse(predict(sCCLR, CCTest, type = "response") > 0.5, "yes", "no"),
    pCCTM   = predict(CCTM, CCTest, type = "class"),
    pbdCCTM = predict(bdCCTM, CCTest),
    pbbCCTM = predict(bbCCTM, CCTest))
Confusion, confusion, confusion ...

### look at the confusion matrices
confusion <- function(f1, f2) {
    tab <- table(f1, f2)
    names(dimnames(tab)) <- c(deparse(substitute(f1)), deparse(substitute(f2)))
    oldClass(tab) <- "confusion"
    tab
}
errorRate <- function(confusion)
    1 - sum(diag(confusion))/sum(confusion)
print.confusion <- function(x, ...) {
    x <- unclass(x)
    NextMethod("print", x)
    cat("Error rate: ", format(100*round(errorRate(x), 4)), "%\n")
    invisible(x)
}
The final showdown

with(CCTest, {
    print(confusion(credit_card_owner, psCCLR))
    print(confusion(credit_card_owner, pCCTM))
    print(confusion(credit_card_owner, pbdCCTM))
    print(confusion(credit_card_owner, pbbCCTM))
    invisible()
})
                  psCCLR
credit_card_owner  no yes
              No  484  96
              Yes  67 604
Error rate:  13.03 %

                  pCCTM
credit_card_owner  No Yes
              No  455 125
              Yes  45 626
Error rate:  13.59 %

                  pbdCCTM
credit_card_owner  No Yes
              No  478 102
              Yes  47 624
Error rate:  11.91 %

                  pbbCCTM
credit_card_owner  No Yes
              No  478 102
              Yes  51 620
Error rate:  12.23 %
Comments
• Not a very convincing demonstration, but bagging usually does show some improvement
• With a continuous response, the predictions are averaged over the forest
• Bagging has been shown to benefit only models with substantial non-linearity, so it has little to offer for essentially linear models
• randomForest is a package that further develops the bagging idea (see the sketch below)
• gbm is a 'gradient boosting' package that uses a more systematic approach
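A hedged sketch of fitting a random forest to the same split; it assumes credit_card_owner is stored as a factor (so randomForest does classification) and reuses the confusion() helper defined earlier:

library(randomForest)
set.seed(300424)
CCRF <- randomForest(credit_card_owner ~ ., data = CCTrain, ntree = 500)
pCCRF <- predict(CCRF, CCTest)    # majority-vote class predictions
confusion(CCTest$credit_card_owner, pCCRF)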
Non-linear regression: self-starting models
• Stormer data
• Model: Time = b * Viscosity/(Wt – g)
• 'b' and 'g' unknown parameters
• Initial values: multiplying through by (Wt – g) and rearranging gives
  Wt * Time = b * Viscosity + g * Time
  which can be fitted by ordinary least squares
• Statistically nonsense, but worth using as a starting procedure
An initial look at the model

library(MASS)
b <- coef(lm(Wt*Time ~ Viscosity + Time - 1, stormer))
names(b) <- c("beta", "theta")
b
     beta     theta
28.875541  2.843728
storm.1 <- nls(Time ~ beta*Viscosity/(Wt - theta), stormer,
               start = b, trace = T)
885.3645 :  28.875541  2.843728
825.1098 :  29.393464  2.233276
825.0514 :  29.401327  2.218226
825.0514 :  29.401257  2.218274
summary(storm.1)
Encapsulating it in a self-starter

storm.init <- function(mCall, data, LHS) {
    v <- eval(mCall[["V"]], data)   # Viscosity argument
    w <- eval(mCall[["W"]], data)   # Wt argument
    t <- eval(LHS, data)            # Time, the left-hand side
    b <- lsfit(cbind(v, t), t * w, int = F)$coef
    names(b) <- mCall[c("b", "t")]
    b
}
NLSstormer <- selfStart( ~ b*V/(W - t), storm.init, c("b", "t"))
Checking the procedure

> args(NLSstormer)
function (V, W, b, t)
NULL
> tst <- nls(Time ~ NLSstormer(Viscosity, Wt, beta, theta),
             stormer, trace = T)
885.3645 :  28.875541  2.843728
825.1098 :  29.393464  2.233276
825.0514 :  29.401327  2.218226
825.0514 :  29.401257  2.218274
>
Confidence intervals
• Three strategies:
  • Normal theory: beta-hat ± 2 standard errors (see the sketch below)
  • Bootstrap re-sampling of the centred residuals
  • Bayesian bootstrap weighting of the observations
• From the summary:
Parameters:
      Estimate Std. Error
beta   29.4013     0.9155
theta   2.2183     0.6655
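A small sketch of the normal-theory intervals taken directly from the nls fit; the coefficient-matrix layout assumed here is the standard one returned by coef(summary(tst)):

est <- coef(summary(tst))
cbind(lower = est[, "Estimate"] - 2 * est[, "Std. Error"],
      upper = est[, "Estimate"] + 2 * est[, "Std. Error"])
## profile-likelihood intervals would also be available via confint(tst)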
Bootstrapping the residuals

> tst$call$trace <- NULL
> B <- matrix(NA, 500, 2)
> r <- resid(tst)
> r <- r - mean(r)          # mean-correct the residuals
> f <- stormer$Time - r
> for(i in 1:500) {
      v <- f + sample(r, rep = T)
      B[i, ] <- try(coef(update(tst, v ~ .)))
  }
> cbind(Coef = colMeans(B), SD = apply(B, 2, sd))
          Coef        SD
[1,] 30.050890 0.8115827
[2,]  1.981724 0.5903946
Smooth bootstrap re-sampling

> n <- nrow(stormer)
> B <- matrix(NA, 1000, 2)
> for(i in 1:1000)
      B[i, ] <- try(coef(update(tst, weights = rexp(n))))
> cbind(Coef = colMeans(B), SD = apply(B, 2, sd))
          Coef        SD
[1,] 29.367936 0.5709821
[2,]  2.252755 0.5811193
Final remarks
• Bootstrap methods appear to give reasonable standard errors for the non-linear parameter, but not for the linear one
• The model may not in fact 'fit' very well
• An alternative to for-loops would be to use unrolled loops in a batch run – details on request...! (one simpler vectorised alternative is sketched below)
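As one such alternative (a vectorised rewrite using replicate(), not necessarily the 'unrolled loops in a batch run' referred to above), the smooth bootstrap could be run as a minimal sketch like this:

n <- nrow(stormer)
one.fit <- function()
    tryCatch(coef(update(tst, weights = rexp(n))), error = function(e) c(NA, NA))
B <- t(replicate(1000, one.fit()))   # 1000 x 2 matrix of coefficients
cbind(Coef = colMeans(B, na.rm = TRUE), SD = apply(B, 2, sd, na.rm = TRUE))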