This guide illustrates the implementation of logistic regression for binary classification, focusing on model validation and prediction accuracy assessment. We will utilize the "birthwt" dataset, aiming to predict low birth weight based on various predictors. The process involves data preparation, random splitting into training and test sets, model fitting using the `glm` function, and optimization through the `stepAIC` function. Additionally, we discuss evaluating model performance through prediction accuracy, confusion matrices, and the importance of analyzing residuals.
Classification and Validation
Stefan Bentink, 1/21/2010
Problem: We have objects/individuals, some with known class labels (Class 1, Class 2) and some whose class is unknown. Fit a model (e.g. logistic regression) to the labelled objects and use it to predict the class of the unlabelled ones.
Evaluation: How many prediction errors will we make in future predictions? Can we simply look at the residuals Y - βX for evaluation? No! Here X is the training data, Y the binary classification label and β the regression coefficients, and we fit a model of the form Y = βX; the residuals only measure how well the model fits the training data, not how well it predicts new data.
Evaluation: To assess the prediction accuracy on new data, we need to test the model on new data. Split the objects (Class 1 and Class 2) into a training set and a test set: fit the model on the training set, then apply it to the test set and measure the prediction accuracy there.
N-fold cross-validation: the objects (Class 1 and Class 2) are split into n folds 1, 2, 3, …, n. In each of n rounds one fold serves as the test set and the remaining folds together form the training set, so every fold is used for testing exactly once; the prediction errors from the n rounds are then combined into one estimate. A minimal code sketch follows below.
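A minimal sketch of n-fold cross-validation in R, my addition rather than part of the original slides: it uses plain glm() and predict() so it is self-contained, assumes the curated file birthwt_new.txt (introduced below) is in the working directory, and assumes the response low is coded 0/1; the number of folds and the 0.5 cutoff are illustrative choices.

## n-fold cross-validation sketch (illustrative)
bw.new <- read.delim("birthwt_new.txt")
n.folds <- 5
set.seed(123)
fold <- sample(rep(1:n.folds, length.out = nrow(bw.new)))   ## random fold labels
cv.error <- numeric(n.folds)
for (i in 1:n.folds) {
  train <- bw.new[fold != i, ]
  test  <- bw.new[fold == i, ]
  fit <- glm(low ~ ., family = binomial(link = logit), data = train)
  p   <- predict(fit, newdata = test, type = "response")    ## predicted probabilities
  cv.error[i] <- mean((p > 0.5) != (test$low == 1))         ## misclassification rate in fold i
}
mean(cv.error)   ## cross-validated error estimate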
Classification in R
• Go to the R website: http://www.r-project.org
• Click on CRAN
• Select a mirror
• Click on Packages (left menu bar)
• Click on CRAN Task Views
• Select the task view covering classification
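As an alternative to clicking through the website, the ctv package can install all packages listed in a task view directly from R. This is my addition, and the view name used here ("MachineLearning", the CRAN Task View that covers classification) is an assumption, not something stated on the slide.

## install a CRAN Task View from within R (view name is an assumption)
install.packages("ctv")
library(ctv)
install.views("MachineLearning")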
Problem 3 – Tutorial 2 (Lecture 4)
• Read in the data file birthwt.txt. This file contains a data set on 189 births at a US hospital. The goal was to determine which set of covariates predicts low birth weight.
• A curated version of the data (birthwt_new.txt) was generated by the script generateBwtNew.r.
• Binary response: low
• Predictors: age, lwt, smoke, ht, ui, ftv, ptd, race
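A quick sketch (my addition) for reading and inspecting the curated file before any modelling; it assumes birthwt_new.txt is in the working directory, as in the code on the following slides.

## read the curated data and inspect response and predictors
bw.new <- read.delim("birthwt_new.txt")
str(bw.new)         ## structure: low plus age, lwt, smoke, ht, ui, ftv, ptd, race
table(bw.new$low)   ## class balance of the binary response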
Implement model validation
• A function to fit the model
• A function to predict with the fitted model
• Randomly split the data into a training set and a test set
Multiple logistic regression model (remember from Lecture 4)

library(MASS)   ## contains function stepAIC
bw.new <- read.delim("birthwt_new.txt")
model <- glm(low ~ ., family = binomial(link = logit), data = bw.new)
model.opt <- stepAIC(model)   ## stepwise model selection by AIC
log.odds <- predict(model, newdata = bw.new)   ## note: newdata=, not data=, so predict() uses the intended data
probabilities <- exp(log.odds)/(1 + exp(log.odds))
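A side note (my addition, not from the slide): predict.glm can return the class probabilities directly with type = "response", which is equivalent to the manual logistic transform above.

## equivalent to exp(log.odds)/(1 + exp(log.odds))
probabilities <- predict(model, newdata = bw.new, type = "response")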
Splitting the data into a training set and a test set

n <- nrow(bw.new)
k <- 2
train.test.size <- floor(n/k)
partition <- rep(1:k, each = train.test.size)
partition[n] <- k   ## ensure the partition vector has length n
## randomly choose training and test set
set.seed(123)
s <- sample(1:n)
training.set.1 <- s[partition == 1]
test.set.1 <- s[partition == 2]
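An optional sanity check (my addition): after a random split it is worth confirming the sizes and the class balance of the two halves, since a heavily unbalanced split can distort the accuracy estimate. This assumes low is coded 0/1 as above.

## sizes and class balance of training and test set
length(training.set.1); length(test.set.1)
table(bw.new$low[training.set.1])
table(bw.new$low[test.set.1])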
Train and validate the model

bw.train <- bw.new[training.set.1, ]
model.train.1 <- my.classify.logit(low ~ ., data = bw.train)
bw.test <- bw.new[test.set.1, ]
true.predict.test.1 <- my.predict.logit(model.train.1, data = bw.test)   ## predicted probabilities
class.test.1 <- as.numeric(true.predict.test.1 > 0.5)   ## classify with a 0.5 cutoff
table(class.test.1, bw.test$low)   ## confusion matrix: predicted vs. true class
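The confusion matrix answers the "prediction accuracy?" question from the earlier slide; a minimal follow-up (my addition, using the same 0.5 cutoff and assuming low is coded 0/1) computes the accuracy directly.

## overall prediction accuracy on the test set
accuracy.test.1 <- mean(class.test.1 == bw.test$low)
accuracy.test.1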
Function

## The general framework
my.function <- function(x, y) {
  ## ... do something with x and y ...
  ## ... assign the result to z ...
  return(z)
}

## Example
my.function <- function(x, y) {
  z <- x + y
  return(z)
}
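A one-line usage example (my addition) for the example function above:

my.function(2, 3)   ## returns 5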
Function to fit the model

## Function to fit a logistic regression model
## f: formula (model specification)
## data: a data.frame
my.classify.logit <- function(f, data) {
  require(MASS)   ## for stepAIC
  model <- glm(f, family = binomial(link = logit), data = data)
  model.opt <- stepAIC(model)   ## optimize the model
  return(model.opt)
}
Function to predict new samples given a model

## Function that predicts class probabilities
## model: a model fitted by my.classify.logit
## data: a data.frame with new data
my.predict.logit <- function(model, data) {
  log.odds <- predict(model, newdata = data)   ## linear predictor (log-odds)
  probabilities <- exp(log.odds)/(1 + exp(log.odds))
  return(probabilities)
}
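Putting the pieces together (my addition): with the two helper functions above, the 2-fold split from the earlier slide can be completed by swapping the roles of the two halves, which amounts to a simple 2-fold cross-validation. A hedged sketch, assuming bw.new, training.set.1 and test.set.1 from the previous slides are still in the workspace and low is coded 0/1.

## fold 2: train on the second half, test on the first half
bw.train.2 <- bw.new[test.set.1, ]
bw.test.2  <- bw.new[training.set.1, ]
model.train.2 <- my.classify.logit(low ~ ., data = bw.train.2)
prob.test.2   <- my.predict.logit(model.train.2, data = bw.test.2)
class.test.2  <- as.numeric(prob.test.2 > 0.5)
table(class.test.2, bw.test.2$low)      ## confusion matrix, fold 2
mean(class.test.2 == bw.test.2$low)     ## accuracy, fold 2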