
Application of Logistic Regression using R Programming

Logistic regression predicts the probability that an event occurs when the target variable is categorical with two categories. Like linear regression, it requires certain assumptions to be satisfied, so fitting it is not a trivial task, but R makes it straightforward to fit a logistic regression model. Statswork offers statistical services tailored to the customer's requirements. When you order statistical services at Statswork, we promise you the following: always on time, outstanding customer support, and high-quality subject matter experts.

Contact Us:
Website: www.statswork.com
Email: info@statswork.com
United Kingdom: 44-1143520021
India: 91-4448137070
WhatsApp: 91-8754446690


Presentation Transcript


  1. OCT 10, 2019 | Research paper
     APPLICATION OF LOGISTIC REGRESSION USING R PROGRAMMING
     Experience of Statswork.com
     Tags: Statswork | Logistic regression | R programming | Python Expert | Programmers | Statistical Data Analysis | Data Analysis Services | Data Mining Services | Data Collection | Big Data Analytics | Statistics Services
     Research Planning | Data Collection | Semantic Annotation | Consumer & Retail Analytics | Econometrics
     Copyright © 2019 Statswork. All rights reserved

  2. LOGISTIC REGRESSION ANALYSIS
     • Logistic regression predicts the probability that an event occurs by fitting the data to a logistic curve.
     • It is a type of predictive model for a target variable that is categorical with two categories.
     • Its predictor variables may be either numerical or categorical.

  3. USES OF LOGISTIC REGRESSION ANALYSIS
     • Predicts the presence or absence of a characteristic or outcome based on the values of a set of predictor variables.
     • Is suitable for models where the dependent variable is dichotomous.
     • Estimates odds ratios (OR) for each of the independent variables in the model.
     • Predicts the probability of occurrence of an event by fitting the data to a logit function.
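
As a rough illustration of how odds ratios are obtained from a fitted model, the sketch below uses R's built-in mtcars data rather than any data set from this presentation. The choice of am as the dichotomous response and wt as the predictor is purely an assumption made for the example.

```r
# Illustrative only: mtcars ships with R and is not part of the presentation.
# am (0 = automatic, 1 = manual) is the dichotomous response, wt the predictor.
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Coefficients are on the log-odds scale; exponentiating them gives odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)
```

An odds ratio below 1 for a predictor means the odds of the outcome fall as that predictor increases; a value above 1 means they rise.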

  4. EXAMPLE OF LOGISTIC REGRESSION ANALYSIS
     Question 1: Predict whether a patient has high blood pressure (BP) or not, using other observations such as age, smoking habits, weight or body mass index (BMI), blood cholesterol levels, diastolic and systolic BP values, gender, etc.

     Table 1. Blood Pressure Range

     | BP        | Low | Normal | Borderline | High |
     |-----------|-----|--------|------------|------|
     | Systolic  | <90 | 90-130 | 131-140    | >140 |
     | Diastolic | <60 | 60-80  | 81-90      | >90  |

     Source: Adapted from Nimmala et al., 2018

  5. PROCEDURES INVOLVED IN CONDUCTING LOGISTIC REGRESSION ANALYSIS
     • Select the response or dependent variable (whether the patient has high BP or not).
     • Treat the other variables as explanatory or independent variables.
     • Code the dependent variable as 0 and 1.
     • An explanatory variable can be a continuous or an ordinal variable.
     RESULT: The outcome is predicted by applying the logistic regression model.
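
The coding step can be sketched as follows. The patient readings are made up for illustration, and the rule used to label a patient as "high BP" (systolic above 140 or diastolic above 90) is an assumption based on the High column of the blood pressure range table.

```r
# Hypothetical patient readings (made-up values, not from the presentation).
patients <- data.frame(
  systolic  = c(118, 145, 132, 160, 85),
  diastolic = c(76, 95, 84, 88, 55)
)

# Code the response as 1 (high BP) or 0 (not high BP).
# Assumed rule: systolic above 140 or diastolic above 90 counts as high.
patients$high_bp <- ifelse(patients$systolic > 140 | patients$diastolic > 90, 1, 0)
print(patients)
```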

  6. Question 2: Predict whether a child has high blood pressure (BP) or not, using other observations such as age, smoking habits, weight or body mass index (BMI), blood cholesterol levels, diastolic and systolic BP values, gender, etc.
     Table 1. Prediction model for childhood high blood pressure (adapted from Hamoen et al., 2018)

  7. LOGISTIC REGRESSION IMPLEMENTATION USING LOGIT FUNCTION AND R PROGRAMMING

  8. LOGISTIC REGRESSION IMPLEMENTATION USING LOGIT FUNCTION
     Logistic regression generates the coefficients of a formula that predicts a logit transformation of the probability of the presence of the characteristic of interest:

     $$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

     where $p$ is the probability of the presence of the characteristic of interest. The logit transformation is defined as the logged odds:

     $$\text{odds} = \frac{p}{1-p} = \frac{\text{probability of presence of the characteristic}}{\text{probability of absence of the characteristic}}$$

     and

     $$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(p) - \log(1-p)$$
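
The logit transformation and its inverse can be checked numerically. In this minimal sketch the coefficient values b0 and b1 and the predictor value x1 are hypothetical, chosen only to show the arithmetic; qlogis() and plogis() are the base R logit and inverse-logit functions.

```r
# Probabilities, their odds, and their logits (log odds).
p <- c(0.1, 0.5, 0.9)
odds <- p / (1 - p)
logit_p <- log(odds)

# Base R provides the same transformation: qlogis() is the logit,
# plogis() is its inverse (the logistic function).
print(all.equal(logit_p, qlogis(p)))

# Hypothetical coefficients and predictor value, for illustration only.
b0 <- -1
b1 <- 0.5
x1 <- 4
eta <- b0 + b1 * x1      # linear predictor on the logit scale
p_hat <- plogis(eta)     # back-transform the logit to a probability
print(p_hat)
```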

  9. LOGISTIC REGRESSION IMPLEMENTATION USING R PROGRAMMING
     R makes it easy to fit a logistic regression model with the glm() function.
     Question: Fit a binary logistic model to the Titanic dataset available on Kaggle.
     Procedure
     • Frame the objective of the study: the response is the survival variable (1 = survived, 0 = not survived) and the other variables are independent variables.
     • Load the training data into the console using the read.csv() function.
         train.data <- read.csv('train.csv', header = TRUE, na.strings = c(""))
     • Check for missing data before fitting, using the sapply() function.
         sapply(train.data, function(x) sum(is.na(x)))
     • Handle the missing entries. NAs can be replaced with either the mean or the median of the variable, for example for Age:
         train.data$Age[is.na(train.data$Age)] <- mean(train.data$Age, na.rm = TRUE)
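
Because train.csv is not bundled with this transcript, the following sketch reproduces the missing-value check and the mean imputation on a small made-up data frame. The data frame toy and its values are assumptions for illustration only.

```r
# Small made-up data frame standing in for the Titanic training data.
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 1),
  Age      = c(22, NA, 38, NA, 30),
  Fare     = c(7.25, 71.28, 8.05, 53.10, 8.46)
)

# Count the missing values in every column.
na_counts <- sapply(toy, function(x) sum(is.na(x)))
print(na_counts)

# Replace missing ages with the mean of the observed ages.
# median(toy$Age, na.rm = TRUE) could be used instead of the mean.
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)
print(toy$Age)
```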

  10. Missing values per variable:

      | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
      |-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
      | 0           | 0        | 0      | 0    | 0   | 177 | 0     | 0     | 0      | 0    | 687   | 2        |

      “Cabin” has too many missing values and “PassengerId” is only an identifier, so they are skipped. The remaining variables are selected into a new data set with the subset() function.
          data <- subset(train.data, select = c(2,3,5,6,7,8,10,12))
      where the numeric values are the column positions in the data file (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked, so Name and Ticket are dropped as well).
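
Selecting columns by position is easy to get wrong when the file layout changes, so selecting by name can be a safer alternative. The sketch below compares the two on a made-up data frame; toy and its values are hypothetical.

```r
# Made-up data frame with a few Titanic-style columns.
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("A", "B", "C"),
  Sex         = c("male", "female", "female"),
  Cabin       = c(NA, "C85", NA)
)

# Select by column position, as in the presentation.
by_index <- subset(toy, select = c(2, 3, 5))

# Select the same columns by name.
by_name <- subset(toy, select = c(Survived, Pclass, Sex))

print(names(by_name))
```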

  11. FITTING LOGISTIC REGRESSION MODEL
      • Before fitting, split the data into two sets: a training set (to fit the model) and a testing set (to evaluate it).
          traindata <- data[1:800, ]
          testdata <- data[801:889, ]
      • Specify family = binomial in the glm() function, since the response is binary.
          lrmodel <- glm(Survived ~ ., family = binomial(link = 'logit'), data = traindata)
      • The result of the model can be obtained using the following command:
          summary(lrmodel)
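
The presentation splits the rows by position. A random split is a common alternative when the row order may carry a pattern. The sketch below shows one way to do it on simulated data; the data frame sim, the 80/20 ratio, the seed, and the true coefficients are all assumptions for illustration.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Simulated data: binary response y depends on x1; x2 is unrelated noise.
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1))

# Randomly assign 80% of the rows to training and the rest to testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit the binary logistic model on the training rows only.
fit <- glm(y ~ ., family = binomial(link = "logit"), data = train)
print(summary(fit))
```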

  12. RESULT INTERPRETATION
      From the p-values:
      • The “Embarked”, “Fare”, and “SibSp” variables are not significant.
      • The “Sex” variable has a very small p-value, so there is a strong association between a passenger's sex and the chance of survival.
      It is important to note that the model is linear in the log odds of the response: ln(odds) = ln(p/(1-p)) = a*x1 + b*x2 + … + z*xn.
      To analyse the deviance, use the anova() function in R.
          anova(lrmodel, test = "Chisq")
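
To read p-values programmatically rather than from the printed summary, the coefficient table can be extracted from the fitted model. This sketch uses simulated data, so the variable names x1 and noise and the true coefficients are assumptions for illustration.

```r
set.seed(1)

# Simulated data: y depends strongly on x1 and not at all on noise.
n <- 300
sim <- data.frame(x1 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.5 * sim$x1))

fit <- glm(y ~ x1 + noise, family = binomial(link = "logit"), data = sim)

# The coefficient table holds estimates, standard errors, z values and p-values.
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|z|)"]
print(round(p_values, 4))

# Sequential analysis of deviance: one row for the null model, one per term.
dev_table <- anova(fit, test = "Chisq")
print(dev_table)
```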

  13. • A wider difference between the null deviance and the residual deviance indicates better model performance.
      • A large p-value indicates that the logistic regression model without that particular variable explains about the same amount of variation.
      The Akaike Information Criterion (AIC) is then used to identify the best model; the model with the lowest AIC is preferred.
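
The comparison of null and residual deviance, and the use of AIC, can be sketched with R's built-in mtcars data. This is an illustrative stand-in and not the Titanic model from the presentation; the choice of am as the response and wt as the predictor is an assumption.

```r
# Intercept-only (null) model and a model with one predictor.
null_fit <- glm(am ~ 1,  family = binomial, data = mtcars)
fit      <- glm(am ~ wt, family = binomial, data = mtcars)

# Drop in deviance from the null model to the fitted model, and the
# degrees of freedom used to achieve it.
dev_drop <- fit$null.deviance - fit$deviance
df_drop  <- fit$df.null - fit$df.residual

# Likelihood ratio test: a small p-value means the predictor improves the fit.
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
print(c(deviance_drop = dev_drop, df = df_drop, p_value = p_value))

# Lower AIC indicates the preferred model.
print(AIC(null_fit, fit))
```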

  14. PREDICT THE MODEL ON A NEW SET OF DATA
      R does this with the predict() function. Set the threshold for the new response data so that y = 1 if P(y=1|X) > 0.5 and y = 0 otherwise. The threshold can be changed according to the researcher's needs. The following table presents the accuracy prediction formulas, used especially when machine learning algorithms are applied.
      Table 4. Measures and formula
          fit.results <- predict(lrmodel, newdata = subset(testdata, select = c(2,3,4,5,6,7,8)), type = 'response')
          fit.results <- ifelse(fit.results > 0.5, 1, 0)
          Error <- mean(fit.results != testdata$Survived)
          print(paste('Accuracy', 1 - Error))
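
Accuracy is only one of the usual classification measures. The sketch below computes accuracy, sensitivity, and specificity from a confusion matrix using their standard definitions; it does not reproduce the presentation's own table of measures. The vectors actual and prob are made-up values.

```r
# Made-up true labels and predicted probabilities, for illustration only.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3, 0.55, 0.45)

# Apply the 0.5 threshold to turn probabilities into class predictions.
predicted <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix; fixing the factor levels keeps both classes present.
conf <- table(
  Actual    = factor(actual, levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)
print(conf)

tp <- conf["1", "1"]  # true positives
tn <- conf["0", "0"]  # true negatives
fp <- conf["0", "1"]  # false positives
fn <- conf["1", "0"]  # false negatives

accuracy    <- (tp + tn) / sum(conf)
sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```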

  15. Output:
          "Accuracy 0.842696629213483"
      • Accuracy = 84.27% (here the data is split manually).
      Finally, we find the area under the curve by plotting the Receiver Operating Characteristic (ROC) curve using the following commands:
          library(ROCR)
          A <- predict(lrmodel, newdata = subset(testdata, select = c(2,3,4,5,6,7,8)), type = "response")
          Pre <- prediction(A, testdata$Survived)
          Pre1 <- performance(Pre, measure = "tpr", x.measure = "fpr")
          plot(Pre1)
          auc <- performance(Pre, measure = "auc")
          auc <- auc@y.values[[1]]
          auc
      Output:
          0.8647186
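
When the ROCR package is not available, the AUC can also be computed in base R through its rank-based (Mann-Whitney) form: the proportion of positive/negative pairs in which the positive case receives the higher score. This is a swapped-in technique, not the presentation's method, and the vectors actual and prob are made-up values.

```r
# Made-up true labels and predicted probabilities, for illustration only.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3, 0.55, 0.45)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Rank all scores, then apply the Mann-Whitney U formula to the positives.
r <- rank(prob)
auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Here 21 of the 25 positive/negative pairs are ordered correctly, so the AUC is 0.84.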

  16. RESULTS
      The ROC curve plots the true positive rate against the false positive rate, that is, sensitivity against 1 - specificity. Since the Area Under the Curve (AUC) is close to 1, this is a good predictive model.

  17. www.statswork.com
      Statswork Lab @ Statswork.com

  18. GET IN TOUCH WITH US
      PHONE NUMBER: UK: +44-1143520021, INDIA: +91-4448137070
      EMAIL ADDRESS: info@statswork.com
