LINEAR CLASSIFICATION METHODS
STAT 597 E
Fengjuan Xuan, Caimiao Wei, Bogdan Ilie
Introduction
• The dataset we work on (“BUPA liver disorders”) was collected by BUPA Medical Research Ltd. It consists of 345 observations on 7 variables.
• The first 5 variables are blood-test measurements thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The sixth variable is the number of alcoholic drinks per day. Each observation is the record of a single male individual.
• The seventh variable is a selector that splits the dataset into two sets and indicates the class identity. Of the 345 observations, 145 belong to the liver-disorder group (selector value 2) and 200 belong to the liver-normal group.
Description of variables
• The description of each variable is below:
• 1. mcv: mean corpuscular volume
• 2. alkphos: alkaline phosphatase
• 3. sgpt: alanine aminotransferase
• 4. sgot: aspartate aminotransferase
• 5. gammagt: gamma-glutamyl transpeptidase
• 6. drinks: number of half-pint equivalents of alcoholic beverages drunk per day
• 7. selector: field used to split the data into two sets. It is a binary categorical variable with values 1 and 2 (2 corresponding to liver disorder).
Logistic regression in the full space
• Coefficients:

              Value         Std. Error    t value
(Intercept)    5.99024204   2.684250011    2.231626
mcv           -0.06398345   0.029631551   -2.159301
alk           -0.01952510   0.006756806   -2.889694
sgpt          -0.06410562   0.012283808   -5.218709
sgot           0.12319769   0.024254150    5.079448
gammagt        0.01894688   0.005589619    3.389656
drinks        -0.06807958   0.040358528   -1.686870

• So the classification rule is: G(x)=
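The rule itself did not survive on the slide, so the following is only a minimal sketch of how a rule is usually derived from fitted logistic coefficients: compute the linear predictor, map it to a probability, and threshold at 0.5. Which class the fitted probability refers to is not stated in the source, so the code calls it the "modeled class" as an assumption. The function names and the example subject are hypothetical.

```python
import math

# Coefficients copied from the table above.
INTERCEPT = 5.99024204
COEF = {
    "mcv": -0.06398345,
    "alk": -0.01952510,
    "sgpt": -0.06410562,
    "sgot": 0.12319769,
    "gammagt": 0.01894688,
    "drinks": -0.06807958,
}


def linear_predictor(x):
    """Intercept plus the weighted sum of the six predictors."""
    return INTERCEPT + sum(COEF[name] * x[name] for name in COEF)


def probability(x):
    """Logistic (sigmoid) transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x)))


def in_modeled_class(x, threshold=0.5):
    """ASSUMPTION: the model gives P(modeled class | x); the source does not
    say which selector value that is. With threshold 0.5 this is the same
    as checking whether the linear predictor is positive."""
    return probability(x) > threshold


# Hypothetical subject, invented for illustration only (not from the data set).
subject = {"mcv": 90, "alk": 70, "sgpt": 30, "sgot": 25, "gammagt": 40, "drinks": 3}
eta = linear_predictor(subject)
p = probability(subject)
```

With a 0.5 threshold the decision boundary is the hyperplane where the linear predictor equals zero, which is why logistic regression counts as a linear classification method.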
Classification error rate
• Classification error on the whole training data set:
• error rate: 0.2956
• Sensitivity: 0.825
• Specificity: 0.5379
• Error rate and its standard error obtained by 10-fold cross-validation:
• error rate (standard error): 0.3075 (0.0271)
• Sensitivity (standard error): 0.8163 (0.0203)
• Specificity (standard error): 0.5311 (0.0699)
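As a sketch of how these three quantities relate, the code below computes them from confusion-matrix counts. The counts are not given in the source; they are back-calculated here from the reported training rates and the group sizes 200 and 145, so treat them as an assumption. The cross-validation helper shows one common way to get a standard error (standard deviation of the fold values divided by the square root of the number of folds); whether the authors used exactly this formula is not stated.

```python
import math
import statistics


def rates(tp, fn, tn, fp):
    """Error rate, sensitivity and specificity from confusion-matrix counts."""
    n = tp + fn + tn + fp
    return {
        "error": (fn + fp) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def cv_mean_and_se(fold_values):
    """Mean over folds and its standard error (ASSUMED formula: sd / sqrt(k))."""
    k = len(fold_values)
    return statistics.mean(fold_values), statistics.stdev(fold_values) / math.sqrt(k)


# ASSUMPTION: counts reconstructed from the reported rates, not taken from the source.
training = rates(tp=165, fn=35, tn=78, fp=67)
```

The overall error rate is a weighted average of the two per-group error rates, so a high sensitivity can coexist with a low specificity, as it does here.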
Backward stepwise model selection based on AIC
• Five variables are selected after stepwise model selection. The first variable, mcv, is deleted.
• error rate (standard error): 0.3295 (0.03051)
• Sensitivity (standard error): 0.7921 (0.03433)
• Specificity (standard error): 0.5073 (0.03863)
• COMMENT:
• This model has a larger classification error rate than the full model, so stepwise selection does not improve classification.
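The selection procedure can be sketched generically as follows. This is not the authors' code: `backward_select` and `toy_aic` are hypothetical names, and `toy_aic` is a stand-in scoring function with invented numbers so that the loop can run without fitting real models. In practice the scoring function would refit the logistic regression on each subset and return its AIC (deviance plus twice the number of parameters).

```python
def backward_select(features, aic_of):
    """Drop one variable at a time while doing so lowers the AIC."""
    current = list(features)
    best = aic_of(current)
    while len(current) > 1:
        # AIC of every model obtained by removing exactly one variable.
        candidates = [
            (aic_of([f for f in current if f != drop]), drop) for drop in current
        ]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:  # no single removal improves the AIC: stop
            break
        current.remove(drop)
        best = cand_aic
    return current, best


# Invented deviance reductions for three made-up variables (illustration only).
GAIN = {"a": 10.0, "b": 5.0, "c": 0.5}


def toy_aic(subset):
    deviance = 100.0 - sum(GAIN[f] for f in subset)
    return deviance + 2 * (len(subset) + 1)  # +1 for the intercept


selected, selected_aic = backward_select(["a", "b", "c"], toy_aic)
```

AIC rewards fit to the training likelihood, not classification accuracy, which is one reason a model chosen this way need not have a lower cross-validated error rate.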
The performance of logistic regression on the reduced space
• The reduced space is obtained by selecting the first three principal components. The standard error is obtained by 10-fold cross-validation.
• error rate (standard error): 0.4563 (0.023414)
• Sensitivity (standard error): 0.3729 (0.031675)
• Specificity (standard error): 0.7830 (0.030785)
• Comment:
• The classification error rate is about 46%, which is not much better than random guessing.
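A toy example may help explain why a reduced space that keeps most of the variance can still classify poorly. The sketch below uses invented two-dimensional data (not the liver data) in which almost all the variance lies along one axis while the two groups differ only along the other. PCA is done in closed form for the 2x2 covariance matrix; `pca_2d` and `mean_gap` are hypothetical helper names.

```python
import math


def pca_2d(points):
    """Eigenvalues and unit eigenvectors of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)           # var(x1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)           # var(x2)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)  # cov(x1, x2)
    mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + rad, mid - rad
    if b != 0:
        v1 = (b, lam1 - a)      # eigenvector for the larger eigenvalue
    elif a >= c:
        v1 = (1.0, 0.0)         # uncorrelated case: the axes are the components
    else:
        v1 = (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])        # orthogonal direction
    return (lam1, lam2), (v1, v2)


def mean_gap(group_a, group_b, v):
    """Absolute difference between the group means after projecting onto v."""
    def mean_score(group):
        return sum(p[0] * v[0] + p[1] * v[1] for p in group) / len(group)
    return abs(mean_score(group_a) - mean_score(group_b))


# Invented data: wide spread along x1 in both groups, separation only along x2.
spread = [-10.0, -5.0, 0.0, 5.0, 10.0]
group_a = [(x, 1.0) for x in spread]
group_b = [(x, -1.0) for x in spread]

(lam1, lam2), (v1, v2) = pca_2d(group_a + group_b)
share_pc1 = lam1 / (lam1 + lam2)            # fraction of variance on PC1
gap_pc1 = mean_gap(group_a, group_b, v1)    # class separation along PC1
gap_pc2 = mean_gap(group_a, group_b, v2)    # class separation along PC2
```

In this toy case the first component carries about 98% of the variance but no class separation at all, because PCA ignores the class labels when choosing directions.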
The classification plot on the plane of the first two principal components
Linear Discriminant Analysis
• LDA assumes a multivariate normal distribution, so we apply log transformations to some of the variables.
• Y1=mcv & Y2=log(alk)
• Y3=log(sgpt) & Y4=log(sgot)
• Y5=log(gammagt) & Y6=log(drinks+1)
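The transformations can be sketched as a small function applied to one record. Natural logarithms are assumed, since the base is not stated, and Y4 is taken here to be log(sgot), which is also an assumption. The function name, the dictionary keys and the example record are hypothetical.

```python
import math


def transform(record):
    """Apply the variable transformations listed above to one observation."""
    return {
        "Y1": record["mcv"],                   # left untransformed
        "Y2": math.log(record["alk"]),
        "Y3": math.log(record["sgpt"]),
        "Y4": math.log(record["sgot"]),        # ASSUMPTION: sgot intended
        "Y5": math.log(record["gammagt"]),
        "Y6": math.log(record["drinks"] + 1),  # +1 keeps zero drinks finite
    }


# Invented record for illustration only.
example = {"mcv": 90, "alk": 70, "sgpt": 30, "sgot": 25, "gammagt": 40, "drinks": 0}
y = transform(example)
```

Adding 1 before taking the log of drinks is needed because a subject may report zero drinks, and log(0) is undefined.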
The histogram of the sgpt variable and its log transformation
The performance of LDA on the transformed data
• Comment: the classification error is the smallest among all methods, and the sensitivity (0.865) is higher than that of every logistic regression model.
• error rate: 0.2638
• Sensitivity: 0.865
• Specificity: 0.5586
• The log transformation makes the assumption of multivariate normality more reasonable, so the classification improves.
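For readers unfamiliar with the method, here is a deliberately simplified sketch of two-class LDA with a single predictor. It is not the multivariate fit used for the results above. Each class gets a linear score built from its mean, the pooled variance and its prior, and the class with the larger score wins. The function names and the toy numbers are invented.

```python
import math


def fit_lda_1d(x_by_class):
    """Class means, priors and pooled variance for one predictor."""
    n = sum(len(xs) for xs in x_by_class.values())
    means = {k: sum(xs) / len(xs) for k, xs in x_by_class.items()}
    priors = {k: len(xs) / n for k, xs in x_by_class.items()}
    pooled = sum(
        (x - means[k]) ** 2 for k, xs in x_by_class.items() for x in xs
    ) / (n - len(x_by_class))
    return means, priors, pooled


def predict_lda_1d(x, means, priors, pooled):
    """Pick the class with the largest linear discriminant score."""
    def score(k):
        return x * means[k] / pooled - means[k] ** 2 / (2 * pooled) + math.log(priors[k])
    return max(means, key=score)


# Invented toy data: class 1 centred at 2, class 2 centred at 5.
toy = {1: [1.0, 2.0, 3.0], 2: [4.0, 5.0, 6.0]}
means, priors, pooled = fit_lda_1d(toy)
```

With equal priors the boundary sits halfway between the two class means. The score is linear in x because both classes share one variance, which is the assumption the log transformations are meant to make more plausible.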
LDA after PCA
• error rate: 0.4116
• Sensitivity: 0.88
• Specificity: 0.1862
• Comment: PCA does not improve the performance. The error rate rises to 0.4116 and the specificity falls to 0.1862.
Conclusion
• Four different methods are applied to the liver disorder data set. LDA on the transformed variables works best, and logistic regression on the original data set comes second.
• The classification methods based on principal components do not work well. Although the first three principal components explain more than 97% of the variance, they may still lose the information that matters most for classification.
• Transformations can make LDA work better in some cases. LDA assumes normally distributed data, which is a very strong assumption for many data sets. For example, in our data, all variables except the first are seriously skewed. That is why the log transform works.
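The effect of a log transform on skewness can be checked numerically. The sketch below uses the moment-based sample skewness on invented values that double at each step (not the liver data). Such values are strongly right-skewed on the raw scale and symmetric on the log scale.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented, geometrically growing values for illustration only.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)        # clearly positive: long right tail
skew_logged = skewness(logged)  # essentially zero: evenly spaced values
```

Real measurements will not become perfectly symmetric, but a drop in skewness of this kind is what makes the normality assumption behind LDA easier to defend.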