
Prediction of Diabetes by Employing a Meta-Heuristic Which Can Optimize the Performance of Existing Data Mining Approaches

by Huy Nguyen Anh Pham and Evangelos Triantaphyllou
ICIS’2008 – Portland, Oregon, May 14-16, 2008
Department of Computer Science, Louisiana State University



Presentation Transcript


  1. Prediction of Diabetes by Employing a Meta-Heuristic Which Can Optimize the Performance of Existing Data Mining Approaches by Huy Nguyen Anh Pham and Evangelos Triantaphyllou ICIS’2008 – Portland, Oregon, May 14 - 16, 2008 Department of Computer Science, Louisiana State University Baton Rouge, LA 70803 Emails: hpham15@lsu.edu and trianta@lsu.edu

  2. Outline • Diabetes and the Pima Indian Diabetes (PID) dataset • Selected current work • Motivation • The Homogeneity Based Algorithm (HBA) • Rationale for the HBA • Some computational results • Conclusions

  3. Diabetes and the PID dataset • Diabetes: if the body does not produce or properly use insulin, the excess sugar is driven out of the body by urination. This phenomenon (or disease) is called diabetes. • 20.8 million children and adults in the United States (approximately 7% of the population) were diagnosed with diabetes (American Diabetes Association, 11/2007). • The Pima Indian Diabetes (PID) dataset: 768 records describing female patients of Pima Indian heritage, at least 21 years old, living near Phoenix, Arizona, USA (UCI Machine Learning Repository, 2007).

  4. Diabetes and the PID dataset – Cont’d • The eight attributes for each record in the PID are: number of times pregnant; plasma glucose concentration (2-hour oral glucose tolerance test); diastolic blood pressure (mm Hg); triceps skin fold thickness (mm); 2-hour serum insulin (μU/ml); body mass index (kg/m²); diabetes pedigree function; and age (years).

  5. Selected current work • 76.0% diagnosis accuracy by Smith et al. (1988) when using an early neural network. • 77.6% diagnosis accuracy by Jankowski and Kadirkamanathan (1997) when using IncNet. • 77.6% diagnosis accuracy by Au and Chan (2001) when using a fuzzy approach. • 78.6% diagnosis accuracy by Rutkowski and Cpalka (2003) when using a flexible neural-fuzzy inference system (FLEXNFIS). • 81.8% diagnosis accuracy by Davis (2006) when using a fuzzy neural network. • Less than 78% diagnosis accuracy by the Statlog project (1994) when using various classification algorithms.

  6. Motivation • In medical diagnosis there are three different types of possible errors: • The false-negative type, in which a patient who in reality has the disease is diagnosed as disease free. • The false-positive type, in which a patient who in reality does not have the disease is diagnosed as having it. • The unclassifiable type, in which the diagnostic system cannot diagnose a given case. This happens due to insufficient knowledge extracted from the historical data.

  7. Motivation – Cont’d • Current medical data mining approaches often assign equal penalty costs to the false-positive and the false-negative types: • Diagnosing a new patient as a false positive may: • Make the patient worry unnecessarily. • Lead to unnecessary treatments and expenses. • But the possibilities are not life-threatening. • Diagnosing a new patient as a false negative may mean: • No treatment on time, or none at all. • Conditions may deteriorate and the patient’s life may be at risk. => The two penalty costs for the false-positive and the false-negative types may be significantly different.

  8. Motivation – Cont’d • Current medical data mining approaches ignore the penalty cost for the unclassifiable type: • Because of insufficient knowledge extracted from the historical data, a given patient should be predicted as belonging to the unclassifiable type. • However, in reality current approaches have often predicted such a patient as either having diabetes or being disease free. • Such misdiagnosis may lead to either unnecessary treatments or no treatment when one is needed. => Consideration of the unclassifiable type is required.

  9. Outline • Diabetes and the PID dataset • Selected current work • Motivation • The Homogeneity Based Algorithm (HBA) • Rationale for the HBA • Some computational results • Conclusions

  10. The Homogeneity Based Algorithm - HBA • Developed by the authors of this presentation (Pham and Triantaphyllou, 2007 and 2008). • Define the total misclassification cost TC as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable costs:

minimize TC = CFP × RateFP + CFN × RateFN + CUC × RateUC (1)

Where: • RateFP, RateFN, and RateUC are the false-positive, the false-negative, and the unclassifiable rates, respectively. • CFP, CFN, and CUC are the penalty costs for the false-positive, the false-negative, and the unclassifiable cases, respectively. Usually, CFN is much higher than CFP and CUC. • The goal is to minimize the total misclassification cost TC.
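As a quick illustration of equation (1), here is a minimal Python sketch. The rate values and the default penalty costs are purely illustrative (the defaults mirror Case 3 of the experiments below); nothing here is taken from the authors' implementation.

```python
def total_misclassification_cost(rate_fp, rate_fn, rate_uc,
                                 c_fp=1.0, c_fn=20.0, c_uc=3.0):
    """Equation (1): TC = CFP*RateFP + CFN*RateFN + CUC*RateUC.

    The default penalty costs follow Case 3, where CFN is much
    higher than CFP and CUC; the rates would come from evaluating
    a classifier on a testing dataset.
    """
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc

# Illustrative rates only: 5% false positives, 3% false negatives,
# 2% unclassifiable, with the Case 3 penalty costs (1, 20, 3).
print(total_misclassification_cost(0.05, 0.03, 0.02))  # 0.71
```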

  11. The HBA - Some key observations • [Figure: two patterns, A containing point P and B containing point Q] • Please notice that: • Pattern A covers a region that is not adequately populated by training points. • Pattern B does not have such sparsely covered regions. • The assumption that point P is a diabetes case may not be that accurate. • However, the assumption that point Q is a diabetes case may be more accurate. • The accuracy of the inferred systems can be increased if the derived patterns correspond to homogenous sets. • A homogenous set describes a steady or uniform distribution of a set of distinct points.

  12. The HBA - Some key observations – cont’d • [Figure: pattern A broken into A1 (containing point S) and A2, next to pattern B (containing point Q)] • Break pattern A into A1 and A2. Suppose that patterns A1, A2, and B all correspond to homogenous sets. • The number of points in B is higher than that in A1. • Thus, the assumption that point Q is a diabetes case may be more accurate than the assumption that point S is a diabetes case. • The accuracy of the inferred systems may also be affected by this density, which is therefore used as the Homogeneity Degree (HD).
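The slides do not spell out how the Homogeneity Degree is computed, so the following Python sketch uses a hypothetical density-style proxy (points per unit hypersphere volume) purely to illustrate the idea that denser homogenous sets receive a higher HD; the actual definition is given in Pham and Triantaphyllou (2007, 2008).

```python
import math

def homogeneity_degree(num_points, radius, dims):
    """Hypothetical density-style proxy for the Homogeneity Degree:
    points per unit volume of the enclosing hypersphere. Not the
    authors' definition; it only shows that a dense set like B
    outranks a sparse one like A1.
    """
    # Volume of a d-dimensional hypersphere: pi^(d/2) / Gamma(d/2 + 1) * r^d.
    volume = math.pi ** (dims / 2) / math.gamma(dims / 2 + 1) * radius ** dims
    return num_points / volume

print(homogeneity_degree(50, 1.0, 2))  # dense set (like B): ~15.9
print(homogeneity_degree(10, 1.0, 2))  # sparse set (like A1): ~3.2
```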

  13. The HBA – The Main Algorithm • Phase #1: Given a training dataset T, divide T into two sub-datasets: • T1, whose size is, say, equal to 90% of T’s size. • T2, whose size is, say, equal to 10% of T’s size. • The training points in T1 are randomly selected from T. • Phase #2: • Apply a classification approach (such as a DT, ANN, or SVM) on the training dataset T1 to infer the classification systems. Suppose that each classification system consists of a set of patterns. • Break the inferred patterns into hyperspheres. • Phase #3: • Determine whether or not the hyperspheres are homogenous sets. • If so, then compute their Homogeneity Degrees and go to Phase #4. • Otherwise, break them into smaller hyperspheres and repeat Phase #3 until all the hyperspheres are homogenous sets.

  14. The HBA – The Main Algorithm – cont’d • Phase #4: • Sort the Homogeneity Degrees in decreasing order. • For each homogenous set: • If its Homogeneity Degree is greater than a certain threshold value, then expand it. • Otherwise, break it into smaller homogenous sets, which may then be expanded. • The approach stops when all of the homogenous sets have been processed. • Phase #5: • Apply a genetic algorithm (GA) over Phases #2 to #4 to find optimal threshold values, using the total misclassification cost as the objective function and the dataset T2 as a calibration dataset. • After obtaining the optimal threshold values, the training points in T2 can be divided into two sub-datasets: • T2,1, which consists of the classifiable points. • T2,2, which includes the unclassifiable points. • Let S1 denote the classification system inferred after the GA approach is completed. • The dataset T2,2 is then run through Phases #2 to #4 with the optimal threshold values obtained from the GA approach to infer an additional classification system S2. • The final classification system is the union of S1 and S2.
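A minimal, runnable Python sketch of Phases 1 and 3 follows, using toy 1-D "spheres" in place of real hyperspheres over the eight PID attributes. The is_homogenous and break_sphere callbacks are hypothetical stand-ins for the operations the slides name, and the GA calibration of Phase 5 is omitted.

```python
import random

def split_dataset(T, frac=0.9, seed=0):
    """Phase 1: randomly split T into T1 (90%) and T2 (10%)."""
    shuffled = T[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def refine(spheres, is_homogenous, break_sphere):
    """Phase 3: keep breaking hyperspheres until all are homogenous sets."""
    homogenous, work = [], list(spheres)
    while work:
        s = work.pop()
        if is_homogenous(s):
            homogenous.append(s)
        else:
            work.extend(break_sphere(s))
    return homogenous

# Toy usage: a "sphere" is a sorted list of 1-D points; call it homogenous
# when its spread is small, and break it at the midpoint otherwise.
points = sorted([0.1, 0.2, 0.25, 0.9, 1.0, 1.05, 1.1])
clusters = refine([points],
                  is_homogenous=lambda s: max(s) - min(s) < 0.5,
                  break_sphere=lambda s: [s[:len(s) // 2], s[len(s) // 2:]])
print(clusters)  # two dense clusters
```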

  15. Rationale for the HBA • Consider the problem as an optimization formulation in terms of the false-positive, the false-negative, and the unclassifiable costs. • The HBA optimally adjusts the inferred classification systems. • The Homogeneity Degree is used in the control conditions for both expansion (to control generalization) and breaking (to control fitting). • Homogenous sets are expanded in decreasing order of their Homogeneity Degrees.

  16. Some computational results Parameters: • The four parameters needed in the HBA consist of: • Two expansion threshold values α+ and α- used for expanding the positive and negative homogenous sets, respectively. • Two breaking threshold values β+ and β- used for breaking the positive and negative homogenous sets, respectively. Experimental methodology: • Step 1: The original algorithm was first trained on the training dataset T, and the value for TC was then derived by using the testing dataset. • Step 2: The HBA was trained on the training dataset T1, and the value for TC was then derived, also using the testing dataset. • Step 3: The two values of TC returned in Steps 1 and 2 were compared with each other.
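A small Python sketch of the three-step comparison, assuming the rate triples for each trained model have already been measured on the same testing dataset; the numbers in the usage line are illustrative, not the paper's measurements.

```python
def compare_tc(rates_original, rates_hba, c_fp, c_fn, c_uc):
    """Steps 1-3: given the (RateFP, RateFN, RateUC) triples for the
    original algorithm (trained on T) and for the HBA (trained on T1),
    both measured on the testing dataset, compare the two TC values.
    """
    def tc(rates):
        fp, fn, uc = rates
        return c_fp * fp + c_fn * fn + c_uc * uc

    tc_orig, tc_hba = tc(rates_original), tc(rates_hba)
    reduction_pct = 100.0 * (tc_orig - tc_hba) / tc_orig
    return tc_orig, tc_hba, reduction_pct

# Illustrative rate triples only, with the Case 3 costs (1, 20, 3):
print(compare_tc((0.10, 0.12, 0.00), (0.06, 0.02, 0.04), 1, 20, 3))
```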

  17. Some computational results – Cont’d • Case 1: minimize TC = 1 × RateFP + 1 × RateFN + 0 × RateUC (i.e., only the false-positive and the false-negative costs are considered, and they are weighted equally). • The HBA, on the average, decreased the total misclassification cost by about 81.57%.

  18. Some computational results – Cont’d • Case 2: minimize TC = 3 × RateFP + 3 × RateFN + 3 × RateUC (i.e., all three costs are assumed to be equal). • The HBA, on the average, decreased the total misclassification cost by about 50.48%. • Case 3: minimize TC = 1 × RateFP + 20 × RateFN + 3 × RateUC (i.e., the false-negative cost is assumed to be much higher than the other two costs). • The HBA, on the average, decreased the total misclassification cost by about 51.59%.

  19. Some computational results – Cont’d • Case 4: minimize TC = 1 × RateFP + 100 × RateFN + 3 × RateUC (i.e., the false-negative cost is assumed to be significantly higher than the other two costs). • The HBA, on the average, decreased the total misclassification cost by about 76.00%. • The higher the penalty cost for the false-negative type is set, the fewer false-negative cases are produced.

  20. Conclusions • Millions of people in the United States and around the world have diabetes. • The ability to predict diabetes early plays an important role in the patient’s treatment process. • The correct prediction percentage of current algorithms may oftentimes be coincidental. • This study identified the need for different penalty costs for the false-positive, the false-negative, and the unclassifiable types of errors in medical data mining. • This study applied a meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA), for enhancing diabetes prediction. • The HBA first defines the desired goal as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable costs. • The HBA is then used in conjunction with traditional classification algorithms (such as SVMs, DTs, and ANNs) to enhance diabetes prediction. • The Pima Indian Diabetes dataset has been used for evaluating the performance of the HBA. • The obtained results appear to be very important both for accurately predicting diabetes and for the medical data mining community in general. These slides are also available at: http://www.csc.lsu.edu/trianta

  21. References
• Asuncion A. and D. J. Newman, “UCI Machine Learning Repository,” University of California, Irvine, California, USA, School of Information and Computer Sciences, 2007.
• Smith J. W., J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” Proceedings of the 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, 1988, pp. 261-265.
• Jankowski N. and V. Kadirkamanathan, “Statistical control of RBF-like networks for classification,” Proceedings of the 7th International Conference on Artificial Neural Networks (ICANN), Lausanne, Switzerland, 1997, pp. 385-390.
• Au W. H. and K. C. C. Chan, “Classification with degree of membership: A fuzzy approach,” Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, pp. 35-42.
• Rutkowski L. and K. Cpalka, “Flexible neuro-fuzzy systems,” IEEE Transactions on Neural Networks, Vol. 14, 2003, pp. 554-574.
• Davis W. L. IV, “Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BK-Products,” PhD Dissertation in Computer Science, 2006, pp. 71-74.
• Michie D., D. J. Spiegelhalter, and C. C. Taylor, “Machine Learning, Neural and Statistical Classification,” Series in Artificial Intelligence, Prentice Hall, Englewood Cliffs, Chapter 9, 1994, pp. 157-160.
• Pham H. N. A. and E. Triantaphyllou, “The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining,” in Soft Computing for Knowledge Discovery and Data Mining, (O. Maimon and L. Rokach, Editors), Springer, New York, USA, 2007, Part 4, Chapter 5, pp. 391-431.
• Pham H. N. A. and E. Triantaphyllou, “An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining,” submitted for publication, January 2008.

Thank you. Any questions?
