Learn how decision trees aid in diagnosing diseases from blood tests, using protein expression data analysis. Understand how to build decision trees, choose nodes, measure information gain, and assess purity for effective medical diagnosis.
Learning Decision Trees: A Brief Tutorial by M Werner
Medical Diagnosis Example
• Goal: diagnose a disease from a blood test
• Clinical use:
  • A blood sample is obtained from the patient
  • The blood is tested to measure the current expression levels of various proteins, say by using a DNA microarray
  • The data is analyzed to produce a Yes or No answer
Data Analysis
• Use a decision tree such as:

[Figure: a binary decision tree. The root tests P1 > K1; each of its Y/N branches leads to a P2 > K2 test; three of the resulting branches lead to further tests (P3 > K3, P4 > K4, P4 > K4) and one leads directly to a No leaf; every remaining leaf is a Yes or No diagnosis.]
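To make the control flow of such a tree concrete, here is a minimal Python sketch. The threshold values K1 to K4 and the exact branch wiring are hypothetical, since the figure's layout is only partially recoverable:

```python
# A sketch of a hand-written decision tree over protein expression levels.
# The thresholds and branch wiring below are hypothetical, for illustration.

K1, K2, K3, K4 = 0.8, 1.2, 0.5, 2.0  # assumed threshold values

def diagnose(p1, p2, p3, p4):
    """Return True (Yes, diseased) or False (No) for one patient's levels."""
    if p1 > K1:
        if p2 > K2:
            return p3 > K3    # leaf decided by the P3 test
        return p4 > K4        # leaf decided by a P4 test
    if p2 > K2:
        return p4 > K4        # leaf decided by the other P4 test
    return False              # immediate No leaf

print(diagnose(1.0, 1.5, 0.7, 1.0))  # True in this sketch
```

In practice the tests and thresholds are not written by hand but learned from training data, as the next slide describes.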
How to Build the Decision Tree
• Start with blood samples from patients known to either have the disease or not (the training set)
• Suppose there are 20 patients, 10 known to have the disease and 10 known not to
• From the training set, obtain expression levels for all proteins of interest
• E.g. with 20 patients and 50 proteins, we get a 50 x 20 array of real numbers (see the sketch below)
  • Rows are proteins
  • Columns are patients
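As a sketch of that setup, the following assumes scikit-learn and NumPy (neither is named in the slides) and fabricates a random expression matrix purely for illustration:

```python
# A minimal sketch, with fabricated data standing in for real measurements.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))       # 50 proteins (rows) x 20 patients (columns)
y = np.array([1] * 10 + [0] * 10)   # known labels: 10 diseased, 10 not

# scikit-learn expects one row per sample, so transpose to patients x proteins
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X.T, y)

# Every internal node the learner picks is a "protein level > threshold" test
print(clf.predict(X.T[:3]))         # Yes/No (1/0) predictions for 3 patients
```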
Choosing the Decision Nodes
• We would like the tree to be as short as possible
• Start with all 20 patients in one group: a 10/10 split (10 have the disease, 10 don't)
• Choose a protein and a level that gain the most information
• Possible splitting condition: Px > Kx splits the 10/10 group into a 9/3 subgroup (mostly diseased) and a 1/7 subgroup (mostly not diseased)
• Alternative splitting condition: Py > Ky splits the 10/10 group into 7/7 and 3/3 subgroups; each keeps the original 50/50 mix, so this test tells us nothing
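A small sketch of the splitting mechanics; the patients' expression values and the threshold Kx are hypothetical:

```python
# How a single condition Px > Kx partitions one group of patients.
patients = [  # (Px expression level, has_disease) -- hypothetical values
    (1.9, True), (1.7, True), (1.3, False), (0.4, False), (0.2, False),
]
Kx = 1.0
above = [d for level, d in patients if level > Kx]
below = [d for level, d in patients if level <= Kx]
print(sum(above), len(above) - sum(above))  # diseased/healthy above Kx: 2 1
print(sum(below), len(below) - sum(below))  # diseased/healthy below Kx: 0 2
```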
How to Determine Information Gain
• Purity: a measure of the extent to which the patients in a group share the same outcome
• A group that splits 1/7 is fairly pure: most of its patients don't have the disease
• A 0/8 split is even purer
• A 4/4 split is the opposite of pure. Such a group is said to have high entropy: knowing that a patient is in this group makes her no more or less likely to have the disease
• The decision tree should reduce entropy as test conditions are evaluated
Measuring Purity (Entropy)
• Let f(i,j) = Prob(outcome = j in node i)
• E.g. if node 2 has a 9/3 split:
  • f(2,0) = 9/12 = 0.75
  • f(2,1) = 3/12 = 0.25
• Gini impurity: $I_G(i) = \sum_j f(i,j)\,(1 - f(i,j)) = 1 - \sum_j f(i,j)^2$
• Entropy: $H(i) = -\sum_j f(i,j) \log_2 f(i,j)$
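Both measures follow directly from the definitions above; a short sketch, using the 9/3 node as a worked example:

```python
# Gini impurity and entropy for a node, given its outcome fractions f(i,j).
from math import log2

def gini(fractions):
    """1 - sum_j f(i,j)^2."""
    return 1.0 - sum(f * f for f in fractions)

def entropy(fractions):
    """-sum_j f(i,j) * log2 f(i,j), with 0 * log2(0) taken as 0."""
    return -sum(f * log2(f) for f in fractions if f > 0)

print(gini([0.75, 0.25]))     # the 9/3 node: 0.375
print(entropy([0.75, 0.25]))  # the 9/3 node: ~0.811 bits
print(entropy([0.5, 0.5]))    # a 4/4 node is maximally impure: 1.0 bit
print(entropy([0.0, 1.0]))    # a 0/8 node is perfectly pure: 0.0
```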
• The goal is to use the test that most reduces the total entropy of the subgroups, weighted by subgroup size; this reduction is the information gain (sketched below)
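Putting the pieces together, here is a sketch that scores the two candidate splits from the earlier slide by information gain, i.e. parent entropy minus the size-weighted average of the subgroup entropies:

```python
from math import log2

def entropy(fractions):
    return -sum(f * log2(f) for f in fractions if f > 0)

def node_entropy(diseased, healthy):
    n = diseased + healthy
    return entropy([diseased / n, healthy / n])

def information_gain(parent, children):
    """parent and children are (diseased, healthy) counts."""
    n = sum(parent)
    weighted = sum((d + h) / n * node_entropy(d, h) for d, h in children)
    return node_entropy(*parent) - weighted

# Px > Kx: 10/10 -> 9/3 and 1/7 -- a useful test (~0.30 bits gained)
print(information_gain((10, 10), [(9, 3), (1, 7)]))
# Py > Ky: 10/10 -> 7/7 and 3/3 keeps the 50/50 mix -- zero gain
# (up to floating-point rounding)
print(information_gain((10, 10), [(7, 7), (3, 3)]))
```

Greedy tree builders apply exactly this comparison at every node, picking the protein/threshold pair with the highest gain.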
Links
• http://www.ece.msstate.edu/research/isip/publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
• Decision Trees & Data Mining
• Andrew Moore Tutorial