
Data Mining-Knowledge Presentation—ID3 algorithm

Data Mining-Knowledge Presentation—ID3 algorithm. Lecture 19. Prof. Sin-Min Lee Department of Computer Science. Data Mining Tasks. Predicting onto new data by using rules or patterns and behaviors Classification Estimation


Presentation Transcript


  1. Data Mining-Knowledge Presentation—ID3 algorithm Lecture 19 Prof. Sin-Min Lee Department of Computer Science

  2. Data Mining Tasks • Predicting on new data by using rules, patterns, and behaviors: Classification, Estimation • Understanding the groupings, trends, and characteristics of your customers: Segmentation • Visualizing the Euclidean spatial relationships, trends, and patterns of your data: Description

  3. Stages of Data Mining Process 1. Data gathering, e.g., data warehousing. 2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125. 3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat. 4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort. 5. Visualization of the data. 6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.

  4. Clusters of Galaxies • Skycat clustered 2×10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. • The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.

  5. Clustering: Examples • Cholera outbreak in London

  6. Decision trees are an alternative way of structuring rule information.

  7. [Figure: decision tree rooted at outlook. sunny → humidity (normal → P, otherwise N); overcast → P; rain → windy (true → N, false → P).]

  8. A Classification Rule Based on the Tree • if outlook = overcast then P • if outlook = sunny & humidity = normal then P • if outlook = rain & windy = false then P • Combined: if outlook = overcast ∨ (outlook = sunny & humidity = normal) ∨ (outlook = rain & windy = false) then P
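The rules on this slide translate directly into a small function. This is only a sketch: the name `classify`, the string and boolean encodings of the attribute values, and the choice to return N for every case not covered by a P rule are assumptions made for the example.

```python
def classify(outlook: str, humidity: str, windy: bool) -> str:
    """Return 'P' or 'N' using the three P rules read off the tree."""
    if outlook == "overcast":
        return "P"
    if outlook == "sunny" and humidity == "normal":
        return "P"
    if outlook == "rain" and not windy:
        return "P"
    # Assumption: any case not matched by a P rule is classified N.
    return "N"


print(classify("sunny", "normal", windy=True))   # P
print(classify("rain", "high", windy=True))      # N
```

Because the three conditions are joined by "or", the order in which they are tested does not affect the result.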

  9. Each internal node tests an attribute. Each branch corresponds to an attribute value. Each leaf node assigns a classification. [Figure: Outlook node with branches Sunny, Overcast, Rain; the Sunny branch leads to a Humidity node with branches High → No and Normal → Yes.]

  10. Top-Down Induction of Decision Trees (ID3) • 1. A ← the “best” decision attribute for the next node • 2. Assign A as the decision attribute for the node • 3. For each value of A, create a new descendant • 4. Sort the training examples to the leaf nodes according to the attribute value of the branch • 5. If all training examples are perfectly classified (same value of the target attribute), stop; otherwise iterate over the new leaf nodes.
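The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names `entropy`, `gain`, and `id3`, the dict-of-dicts tree representation, and the four-row toy dataset are all assumptions made for the example, and "best" is taken to mean highest information gain, as defined on the later slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:          # step 5: perfectly classified, stop
        return labels[0]
    if not attrs:                      # assumption: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))   # steps 1-2
    tree = {best: {}}
    for value in sorted(set(r[best] for r in rows)):         # step 3
        idx = [i for i, r in enumerate(rows) if r[best] == value]  # step 4
        tree[best][value] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return tree


# Hypothetical toy data: the class depends only on attribute "a".
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["yes", "yes", "no", "no"]
tree = id3(rows, labels, ["a", "b"])
print(tree)  # {'a': {'x': 'yes', 'y': 'no'}}
```

The sketch only creates branches for attribute values that actually occur in the training subset, so an unseen value at prediction time would need separate handling.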

  11. Which attribute is “best”? [Figure: two candidate splits of S = [29+,35-]. A1: True → [21+,5-], False → [8+,30-]. A2: True → [18+,33-], False → [11+,2-].]

  12. Entropy • S is a sample of training examples • $p_+$ is the proportion of positive examples in S • $p_-$ is the proportion of negative examples in S • Entropy measures the impurity of S: $\mathrm{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
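As a quick numerical check of the entropy formula, here is a small sketch that computes it from positive and negative counts. The function name `binary_entropy` and the convention that a zero count contributes zero (0 · log2 0 = 0) are assumptions of the example, not something stated on the slide.

```python
from math import log2


def binary_entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a sample with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # convention: 0 * log2(0) = 0
            p = count / total
            result -= p * log2(p)
    return result


print(round(binary_entropy(29, 35), 3))  # 0.994
print(binary_entropy(7, 7))              # 1.0 (maximally impure)
print(binary_entropy(4, 0))              # 0.0 (pure)
```

Entropy is 0 for a pure sample and reaches its maximum of 1 bit when the two classes are equally represented.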

  13. Entropy • Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code). Why? • Information theory: an optimal-length code assigns $-\log_2 p$ bits to a message having probability $p$. • So the expected number of bits to encode the class (+ or -) of a random member of S is $p_+(-\log_2 p_+) + p_-(-\log_2 p_-) = -p_+ \log_2 p_+ - p_- \log_2 p_-$.

  14. Information Gain [Figure: the A1 and A2 splits of S = [29+,35-] from slide 11.] • Gain(S,A): expected reduction in entropy due to sorting S on attribute A: $\mathrm{Gain}(S,A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$ • Entropy([29+,35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99

  15. Information Gain [Figure: the A1 and A2 splits of S = [29+,35-] from slide 11.] • A1: Entropy([21+,5-]) = 0.71, Entropy([8+,30-]) = 0.74, Gain(S,A1) = Entropy(S) - (26/64)*Entropy([21+,5-]) - (38/64)*Entropy([8+,30-]) = 0.27 • A2: Entropy([18+,33-]) = 0.94, Entropy([11+,2-]) = 0.62, Gain(S,A2) = Entropy(S) - (51/64)*Entropy([18+,33-]) - (13/64)*Entropy([11+,2-]) = 0.12 • A1 yields the larger gain, so it is the better attribute to split on.
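The gain figures for A1 and A2 can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper names `entropy` and `gain` and the representation of each subset as a `(positives, negatives)` tuple are choices made for the example.

```python
from math import log2


def entropy(pos, neg):
    """Two-class entropy in bits; a zero count contributes nothing."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder


S = (29, 35)
gain_a1 = gain(S, [(21, 5), (8, 30)])    # A1: True / False branches
gain_a2 = gain(S, [(18, 33), (11, 2)])   # A2: True / False branches
print(round(gain_a1, 2), round(gain_a2, 2))  # 0.27 0.12
```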

  16. Training Examples

  17. Selecting the Next Attribute • S = [9+,5-], E = 0.940 • Humidity: High → [3+,4-], E = 0.985; Normal → [6+,1-], E = 0.592 • Wind: Weak → [6+,2-], E = 0.811; Strong → [3+,3-], E = 1.0 • Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151 • Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

  18. Selecting the Next Attribute • S = [9+,5-], E = 0.940 • Outlook: Sunny → [2+,3-], E = 0.971; Overcast → [4+,0-], E = 0.0; Rain → [3+,2-], E = 0.971 • Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
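Choosing the root attribute amounts to comparing the three gains. The sketch below recomputes them from the class counts on slides 17 and 18; the `splits` dictionary and the helper names are assumptions of the example. Computed without rounding the intermediate entropies, the Humidity gain comes out near 0.152 rather than the slide's 0.151.

```python
from math import log2


def entropy(pos, neg):
    """Two-class entropy in bits; a zero count contributes nothing."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in children
    )


S = (9, 5)
# (positives, negatives) per branch, as listed on the slides.
splits = {
    "Humidity": [(3, 4), (6, 1)],          # High, Normal
    "Wind": [(6, 2), (3, 3)],              # Weak, Strong
    "Outlook": [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rain
}
gains = {name: gain(S, parts) for name, parts in splits.items()}
best = max(gains, key=gains.get)
print(best)  # Outlook
```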

  19. ID3 Algorithm • S = [D1,D2,…,D14], [9+,5-], split on Outlook • Sunny: Ssunny = [D1,D2,D8,D9,D11], [2+,3-] → ? • Overcast: [D3,D7,D12,D13], [4+,0-] → Yes • Rain: [D4,D5,D6,D10,D14], [3+,2-] → ? • Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970 • Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570 • Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019

  20. [Figure: the resulting decision tree. Outlook = Sunny → Humidity (High → No [D1,D2,D8]; Normal → Yes [D9,D11]); Outlook = Overcast → Yes [D3,D7,D12,D13]; Outlook = Rain → Wind (Strong → No [D6,D14]; Weak → Yes [D4,D5,D10]).]

  21. The ID3 Algorithm • Given: a set of disjoint target classes {C1, C2, …, Ck}, and a set of training data, S, containing objects of more than one class. • Let T be any test on a single attribute of the data, with O1, O2, …, On representing the possible outcomes of applying T to any object x (written as T(x)). • T produces a partition {S1, S2, …, Sn} of S such that Si = { x ∈ S | T(x) = Oi }.

  22. [Figure: S at the root, with branches O1, O2, …, On leading to the subsets S1, S2, …, Sn.] • Proceed recursively to replace each Si with a decision tree. • Crucial factor: selecting the tests.
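The partition induced by a test T can be sketched as a grouping operation. The function name `partition`, the dictionary keyed by outcome, and the three sample objects are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

X = TypeVar("X")


def partition(S: Iterable[X], T: Callable[[X], Hashable]) -> Dict[Hashable, List[X]]:
    """Group the objects of S by the outcome of test T: S_i = {x in S | T(x) = O_i}."""
    parts: Dict[Hashable, List[X]] = defaultdict(list)
    for x in S:
        parts[T(x)].append(x)
    return dict(parts)


# Hypothetical objects; the test T inspects a single attribute.
S = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "sunny", "windy": True},
]
parts = partition(S, lambda x: x["outlook"])
print(sorted(parts))         # ['rain', 'sunny']
print(len(parts["sunny"]))   # 2
```

Every object lands in exactly one subset, so the subsets are disjoint and together cover S, which is what makes the recursive step well defined.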

  23. In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory). • $M = \{m_1, m_2, \ldots, m_n\}$: set of messages • $p(m_i)$: probability of the message $m_i$ being received • $I(m_i) = -\log p(m_i)$: amount of information of message $m_i$ • $U(M) = \sum_i p(m_i)\, I(m_i)$: uncertainty of the set $M$ • Quinlan’s assumptions: • A correct decision tree for S will classify objects in the same proportion as their representation in S. • Given a case to classify, a test can be regarded as the source of a message about that case.

  24. Let $N_i$ be the number of cases in S that belong to class $C_i$: $p(c \in C_i) = N_i / |S|$. The uncertainty, $U(S)$, measures the average amount of information needed to determine the class of a random case $c \in S$. Uncertainty measure after S has been partitioned by T: $U_T(S) = \sum_i \frac{|S_i|}{|S|}\, U(S_i)$. Select the test T that gains the most information, i.e., where $G_S(T) = U(S) - U_T(S)$ is maximal.
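These definitions extend beyond two classes. The following sketch evaluates U(S), U_T(S), and G_S(T) on a made-up three-class sample; the function names, the base-2 logarithm, and the six `(shape, colour, class)` cases are assumptions of the example.

```python
from collections import Counter
from math import log2


def uncertainty(classes):
    """U(S): average information in bits needed to identify the class of a random case."""
    n = len(classes)
    return sum(-(k / n) * log2(k / n) for k in Counter(classes).values())


def information_gain(cases, test, class_of):
    """G_S(T) = U(S) - U_T(S), where test maps a case to its outcome."""
    subsets = {}
    for case in cases:
        subsets.setdefault(test(case), []).append(class_of(case))
    u_t = sum(len(s) / len(cases) * uncertainty(s) for s in subsets.values())
    return uncertainty([class_of(c) for c in cases]) - u_t


# Hypothetical cases: (shape, colour, class).
cases = [
    ("round", "red", "C1"), ("round", "blue", "C1"),
    ("square", "red", "C2"), ("square", "blue", "C2"),
    ("flat", "red", "C3"), ("flat", "blue", "C3"),
]
by_shape = information_gain(cases, lambda c: c[0], lambda c: c[2])
by_colour = information_gain(cases, lambda c: c[1], lambda c: c[2])
print(round(by_shape, 3), round(by_colour, 3))  # 1.585 0.0
```

In this toy sample the shape test removes all uncertainty (log2 3 bits), while the colour test leaves every subset as mixed as S itself, so it gains nothing.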

  25. Evaluation of ID3 • The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number. • Its computational complexity depends on the cost of choosing the next test to branch on. • It was adapted to deal with noisy and incomplete data. • It is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available. • However, this method is not incremental. • Further modifications were introduced in C4.5, e.g.: pruning the decision tree in order to avoid overfitting, and a better test selection heuristic.
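The bias toward many-outcome tests is easy to see numerically. In the sketch below, a hypothetical "day identifier" attribute puts each of the 14 examples from the earlier slides into its own subset; every subset is then pure, so the gain equals the full entropy of S even though such a test cannot generalize. The attribute and the helper names are assumptions made for the illustration.

```python
from math import log2


def entropy(pos, neg):
    """Two-class entropy in bits; a zero count contributes nothing."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in children
    )


S = (9, 5)
outlook = [(2, 3), (4, 0), (3, 2)]        # three outcomes
day_id = [(1, 0)] * 9 + [(0, 1)] * 5      # fourteen outcomes, one example each

gain_outlook = gain(S, outlook)
gain_day_id = gain(S, day_id)
print(round(gain_outlook, 3), round(gain_day_id, 3))  # 0.247 0.94
```

A selection criterion based on raw gain would therefore prefer the identifier over Outlook, which motivates the improved test selection heuristic mentioned for C4.5.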

  26. Search Space and Search Trees • The search space is a logical space composed of: • nodes, which are search states • links, which are all legal connections between search states • e.g., in chess, there is no link between states where White castles after having previously moved the king • It is always just an abstraction. • Think of search algorithms as trying to navigate this extremely complex space.

  27. Search Trees • Search trees do not summarise all possible searches • instead an abstraction of one possible search • Root is null state • edges represent one choice • e.g. to set value of A first • child nodes represent extensions • children give all possible choices • leaf nodes are solutions/failures • Example in SAT • algorithm detects failure early • need not pick same variables everywhere

  28. Definition • A tree shaped structure that represents a set of decisions. These decisions are used as a basis for predictions. • They represent rules for classifying datasets. Useful knowledge can be extracted by this classification.
