CS 490 Sample Project Mining the Mushroom Data Set Kirk Scott
This set of overheads begins with the contents of the project check-off sheet. • After that, an example project is given.
CS 490 Data Mining Project Check-Off Sheet • Student's name: _______ • 1. Meets requirements for formatting. (No pts.) [ ] • 2. Oral presentation given. (No pts.) [ ] • 3. Attendance at Other Students' Presentations. Partial points for partial attendance. 20 pts.____
I. Background Information on the Problem Domain and the Data Set
Name of Data Set: _______ • I.A. Random Information Drawn from the Online Data Files Posted with the Data Set. 3 pts.___ • I.B. Contents of the Data File. 3 pts.___ • I.C. Summary of Background Information. 3 pts.___ • I.D. Screen Shot of Open File. 3 pts.___
II. Case 1. This Needs to Be a Classification Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 2. This Needs to Be a Clustering Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 3. This Needs to Be an Association Mining Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 4. Any Kind of Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 5. Any Kind of Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 6. Any Kind of Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 7. Any Kind of Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
II. Case 8. Any Kind of Algorithm • Name of Algorithm: _______ • i. Output Results. 3 pts.___ • ii. Explanation of Item. 2 pts.___ • iii. Graphical or Other Special Purpose Additional Output. 2 pts.___
III.A. Random Babbling. 6 pts.___ • III.B. An Application of the Paired t-test. 6 pts.___ • Total out of 100 points possible: _____
Example Project • The point of this sample project is to illustrate what you should produce for your project. • In addition to the content of the project, information given in italics provides instructions or commentary or background information.
Needless to say, your project should simply contain all of the necessary content. • You don't have to provide italicized commentary.
I. Background Information on the Problem Domain and the Data Set • If you are working with your own data set you will have to produce this documentation entirely yourself. • If you are working with a downloaded data set, you can use whatever information comes with the data set. • You may paraphrase that information, rearrange it, do anything to it to help make your presentation clear.
You don't have to follow academic practice and try to document or footnote what you did when presenting the information. • The goal is simply adaptation for clear and complete presentation. • What I'm trying to say is this: There will be no penalty for "plagiarism".
What I would like you to avoid is simply copying and pasting, which leads to a mass of information that is neither relevant nor helpful to the reader (the teacher, who will be assigning the grades) in understanding what you were doing. • Reorganize and edit as necessary in order to make it clear.
Finally, include a screen shot of the explorer view of the data set after you've opened the file containing it. • Already here you have a choice of what exactly to show and you need to write some text explaining what the screen shot displays.
I.A. Random Information Drawn from the Online Data Files Posted with the Data Set • This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (pp. 500-525). • Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. • This last class was combined with the poisonous one.
The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
Number of Instances: 8124 • Number of Attributes: 22 (all nominally valued) • Attribute Information: (classes: edible=e, poisonous=p)
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s • 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s • 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f • 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s • 6. gill-attachment: attached=a,descending=d,free=f,notched=n • 7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n • 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y • 10. stalk-shape: enlarging=e,tapering=t • 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s • 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s • 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y • 16. veil-type: partial=p,universal=u • 17. veil-color: brown=n,orange=o,white=w,yellow=y • 18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z • 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y • 22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11. • Class Distribution: • -- edible: 4208 (51.8%) • -- poisonous: 3916 (48.2%) • -- total: 8124 instances
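The percentages above follow directly from the counts. The following is a minimal Python sketch that recomputes them; the variable and function names are illustrative and are not part of the data set documentation. The last figure is the accuracy a classifier would reach by always guessing the larger class, which is a useful floor to keep in mind when reading later results.

```python
# Counts copied from the data set documentation above.
TOTAL = 8124
EDIBLE = 4208
POISONOUS = 3916
MISSING_STALK_ROOT = 2480  # all missing values are in attribute #11


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


edible_pct = percent(EDIBLE, TOTAL)
poisonous_pct = percent(POISONOUS, TOTAL)
missing_pct = percent(MISSING_STALK_ROOT, TOTAL)

# Always guessing the larger class gives the baseline accuracy.
baseline_pct = max(edible_pct, poisonous_pct)

print(f"edible: {edible_pct}%  poisonous: {poisonous_pct}%")
print(f"instances missing stalk-root: {missing_pct}%")
print(f"majority-class baseline: {baseline_pct}%")
```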
Logical rules for the mushroom data sets. • This is information derived by researchers who have already worked with the data set. • Logical rules given below seem to be the simplest possible for the mushroom dataset and therefore should be treated as benchmark results.
Disjunctive rules for poisonous mushrooms, from most general to most specific: • P_1) odor=NOT(almond.OR.anise.OR.none) • 120 poisonous cases missed, 98.52% accuracy • P_2) spore-print-color=green • 48 cases missed, 99.41% accuracy
P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown) • 8 cases missed, 99.90% accuracy • P_4) habitat=leaves.AND.cap-color=white • 100% accuracy • Rule P_4) may also be • P_4') population=clustered.AND.cap_color=white
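The accuracy quoted with each rule is consistent with dividing the correctly classified instances by all 8124 instances. A short Python check follows. It assumes that the "cases missed" counts are the errors remaining over the whole data set after each rule is added, and that 100% accuracy for P_4 means 0 cases missed; the source does not spell out either point.

```python
TOTAL = 8124  # number of instances in the data set

# Cases still missed after adding each rule, as quoted above.
# The 0 for P_4 is inferred from the stated 100% accuracy.
missed_after_rule = {"P_1": 120, "P_2": 48, "P_3": 8, "P_4": 0}


def accuracy_pct(missed, total=TOTAL):
    """Percentage of instances classified correctly."""
    return 100.0 * (total - missed) / total


accuracies = {rule: accuracy_pct(m) for rule, m in missed_after_rule.items()}

for rule, acc in accuracies.items():
    print(f"{rule}: {acc:.2f}%")
```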
These rules involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: • odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green • gives 48 errors, or 99.41% accuracy on the whole dataset.
Several slightly more complex variations on these rules exist, involving other attributes, such as gill_size, gill_spacing, stalk_surface_above_ring, but the rules given above are the simplest we have found.
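To make the benchmark rules concrete, here is a minimal Python sketch that applies P_1 through P_4 as a disjunction: a mushroom is called poisonous if any rule fires, and edible otherwise. The function names and the column order (class label first, then the 22 attributes in the order listed above) are assumptions made for illustration. The sample lines are the five records shown in section I.B.

```python
# Column order assumed to match the data file: class label first,
# then the 22 attributes in the order listed in section I.A.
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]


def parse_record(line):
    """Turn one comma-separated line into an attribute -> code dict."""
    return dict(zip(COLUMNS, line.strip().split(",")))


def is_poisonous(m):
    """Apply rules P_1 to P_4; any rule that fires means poisonous."""
    p1 = m["odor"] not in {"a", "l", "n"}           # not almond/anise/none
    p2 = m["spore-print-color"] == "r"              # green
    p3 = (m["odor"] == "n"
          and m["stalk-surface-below-ring"] == "y"  # scaly
          and m["stalk-color-above-ring"] != "n")   # not brown
    p4 = m["habitat"] == "l" and m["cap-color"] == "w"  # leaves, white
    return p1 or p2 or p3 or p4


SAMPLE = [
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
    "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g",
]

predictions = ["p" if is_poisonous(parse_record(r)) else "e" for r in SAMPLE]
actual = [r[0] for r in SAMPLE]
print("predicted:", predictions)
print("actual:   ", actual)
```

On these five records the two poisonous mushrooms are caught by P_1 alone (pungent odor), and none of the rules fires for the three edible ones. Five records are far too few to say anything about accuracy; the sketch only shows how the rules read as code.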
I.B. Contents of the Data File • Here is a snippet of five records from the data file: • p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u • e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g • e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m • p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u • e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
Incidentally, the data file contents also exist in expanded form. • Here is a record from that file: • EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
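The relationship between the two file forms can be sketched as a table lookup. The Python below is an illustration only: it carries code tables for the class label and the first five attributes, copied from the attribute list in section I.A, and leaves the remaining fields as single-letter codes. That the expanded file uses the upper-cased value names, and that the class label is the first field, are assumptions based on the sample records.

```python
# One code table per field position, class label first. Only the first
# six positions are filled in; the other 17 attributes would follow the
# same pattern.
CODE_TABLES = [
    {"e": "EDIBLE", "p": "POISONOUS"},
    {"b": "BELL", "c": "CONICAL", "x": "CONVEX",
     "f": "FLAT", "k": "KNOBBED", "s": "SUNKEN"},
    {"f": "FIBROUS", "g": "GROOVES", "y": "SCALY", "s": "SMOOTH"},
    {"n": "BROWN", "b": "BUFF", "c": "CINNAMON", "g": "GRAY", "r": "GREEN",
     "p": "PINK", "u": "PURPLE", "e": "RED", "w": "WHITE", "y": "YELLOW"},
    {"t": "BRUISES", "f": "NO"},
    {"a": "ALMOND", "l": "ANISE", "c": "CREOSOTE", "y": "FISHY", "f": "FOUL",
     "m": "MUSTY", "n": "NONE", "p": "PUNGENT", "s": "SPICY"},
]


def expand(line):
    """Replace codes with names wherever a code table is available."""
    out = []
    for i, code in enumerate(line.strip().split(",")):
        table = CODE_TABLES[i] if i < len(CODE_TABLES) else {}
        out.append(table.get(code, code))  # unknown codes pass through
    return ",".join(out)


# Second record from the five-record snippet above.
expanded = expand("e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g")
print(expanded)
```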
Section I.C should be written by you. You should summarize the information given above, which is largely copy and paste, in a brief, well-organized paragraph that you write yourself and which conveys the basics in a concise way.
The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they needed to know in order to keep reading the rest of your write-up and have some idea of what is going on.
I.C. Summary of Background Information • The problem domain is the classification of mushrooms as either poisonous/inedible or non-poisonous/edible. • There are 8124 instances in the data set, each described by 22 nominal attributes plus a class label. • Roughly half of the instances are poisonous (48.2%) and half are non-poisonous (51.8%).
There are 2480 missing attribute values, all on the same attribute (stalk-root, attribute 11). • As is to be expected with non-original data sets, this set has already been extensively studied. • Other researchers have provided sets of rules they derived, which can serve as benchmarks when considering the results of applying further data mining algorithms to the data set.
I.D. Screen Shot of Open File • [Screen shot of the explorer view of the open data file] • What this shows: • The cap-shape attribute is chosen out of the list on the left. • Its different values are given in the table in the upper right. • In the lower right, the Edible attribute is selected from a (hidden) drop-down list.
The graph shows the proportion of edible and inedible mushrooms among the instances containing different values of cap-shape.
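The numbers behind a graph of this kind are a cross-tabulation of cap-shape against the class label. The following Python sketch shows the computation on the five records from section I.B. It assumes the class label is the first field and cap-shape the second; the real graph is built from all 8124 instances, so the proportions here are illustrative only.

```python
from collections import Counter, defaultdict

# The five sample records from section I.B.
SAMPLE = [
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
    "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g",
]

# counts[cap_shape][label] = number of instances with that combination
counts = defaultdict(Counter)
for line in SAMPLE:
    fields = line.split(",")
    label, cap_shape = fields[0], fields[1]
    counts[cap_shape][label] += 1

# Turn the counts for each cap-shape value into class proportions.
proportions = {
    shape: {label: n / sum(c.values()) for label, n in c.items()}
    for shape, c in counts.items()
}

for shape, by_label in proportions.items():
    print(shape, by_label)
```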