An introduction to data mining --Who should provide Cake for PGF?

An introduction to data mining --Who should provide Cake for PGF? Peng Yin, MI 6, 16/10/2011. Outline: Overview of data mining; Background; Data set and Tables; What is data mining; Decision tree; Decision tree analysis; Common Uses of Data Mining.

Presentation Transcript


  1. An introduction to data mining--Who should provide Cake for PGF? Peng Yin MI 6 16/10/2011

  2. Outline • Overview of data mining • Background • Dataset and Tables • What is data mining • Decision tree • Decision tree analysis • Common Uses of Data Mining

  3. Background • We got notice from a secret organization that an extremely dangerous group named NCL-MS is hiding in Newcastle, in the North of England

  4. About this dataset • It is a tiny subset of the PG students' Personal Secrets. • Some important data were missing, which I obtained from CIS. Thanks to Robin Henderson. • Used Attributes (color-coded on the slide as real valued or symbol valued) • Successfully loaded the dataset with 10 attributes and 15 records

  5. What can we do with the dataset? • Well, we can look at histograms... (histogram bar labels: Female, Male; Pure, Applied, Stats)

  6. Contingency Tables • A better name for a histogram: a One-dimensional Contingency Table • Recipe for making a k-dimensional contingency table: • 1. Pick k attributes from your dataset. Call them a1, a2, …, ak. • 2. For every possible combination of values, a1 = x1, a2 = x2, …, ak = xk, record how frequently that combination occurs • Fun fact: A database person would call this a “k-dimensional datacube”
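
The recipe on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: the toy records, the attribute names (`gender`, `year`, `wealth`) and the helper `contingency_table` are all made up for the example, since the actual dataset is not reproduced in the transcript.

```python
from collections import Counter

# Hypothetical toy records; the real 15-record dataset is not shown in the talk.
records = [
    {"gender": "Male", "year": 1, "wealth": "poor"},
    {"gender": "Female", "year": 1, "wealth": "rich"},
    {"gender": "Male", "year": 2, "wealth": "poor"},
    {"gender": "Male", "year": 1, "wealth": "poor"},
]


def contingency_table(rows, attributes):
    """Count how often each combination of values of `attributes` occurs."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)


# k = 1 is an ordinary histogram; k = 2 is a 2-d contingency table.
histogram = contingency_table(records, ["gender"])
table_2d = contingency_table(records, ["year", "wealth"])

print(histogram)
print(table_2d)
```

Combinations that never occur are simply absent from the `Counter`, which reports a count of 0 when they are looked up.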

  7. A 2-d Contingency Table For each pair of values for attributes (year, wealth) we can see how many records match.

  8. A 2-d Contingency Table • Easier to see “interesting” things if we stretch out the Histogram bars

  9. 3-d contingency tables • These are harder to look at! (figure axes: 1st, 2nd, 3rd, 4th year; Rich, Poor; Male, Female)

  10. On-Line Analytical Processing (OLAP) • Software packages and database add-ons to do this are known as OLAP tools • They usually include point and click navigation to view slices and aggregates of contingency tables • They usually include nice histogram visualization

  11. Time to stop and think • Why would people want to look at contingency tables?

  12. Let’s continue to think • With 10 attributes, how many 1-d contingency tables are there? • How many 2-d contingency tables? • How many 3-d tables? • With 100 attributes how many 3-d tables are there?

  13. Let’s continue to think • With 10 attributes, how many 1-d contingency tables are there? 10 • How many 2-d contingency tables? 10 * 9 / 2 = 45 • How many 3-d tables? 10 * 9 * 8 / 6 = 120 • With 100 attributes how many 3-d tables are there? 100 * 99 * 98 / 6 = 161,700
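
These answers are binomial coefficients: a k-d table is determined by which k of the n attributes are picked, regardless of order. A quick check in Python, where the helper name `count_tables` is ours and not from the slides:

```python
from math import comb


def count_tables(num_attributes, k):
    """Number of distinct k-d contingency tables: choose k of the attributes."""
    return comb(num_attributes, k)


for n, k in [(10, 1), (10, 2), (10, 3), (100, 3)]:
    print(f"{n} attributes, {k}-d tables: {count_tables(n, k)}")
```

The count grows roughly like n^k / k!, which is why inspecting tables by hand stops being practical very quickly.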

  14. Manually looking at contingency tables • Looking at one contingency table: can be as much fun as reading an interesting book • Looking at ten tables: as much fun as watching BBC One • Looking at 100 tables: as much fun as watching an infomercial • Looking at 100,000 tables: as much fun as a three-week November vacation in Sunderland with a dying weasel.

  15. Data Mining • Data Mining is all about automating the process of searching for patterns in the data. • Which patterns are interesting? • Which might be mere illusions? • And how can they be exploited? That is what we’ll look at now. And the answer will turn out to be PGF cake with decision tree learning.

  16. Aim • Information Gain for measuring association between inputs and outputs • Learning a decision tree classifier from data

  17. Goal of Data Mining • Simplification and automation of the overall statistical process, from data source(s) to model application • Changed over the years • Replace statistician → Better models, less grunge work • 1 + 1 = 0 • Many different data mining algorithms / tools available • Statistical expertise required to compare different techniques • Build intelligence into the software

  18. Methods • Decision Trees • Nearest Neighbour Classification • Neural Networks • Rule Induction • K-means Clustering

  19. Learning Decision Trees • A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output. • To decide which attribute should be tested first, simply find the one with the highest information gain. • Then recurse…
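
The slides do not spell out how information gain is computed. A minimal sketch, assuming the usual entropy-based definition (the entropy of the output minus the weighted entropy of the output after splitting on an attribute), could look like this. The toy records and the function names `entropy` and `information_gain` are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of symbolic values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def information_gain(rows, attribute, output):
    """Reduction in entropy of `output` obtained by splitting on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[output])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[output] for row in rows]) - remainder


# Hypothetical toy data: gender determines wealth here, year tells us nothing.
rows = [
    {"gender": "Male", "year": 1, "wealth": "poor"},
    {"gender": "Male", "year": 2, "wealth": "poor"},
    {"gender": "Female", "year": 1, "wealth": "rich"},
    {"gender": "Female", "year": 2, "wealth": "rich"},
]

gains = {a: information_gain(rows, a, "wealth") for a in ["gender", "year"]}
best = max(gains, key=gains.get)
print(gains, best)
```

On this toy data the attribute tested first would be `gender`, because it removes all uncertainty about the output while `year` removes none.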

  20. A Decision Stump

  21. Recursion Step Records for 1st year students Records for 2nd year students Records for 4th year students

  22. Recursion Step

  23. Second level of tree

  24. The final tree (figure: leaf nodes labeled “Predict rich”)

  25. The final tree • Don’t split a node if all matching records have the same output value

  26. The final tree • Don’t split a node if none of the attributes can create multiple non-empty children

  27. Base Cases • Base Case One: If all records in the current data subset have the same output then don’t recurse • Base Case Two: If all records have exactly the same values for every input attribute then don’t recurse

  28. Basic Decision Tree Building Summarized • BuildTree(DataSet, Output) • If all output values are the same in DataSet, return a leaf node that says “predict this unique output” • If all input values are the same, return a leaf node that says “predict the majority output” • Else find attribute X with highest Info Gain • Suppose X has nX distinct values (i.e. X has arity nX). • Create and return a non-leaf node with nX children. • The i’th child should be built by calling BuildTree(DSi, Output) • Where DSi consists of all those records in DataSet for which X = i’th distinct value of X.
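
The BuildTree summary above can be turned into a short runnable sketch. This is one possible reading, not the presenter's implementation: the tree representation (a plain output value for a leaf, a dict for an internal node), the toy records and all function names are assumptions. The second base case is written as "no attribute can create more than one non-empty child", following slide 26, so the recursion always works on a strictly smaller subset.

```python
from collections import Counter
from math import log2


def entropy(values):
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def information_gain(rows, attribute, output):
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[output])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[output] for row in rows]) - remainder


def build_tree(rows, inputs, output):
    outputs = [row[output] for row in rows]
    # Base case one: every record has the same output, so predict it.
    if len(set(outputs)) == 1:
        return outputs[0]
    # Base case two: no attribute can split these records, so predict the majority.
    candidates = [a for a in inputs if len({row[a] for row in rows}) > 1]
    if not candidates:
        return Counter(outputs).most_common(1)[0][0]
    # Otherwise split on the attribute with the highest information gain.
    best = max(candidates, key=lambda a: information_gain(rows, a, output))
    children = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        children[value] = build_tree(subset, inputs, output)
    return {"split_on": best, "children": children}


# Hypothetical toy data, not the dataset used in the talk.
rows = [
    {"gender": "Male", "year": 1, "wealth": "poor"},
    {"gender": "Male", "year": 2, "wealth": "poor"},
    {"gender": "Male", "year": 3, "wealth": "poor"},
    {"gender": "Female", "year": 1, "wealth": "poor"},
    {"gender": "Female", "year": 2, "wealth": "rich"},
    {"gender": "Female", "year": 3, "wealth": "rich"},
]

tree = build_tree(rows, ["gender", "year"], "wealth")
print(tree)
```

On this toy data the root splits on `gender`; the all-poor `Male` branch becomes a leaf immediately (base case one), and the `Female` branch is split again on `year`.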

  29. Data mining is not • Data warehousing • SQL / Ad Hoc Queries / Reporting • Software Agents • Online Analytical Processing (OLAP) • Data Visualization

  30. Uses • Direct mail marketing • Web site personalization • Credit card fraud detection • Gas & jewelry • Bioinformatics • Text analysis • SAS lie detector • Market basket analysis • Beer & baby diapers

  31. Who should provide the Cake for PGF?

  37. Look at all the information gains…

  38. References • Andrew Moore: http://www.autonlab.org/tutorials/ • Doug Alexander: http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/ • Information gain: http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
