Chapter 3 Data Mining Methodology and Best Practices
Data Mining’s Virtuous Cycle
• Identify the business opportunity*
• Mine data to transform it into actionable information
• Act on the information
• Measure the results
* Textbook interchanges “problem” with “opportunity”
It’s time to…
• Turn our attention to translating business opportunities (problems) into data mining opportunities (problems), including:
  • Transforming data into information via:
    • Hypothesis testing
    • Profiling
    • Predictive modeling
  • Taking action
    • Model deployment
    • Scoring
  • Measurement
    • Assessing a model’s stability & effectiveness before it is used
DM General Guidelines
• The DM virtuous cycle (4 steps) is iterative
• No steps should be skipped
• Common sense prevails with respect to how rigorously each step is carried out
  • Simplest approach: ad-hoc queries to test hypotheses
  • Rigorous approach: the 4 steps of the virtuous cycle expand to become an 11-step methodology
Why have a Methodology?
• A DM methodology that includes DM best practices helps to avoid:
  • Learning things that are not true
  • Learning things that are true, but not useful
• Learning things that are not true is the more dangerous of the two. Why is that? …
Learning Things that are not True
• Patterns may not represent any underlying rule
• The sample may not reflect its parent population, which introduces bias
• Data may be at the wrong level of detail (granularity; aggregation)
Examples?
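The first point, patterns that do not represent any underlying rule, can be illustrated with a small sketch. It assumes purely random data: a random target and many random candidate variables, with no real relationship anywhere. The function names, the row and feature counts, and the seed are all hypothetical choices made for illustration, not anything from the textbook.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_rows=20, n_features=200, seed=0):
    """Largest |correlation| between a random target and random features.

    Every value is drawn independently, so any correlation found here
    is a coincidence rather than an underlying rule.
    """
    rng = random.Random(seed)
    target = [rng.random() for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.random() for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    print(f"strongest correlation found by chance: {strongest_chance_correlation():.2f}")
```

With few rows and many candidate variables, at least one variable will usually look noticeably correlated with the target even though nothing real connects them, which is one reason findings should be checked on data that was not used to discover them.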
Learning Things that are True, but not Useful
• Learning things that are already known. Examples?
• Learning things that cannot be used. Examples?
Hypothesis Testing
• A hypothesis is a proposed explanation whose validity can be tested by analyzing data
• Purpose is to validate or invalidate preconceived ideas
• Usually part of a DM project
• Data collection done via:
  • Observation
  • Experiment (lab, survey)
• Bias must be avoided; doing so usually requires both analytical and business knowledge
• Hypothesis testing is useful, but often insufficient, which leads us to…
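As a concrete illustration of testing a preconceived idea against data, here is a minimal sketch of a two-proportion z-test, one common way to check whether two groups respond at different rates. The choice of test, the function name, and the counts in the usage example are assumptions made for illustration; the slides do not prescribe a particular statistical test.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two response rates.

    Returns (z, p_value). Uses the pooled-proportion standard error,
    which assumes both groups are large, independent samples.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: responders out of customers contacted per group.
    z, p = two_proportion_z_test(120, 1000, 80, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone, but it says nothing about whether the two groups were selected without bias; that still depends on how the data was collected.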
Models
• Model: An explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world.
• We use models every day…Examples?
• The data used to build DM models is called the Model Set
• The new data to which a model is applied is called the Score Set
• Model Set includes:
  • Training Set: used to build a set of DM models
  • Validation Set: used to choose the best DM model
  • Test Set: used to determine how the chosen model performs on unseen data
• Models: 3 kinds of DM models for 3 kinds of tasks…next slide
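The partitioning of a model set can be sketched as follows. This is a minimal example under stated assumptions: the 60/20/20 proportions, the function name `split_model_set`, and the fixed seed are illustrative choices, not values given in the slides.

```python
import random


def split_model_set(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly partition records into training, validation, and test sets.

    The fractions are illustrative defaults; whatever is left after the
    training and validation sets becomes the test set.
    """
    shuffled = list(records)           # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


if __name__ == "__main__":
    training, validation, test = split_model_set(range(100))
    print(len(training), len(validation), len(test))
```

Keeping the three sets disjoint is the point of the exercise: the test set only gives an honest estimate of performance if it played no part in building or choosing the model.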
Profiling and Prediction
• Profiling
  • Describes what is in the data
  • Demographic variables
  • Inability to distinguish cause and effect (e.g., beer drinkers and males)
  • Focus is on the past to explain it (timing = past)
• Prediction
  • Finding patterns in data from prior period(s) that are capable of explaining or anticipating outcomes in a later period (timing = future)
  • Predictive models require separation in time between the model inputs and output.
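The requirement that inputs and output be separated in time can be made explicit when assembling rows for a predictive model. The sketch below assumes a hypothetical per-customer history keyed by month number, with the target defined as "any activity in the target month"; the function name and data layout are illustrative only.

```python
def build_model_row(history, input_months, target_month):
    """Build one (inputs, target) row with inputs strictly before the target.

    history maps a month number to an observed value (for example, the
    number of purchases in that month); missing months count as 0.
    """
    if max(input_months) >= target_month:
        raise ValueError("inputs must come strictly before the target period")
    inputs = [history.get(month, 0) for month in input_months]
    # Hypothetical target: 1 if the customer was active in the target month.
    target = 1 if history.get(target_month, 0) > 0 else 0
    return inputs, target


if __name__ == "__main__":
    history = {1: 3, 2: 0, 3: 5, 4: 2}   # hypothetical monthly purchase counts
    print(build_model_row(history, input_months=[1, 2, 3], target_month=4))
```

Rejecting overlapping periods up front guards against building a model that quietly uses information from the same period it is supposed to predict.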
Data Mining Methodology
1. Translate biz opportunity (problem) into DM opportunity (problem)
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring information to the surface
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again
In-Class Exercise
• 10 teams
• Each team takes one of methodology steps 1-10 (step 11 is skipped)
• Discuss it and prepare a 5-minute (or shorter) summary for your colleagues
• Each team presents its summary
Discussion: 15 minutes
Present: 45 minutes