A Practical Look at Data Preparation

IRMAC: Data Warehouse SIG November 5, 2002 Data Mining A Practical Look at Data Preparation Jason Brown Cognicase Inc.

Agenda • Crash Course in Data Mining • What • Why • How • The virtuous cycle • Data Preparation • Case Study • Background • Going through the cycle • Data Preparation • Q&A

The Crash Course • What • Why • How

Definitions Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Knowledge Discovery Data Mining is not Data Warehousing, OLAP etc.

Definitions Modeling: • Not an ER type Data Model • A data mining model is computational, full of algorithms • A model can be descriptive or predictive. • A descriptive model helps in understanding underlying processes or behavior. • A predictive model uses known values (input) to predict an unknown value (output)

Input Output Two Types of Data Mining • Directed • Know specifically what we are looking for • Who is likely to respond to our offer? • What our customers going to be worth to us over their lifetime? • Model is a Black Box

Input Output Two Types of Data Mining • Undirected • Not exactly sure what we are looking for • How should we define our Customer Segments? • What is interesting about all of our point of sale data? • Model is a Transparent Box

Decision Trees Rule Induction If …….. Then …….. Neural Networks Clustering Nearest Neighbour Modeling Techniques

Modeling Techniques Decision Trees • The tree is built based on the input of a training data set • Training Data Set is based on historical data • Over sample the data that reflects your question • Each record of the Model Set is run through the branches of the tree until the record reaches a leaf

Modeling Techniques Decision Trees Age < 23 Income < 12 000 N Y Male ? N Y N Y 28% 37%

Modeling Techniques Neural Networks • Neural networks are a nonlinear model -similar to a ‘brain’. • The network is built based on the input of a training set. • Model sets run through this network will return accurate results based on the patterns identified in the training set. • Very Complex

Modeling Techniques Clustering • Clustering finds groups of records that are similar. • For example, customers can be clustered by: • income • age • ytd revenue

Modeling Techniques Clustering Male Income < 12 000 Age < 23 Coke Buyers Male Income < 12 000 Age < 23 Non Coke Buyers

Modeling Techniques Nearest Neighbour • Model is built based on the input of a training set. • Classifies a record by calculating the distances between the record criteria and the training data set • Then it assigns the record to the class that is most common among its nearest neighbours

Modeling Techniques Nearest Neighbour Records plotted based on: IncomeGenderAge Did Not Did Not Bought Coke Bought Coke Bought Coke

Modeling Techniques Rule Induction • A technique that infers generalizations from the information in the data IF age < 19 AND purchase is coke THEN 40% purchase chips • Describes the data, allows us to visualize what is going on

Expenses Revenues The Reasons to Mine Data Increase Profit

The Reasons to Mine Data • For Marketing/CRM • Targeting prospects • Predicting future customer behaviour • Costs Revenues • For Research • Identify drugs likely to be successful • Costs • For Process Improvement • Identify causes of production failures • Costs

Process • Many different processes for Data Mining • Vendor Driven • SAS - SEMMA • Sample, Explore, Modify, Model, Assess • SPSS - 5 A’s • Assess, Access, Analyze, Act, Automate • Consulting Companies • The Virtuous Cycle • Michael Berry and Gordon Linoff

Transform Data Business Problem Act Measure Process The Virtuous Cycle

Business Problem • Define the business problem • Understand the business and the rules • Determine if Data Mining fits the need • Understand the value to the business of solving the problem

Data for Data Mining • Type of Data Values • Categorical • Defined set of values • Ontario, Quebec, PEI … • Ranks • High, Medium, Low • 0 – 20 000, 20 001 – 35 000, 35 001 – 50 000 • Intervals • Date • Time • Temperature • True Numeric • Values that support numeric operations

Conduct Modeling 7 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1 Transform Data Steps

Transform Data Step 1 - Identify Data • What data is required to meet the modeling need? • What data is available? Identify Data 1

Transform Data Step2 - Obtain Data • OLTP • Data Warehouse • Data Marts and OLAP • Self Reported • External Obtain Data 2 Identify Data 1

Transform Data Step 3: Validate & Clean • Solutions: • Change Source • Filter Out • Ignore • Integrate • Predict • Derive a New Variable • Data Issues: • Missing • Fuzzy • Incorrect • Outliers Validate & Clean 3 Obtain Data 2 Identify Data 1

Transform Data Step 4: Transpose to right granularity • Data sets for Data Mining need one view, one record • Grain must be consistent throughout • Aggregates can be problematic • Atomic data is often required to build data set • Training data sets cast from point in time of event looking back 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

Transform Data Step 5: Add Derived Variables • Combined Columns • Summarizations • Features from Columns • Time Series 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

Transform Data Step 6: Prepare Model Set • The Actual Input to the modeling 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

Transform Data Step 7: Conduct Modeling • Get our result • Decision Trees • Neural Networks • Clustering • Nearest Neighbour • Rule Induction 7 Conduct Modeling 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

Act The Business has to actually do something with the results or what was the point? Marketing or Retention Campaigns Business Changes

Measure • Answer 2 Questions • Was the Data Mining effort accurate? • Were the Business Actions successful? • Use different sets of data to compare real results • Actioned Customers vs. Non Actioned • Accuracy Types • Absolute • Our prediction was 80% of Group D would buy Coke and 78% really did • Relative • Our prediction was 80% of Group D would buy Coke but 57 % really did, however Group C which we predicted had a 60% propensity to buy Coke actually bought Coke 42% of the time

And back around Transform Data Business Problem Act Measure

The Case Study

The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure

The Business • One of the top 3 (4?) cellular phone providers in Canada • Recent Acquisitions • Clearnet • Quebectel • Important Business Concepts • Handset • Subscriber • Client • Activity - Activations, Deactivations • Churn • Usage

DW Environment

Strengths at • Commitment to Data Warehousing • Prior Experience in Data Mining • Tools already Established • Strong business support for the outcomes Data Mining would provide

Challenges at • Data Warehouse still in midst of major re-architecture effort • Ongoing billing system integration projects • A data mart for data mining had existed (Clearnet) but it was a victim of both of the above • Successful at Churn Prediction

The Case Study • Background • The Business • Data Warehouse Overview • Strengths and Challenges • The Project • Business Problem • Transform Data • Act • Measure

Using the Virtuous Cycle Transform Data Business Problem Act Measure

Business Problems • Churn Modeling • predict which subscriber is likely to leave • Behavioural Segmentation • clustering subscribers into subgroups based on some commonality • revenue, usage, demographic • Client Value Estimation • the present value of all future profits generated throughout the lifetime of that client

Transform Data Steps Conduct Modeling 7 6 Prepare Model Set 5 Add Derived Variables 4 Transpose to Right Granularity Validate & Clean 3 Obtain Data 2 Identify Data 1

Identify Data • Business Wanted: EVERY POSSIBLE VARIABLE RELATED TO A SUBSCRIBER! • They provided a detailed list, by subject area, of the variables that they believed were required to conduct the kind of Data Mining Activities desired. 1

Identify Data IT Challenges • 19 Subject Areas identified with up to 75 variables each - What is the Priority? • Avoid the big bang - How much can we actually do? • Where to source the data from? • Resources - Who is going to do it? 1

Identify Data Prioritizing • First we asked the business to rate each variable as H, M or L priority • Almost everything was given an H • Then we asked the business to rank the subject areas in order of importance • Hard to convince them of the value • Hard to find consensus • Necessary for determining a release strategy 1

Obtain Data • For each Subject Area and each variable we assessed and documented the following: • Where can it be sourced from (and when)? • What are the known issues? • Q&A back and forth on the variables with business • Identified possible additional variables 2

A Practical Look at Data Preparation

A Practical Look at Data Preparation

Presentation Transcript

A Look At

Another Look at Data Mining

SOX Compliance: A Practical Look at Application Auditor

A First Look at Data Protection Manager 2012

Practical Issues: Sample Preparation

SOX Compliance: A Practical Look at Application Auditor

A Look at …

A Look At

Assessed Practical Preparation

A look at XMM slew data

A first look at the UKIDSS data

SOA What! A Practical Look At Services Oriented Architecture

A Look at Renewal

Preparation for Practical chemistry

A First Look At VERITAS Data

Data Preparation

A Good Look at Data Protection Services

Practical Birth Preparation

A Practical Look at Data Preparation

A first look at the UKIDSS data

C_SAC_2021 Look at our Exam Preparation Products