KDD Cup 2000 Question 1
Overview
• Objective
  • Given a set of page views, predict whether the visitor will view another page or not
• Data
  • Raw Data - Clicks
  • Aggregated Data - Sessions
  • Some sessions clipped in the middle
  • Indicator: Session continues
• Methods and Tools
  • Exploratory Data Analysis - SAS
  • Classification Tree - Amdocs Business Insight Tool
  • Decision tree
  • Rules Extraction
  • Modeling
  • Combining models
The Winning Model - Introduction
This model combines …
• Artificial intelligence, i.e. automated procedures
with
• Human intuition / domain knowledge decisions
Building Main Model
[Diagram. Components shown: Decision Tree (5 trees built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); three rule-based sub-models: Best Rule, Hybrid Model, Merged Rules.]
Description of sub-models
Each model captures a different aspect of the overall behavior in the data. Combining (ensembling) the models provides the best prediction results.
• Best Rule: chooses the most accurate rule satisfied by each record
• Hybrid Model: logistic regression on the rule set plus raw field values, which combine to define the score for each record
• Merged Rules: logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies
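The two rule-only sub-models can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `Rule` structure, the `pages_viewed` field, the rule accuracies, and the logistic weights are hypothetical, and the weights are fixed by hand rather than fitted, so this shows only the form of the scores, not the authors' implementation.

```python
import math
from typing import Callable, Dict, List, NamedTuple

Record = Dict[str, float]


class Rule(NamedTuple):
    name: str
    applies: Callable[[Record], bool]  # True if the record satisfies the rule
    accuracy: float                    # accuracy of the rule (hypothetical values)
    p_continue: float                  # score the rule assigns when it fires


def best_rule_score(record: Record, rules: List[Rule], default: float) -> float:
    """Best Rule: use the most accurate rule among those the record satisfies."""
    fired = [r for r in rules if r.applies(record)]
    if not fired:
        return default  # no rule fires: fall back to a default score
    return max(fired, key=lambda r: r.accuracy).p_continue


def merged_rules_score(record: Record, rules: List[Rule],
                       weights: List[float], intercept: float) -> float:
    """Merged Rules: logistic combination of the rules the record satisfies."""
    z = intercept + sum(w for r, w in zip(rules, weights) if r.applies(record))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical rules on a hypothetical "pages_viewed" field.
RULES = [
    Rule("many_pages", lambda r: r["pages_viewed"] >= 5, 0.70, 0.65),
    Rule("single_page", lambda r: r["pages_viewed"] == 1, 0.80, 0.10),
    Rule("any_page", lambda r: r["pages_viewed"] >= 1, 0.60, 0.40),
]
WEIGHTS = [1.0, -2.0, 0.5]  # hand-picked, not fitted
INTERCEPT = -0.5

one_click = {"pages_viewed": 1}
browser = {"pages_viewed": 7}
print(best_rule_score(one_click, RULES, default=0.3))
print(merged_rules_score(browser, RULES, WEIGHTS, INTERCEPT))
```

The Hybrid Model would extend `merged_rules_score` by adding weighted raw field values to `z`; in practice the weights would come from fitting a logistic regression on training sessions.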
Applying Main Model
[Diagram. Components shown: DATA; Decision Tree (5 trees built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); Best Rule, Hybrid Model, Merged Rules; Score Model; Average Scores.]
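The "Average Scores" step can be sketched as a plain per-record mean over the sub-model scores. Which models feed the average, and whether the average is unweighted, is not stated on the slide; the model names and numbers below are assumptions for illustration.

```python
from typing import Dict, List


def average_scores(model_scores: Dict[str, List[float]]) -> List[float]:
    """Average the per-record scores of several models (unweighted mean)."""
    columns = list(model_scores.values())
    if not columns:
        raise ValueError("need at least one model")
    n_records = len(columns[0])
    if any(len(col) != n_records for col in columns):
        raise ValueError("all models must score the same records")
    # zip(*columns) yields, for each record, its scores from every model
    return [sum(per_record) / len(per_record) for per_record in zip(*columns)]


# Hypothetical scores for two sessions from three sub-models.
scores = {
    "best_rule": [0.2, 0.8],
    "hybrid": [0.4, 0.6],
    "merged_rules": [0.6, 0.4],
}
print(average_scores(scores))
```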
Building The Model
[Diagram. Components shown: Rule Generator; hand-selected rules with near-perfect accuracy; small whitebox Decision Tree.]
Applying The Model
[Diagram. Components shown: Rule Generator; hand-selected rules with near-perfect accuracy (boxes labelled One-Click and Non-crawlers, with outputs Score = 1 and Score = 0); small whitebox Decision Tree.]
The prediction
The prediction is not much better than choosing the majority class, but it is enough to win first place!
Final Considerations
• Since both types of errors (false positives and false negatives) are given the same weight, a segment must have a very high probability of continuing to justify not being classified as the majority class.
• The ratio of continue / not continue in the test set must be estimated as accurately as possible.
• The cutoff point (the score threshold that divides the two classes) must be carefully chosen.
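The cutoff consideration can be made concrete with a small sketch: with equal error weights, the best cutoff on a labelled validation set is the one that minimises the plain misclassification rate, and it is only worth using if it beats the majority-class baseline. The scores and labels below are made up for illustration, and the exhaustive search is an assumed procedure, not necessarily the one used for the competition entry.

```python
from typing import List


def error_rate(scores: List[float], labels: List[int], cutoff: float) -> float:
    """Fraction misclassified when 'continue' is predicted for score >= cutoff."""
    wrong = sum((s >= cutoff) != bool(y) for s, y in zip(scores, labels))
    return wrong / len(labels)


def best_cutoff(scores: List[float], labels: List[int]) -> float:
    """Try every observed score (plus 'never predict continue') as the cutoff."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda c: error_rate(scores, labels, c))


def majority_baseline_error(labels: List[int]) -> float:
    """Error rate of always predicting the more frequent class."""
    p_continue = sum(labels) / len(labels)
    return min(p_continue, 1.0 - p_continue)


# Hypothetical validation data: 1 = session continues, 0 = session ends.
val_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0, 0, 0, 1, 0, 0, 1, 1]

cutoff = best_cutoff(val_scores, val_labels)
print(cutoff, error_rate(val_scores, val_labels, cutoff))
print(majority_baseline_error(val_labels))
```

In this toy data the chosen cutoff is well above 0.5 because "continue" is the minority class, which is the point of the first bullet. A cutoff tuned this way transfers to the test set only if the continue / not continue ratio there is similar to the validation ratio.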