KDD Cup 2000 Question 1

gary-norris

Presentation Transcript


  1. KDD Cup 2000 Question 1

  2. Overview
     • Objective
       • Given a set of page views, predict whether the visitor will view another page or not
     • Data
       • Raw Data - Clicks
       • Aggregated Data - Sessions
       • Some sessions clipped in the middle
       • Indicator: Session continues
     • Methods and Tools
       • Exploratory Data Analysis - SAS
       • Classification Tree - Amdocs Business Insight Tool
       • Decision tree
       • Rules Extraction
       • Modeling
       • Combining models

  3. The Winning Model - Introduction: This model combines artificial intelligence (automated procedures) with human intuition and domain-knowledge decisions.

  4. The Winning Model - general scheme

  5. Building Main Model (diagram): Decision Tree (5 trees built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); sub-models: Best Rule, Hybrid Model, Merged Rules

  6. Description of sub-models: Each model captures a different aspect of the overall behavior in the data. Combining (ensembling) the models provides the best prediction results.
     • Best Rule: chooses the most accurate rule satisfied by each record.
     • Hybrid Model: logistic regression on the rule set plus raw field values defines the score for each record.
     • Merged Rules: logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies.
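
The "Best Rule" idea from slide 6 can be illustrated with a minimal Python sketch. The slides do not show how rules are stored or scored, so the `Rule` type, its fields, the record keys (`page_views`, `is_crawler`) and the default score are all hypothetical, chosen only for illustration.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    """Hypothetical rule representation (not taken from the slides)."""
    name: str
    matches: Callable[[dict], bool]  # does the record satisfy the rule?
    accuracy: float                  # accuracy measured on training data
    predicts_continue: bool          # class the rule predicts


def best_rule(record: dict, rules: Sequence[Rule]) -> Optional[Rule]:
    """Return the most accurate rule the record satisfies, or None."""
    satisfied = [r for r in rules if r.matches(record)]
    if not satisfied:
        return None
    return max(satisfied, key=lambda r: r.accuracy)


def best_rule_score(record: dict, rules: Sequence[Rule], default: float = 0.0) -> float:
    """Score in [0, 1]: estimated chance that the session continues.

    Assumption: a rule predicting "not continue" with accuracy a
    is read as a continue-score of 1 - a.
    """
    rule = best_rule(record, rules)
    if rule is None:
        return default
    return rule.accuracy if rule.predicts_continue else 1.0 - rule.accuracy


# Illustrative rules only; field names and accuracies are made up.
example_rules = [
    Rule("many_views", lambda r: r["page_views"] >= 4, 0.70, True),
    Rule("crawler", lambda r: r["is_crawler"], 0.95, True),
    Rule("single_view", lambda r: r["page_views"] == 1, 0.90, False),
]
```

For example, `best_rule({"page_views": 6, "is_crawler": True}, example_rules)` picks the `crawler` rule, because it is the more accurate of the two rules that record satisfies.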

  7. Applying Main Model (diagram): DATA; Decision Tree (5 trees built on 34000 cases); Rule Generator (1466 rules, 111 continue rules); sub-models: Best Rule, Hybrid Model, Merged Rules; Score Model; Average Scores
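
The "Average Scores" step can be sketched as a plain per-record mean over the sub-model scores. This is an assumption: the slides do not say whether the average was weighted, and the function name and list-based inputs are hypothetical.

```python
from typing import List, Sequence


def average_scores(*model_scores: Sequence[float]) -> List[float]:
    """Unweighted per-record mean of the scores from several sub-models.

    Each argument holds one sub-model's scores, in the same record order.
    """
    if not model_scores:
        raise ValueError("at least one sub-model is required")
    n_records = len(model_scores[0])
    if any(len(scores) != n_records for scores in model_scores):
        raise ValueError("all sub-models must score the same records")
    n_models = len(model_scores)
    # zip(*...) groups the scores that belong to the same record.
    return [sum(per_record) / n_models for per_record in zip(*model_scores)]
```

Called as `average_scores(best_rule_scores, hybrid_scores, merged_scores)`, it returns one combined score per record.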

  8. The Winning Model - general scheme

  9. Building The Model (diagram): Decision Tree; Rule Generator; hand-selected rules with near-perfect accuracy; Small Whitebox

  10. Applying The Model (diagram): One-Click; Non-crawlers; Decision Tree; Rule Generator; hand-selected rules with near-perfect accuracy; Small Whitebox; Score = 1; Score = 0

  11. The prediction: The prediction is not much better than choosing the majority class. But it is enough to win first place!

  12. Final Considerations
     • Since both types of errors (false positives and false negatives) are given the same weight, a segment must have a very high probability of continuing to justify not being classified as the majority class.
     • The ratio of continue / not continue in the test set must be estimated as accurately as possible.
     • The cutoff point (the score threshold that divides the two classes) must be carefully chosen.
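
The considerations on slide 12 can be sketched as a small cutoff search on labeled validation data, compared against the majority-class baseline. This is only an illustration with equal error weights (plain accuracy); the function names are hypothetical and the slides do not describe how the winning cutoff was actually chosen.

```python
from typing import Sequence


def accuracy_at_cutoff(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> float:
    """Accuracy when every score >= cutoff is classified as "continue" (1)."""
    correct = sum((s >= cutoff) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_cutoff(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Cutoff with the highest accuracy among the observed scores.

    float("inf") is included so that "predict nobody continues"
    (the majority class in this task) is always a candidate.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates, key=lambda c: accuracy_at_cutoff(scores, labels, c))


def majority_baseline(labels: Sequence[int]) -> float:
    """Accuracy of always predicting the more frequent class."""
    continue_ratio = sum(labels) / len(labels)
    return max(continue_ratio, 1.0 - continue_ratio)
```

A cutoff is only worth using if `accuracy_at_cutoff` at that point exceeds `majority_baseline`. Because the baseline depends directly on the continue / not continue ratio, a poor estimate of that ratio in the test set shifts the best cutoff.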

  13. The End
