
KDD Cup Survey



Presentation Transcript


  1. KDD Cup Survey Xinyue Liu

  2. Outline • Nuts and Bolts of KDD Cup • KDD Cup 97-99 • KDD Cup 2000 • Summary

  3. About KDD Cup A knowledge discovery and data mining tools competition held in conjunction with the KDD conferences. It aims at: • Showcasing the best methods for discovering higher-level knowledge from data • Helping to close the gap between research and industry • Stimulating further KDD research and development

  4. Statistics • Participation in KDD Cup grew steadily, especially requests to access the data • Average person-hours per submission: 204 • Max person-hours per submission: 910 • Use of commercial software grew from 44% (Cup 97) to 52% (Cup 98) to 77% (Cup 2000)

  5. Algorithms • Decision trees were the most widely tried and by far the most commonly submitted

  6. KDD Cup 97 • A classification task – to predict direct mail response in the financial services industry • Winners • Charles Elkan, PhD, of UC San Diego, with his Boosted Naive Bayesian (BNB) classifier • Silicon Graphics, Inc. with their software MineSet • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

  7. BNB • Boosting – learns a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor; repeated for T rounds. • BNB – representationally equivalent to a multilayer perceptron with a single hidden layer. • Complexity – O(ef), where e is the number of examples and f is the number of attributes.
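
The boosting loop described on this slide can be sketched as follows. This is a minimal AdaBoost-style illustration over a weighted naive Bayes learner, assuming binary features and labels in {-1, +1}; the function names, the smoothing constant, and the exact reweighting formula are assumptions made for the sketch, not details taken from Elkan's report.

```python
import math

EPS = 1e-3  # smoothing constant (an assumption of this sketch)

def train_weighted_nb(X, y, w):
    """Fit naive Bayes on binary features using per-example weights w."""
    n_feat, w_sum = len(X[0]), sum(w)
    model = {}
    for c in (-1, 1):
        total = sum(wi for wi, yi in zip(w, y) if yi == c)
        prior = (total + EPS) / (w_sum + 2 * EPS)
        # Smoothed weighted estimate of P(feature j = 1 | class c)
        probs = []
        for j in range(n_feat):
            on = sum(wi for wi, xi, yi in zip(w, X, y) if yi == c and xi[j] == 1)
            probs.append((on + EPS) / (total + 2 * EPS))
        model[c] = (prior, probs)
    return model

def nb_predict(model, x):
    """Return the class (-1 or +1) with the higher log-posterior."""
    def log_score(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if xj == 1 else 1.0 - p) for p, xj in zip(probs, x))
    return 1 if log_score(1) >= log_score(-1) else -1

def boost_nb(X, y, rounds):
    """Run T = rounds boosting rounds; return a list of (alpha, model)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train_weighted_nb(X, y, w)
        preds = [nb_predict(model, x) for x in X]
        err = max(sum(wi for wi, p, yi in zip(w, preds, y) if p != yi), 1e-10)
        if err >= 0.5:          # no better than chance: stop early
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        # Misclassified examples gain weight, so the next round focuses on them
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        norm = sum(w)
        w = [wi / norm for wi in w]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all classifiers in the series."""
    vote = sum(alpha * nb_predict(model, x) for alpha, model in ensemble)
    return 1 if vote >= 0 else -1
```

Each round makes one pass over e examples and f attributes, which is consistent with the O(ef) cost per round quoted above.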

  8. MineSet • A KDD tool that combines data access, transformation, classification, and visualization.

  9. KDD Cup 98 • URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html • A classification task – to analyze fund raising mail responses to a non-profit organization • Winners • Urban Science Applications, Inc. with their software GainSmarts. • SAS Institute, Inc. with their software Enterprise Miner. • Quadstone Limited with their software Decisionhouse

  10. GainSmarts • GainSmarts – a feature selection expert system • First step - used Logistic Regression to assign each prospect a probability of donation (Pi). • Second step - used Linear Regression to estimate a conditional donation amount of responding donors (Ai) • Result (<1% error) - Prediction = Pi * Ai
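
The two-step prediction above (Prediction = Pi * Ai) can be illustrated with a short sketch. The coefficient values and feature vectors are hypothetical placeholders; only the structure, a logistic model for the probability of donation multiplied by a linear model for the conditional amount, comes from the slide.

```python
import math

def response_probability(features, weights, bias):
    """Step 1: logistic regression gives P_i, the probability of donating."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def conditional_amount(features, weights, bias):
    """Step 2: linear regression gives A_i, the amount given that the donor responds."""
    return bias + sum(w * x for w, x in zip(weights, features))

def expected_donation(features, logit_params, amount_params):
    """Prediction = P_i * A_i."""
    p = response_probability(features, *logit_params)
    a = conditional_amount(features, *amount_params)
    return p * a
```

For example, a prospect with a response probability of 0.5 and a conditional amount of 20 gets an expected donation of 10.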

  11. Enterprise Miner • A data mining solution that addresses the entire data mining process • SEMMA Process • Sample • Explore • Modify • Model • Assess • Algorithms • Decision tree • Neural network • Regression

  12. Decisionhouse • Decisionhouse – an integrated modelling software suite by Quadstone • Data exploration using visualization modules. • Use Decision trees and Scorecards to model more complex tasks. • Choose the final model by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve).

  13. Results [Chart of profit against number mailed] • Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) • Mail to Everyone Solution ($10,560 in profits with 96,367 mailed) • GainSmarts • SAS/Enterprise Miner • Quadstone/Decisionhouse
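
A profit curve of the kind compared on this slide can be computed by mailing prospects in descending order of model score and accumulating net profit. The sketch below is illustrative only: the scores, donation amounts, and per-piece mailing cost are hypothetical inputs, not values from the competition.

```python
def cumulative_profit(scores, donations, cost_per_mail):
    """Return cumulative net profit after mailing the top-1, top-2, ... prospects.

    scores        -- model score per prospect (higher means mail first)
    donations     -- actual donation per prospect (0 for non-responders)
    cost_per_mail -- cost of one mailed piece (hypothetical parameter)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, running = [], 0.0
    for i in order:
        running += donations[i] - cost_per_mail
        curve.append(running)
    return curve

def best_cutoff(curve):
    """Number of prospects to mail that maximizes cumulative profit."""
    return max(range(len(curve)), key=lambda k: curve[k]) + 1
```

Mailing everyone corresponds to the last point of the curve, while a well-ranked model peaks much earlier, which is the gap between the two reference lines listed above.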

  14. KDD Cup 99 • URL: www.cse.ucsd.edu/users/elkan/kdresults.html • Problem: same data set as KDD Cup 98 • Winners • SAS Institute Inc. with their software Enterprise Miner. • Amdocs with their Information Analysis Environment

  15. Software • SAS – used a two-stage model that includes two multi-layer perceptron (MLP) neural network models. • Amdocs – used its own Information Analysis Environment, which allows modeling of the value and class membership simultaneously. The algorithm used is a hybrid logistic regression model.

  16. KDD Cup 2000 • URL: www.ecn.purdue.edu/KDDCUP/ • Sponsored by: Purdue University, Blue Martini Software

  17. Data Set Data collected from Gazelle.com, a legwear and legcare web retailer • Pre-processed • Training set: 2 months • Test sets: one month • Data collected includes: • Click streams • Order information • Registration form

  18. Problems • The goal – to design models to support web-site personalization and to improve the profitability of the site by increasing customer response. • Questions – when given a set of page views: • Will the visitor view another page on the site or leave? • Which product brand will the visitor view in the remainder of the session? • Characterize heavy spenders • Characterize killer pages • Characterize which product brand a visitor will view in the remainder of the session

  19. Evaluation • Accuracy/score was measured for the two prediction questions with test sets • Insight questions were judged with the help of retail experts from Gazelle and Blue Martini • Created a list of insights from all participants • Each insight was given a weight • Each participant was scored on all insights • Additional factors: • Presentation quality • Correctness
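
The insight scoring described above amounts to a weighted sum. A minimal sketch, assuming each participant receives a score per insight and that insights a participant did not report count as zero; the insight names and weights are hypothetical, and the additional factors (presentation quality, correctness) are left out.

```python
def insight_score(weights, participant_scores):
    """Sum of (insight weight * participant's score on that insight).

    weights            -- dict mapping insight name to its weight
    participant_scores -- dict mapping insight name to the participant's score;
                          insights the participant missed default to 0
    """
    return sum(weight * participant_scores.get(name, 0.0)
               for name, weight in weights.items())
```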

  20. The Winners • Question 1 & 5 Winner: Amdocs • Question 2 & 3 Winner: Salford Systems • Question 4 Winner: e-steam poster

  21. Software (Amdocs) • Exploratory Data Analysis – SAS • Classification Tree – Amdocs Business Insight Tool • Decision tree • Rules Extraction • Modeling • Combining models

  22. Scheme

  23. [Scheme diagram; labels: Rule Generator (1466 rules, 111 continue rules); Best Rule; Hybrid Model; Merged Rules; Main Model; Decision Tree (5 trees built on 34000 cases)]

  24. Sub-models Each model captures a different aspect of the overall behavior in the data. Combining or ensembling the models provides the best prediction results. • Best Rule – chooses the most accurate rule satisfied by each record • Hybrid Model – logistic regression on the rule set plus raw field values, combined to define a score for each record • Merged Rules – logistic regression on the rule set defines the score for each record as a combination of the rules the record satisfies
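
Two of the sub-models above can be sketched in a few lines. The `Rule` structure, field names, and coefficients are hypothetical; the sketch only illustrates the ideas "pick the most accurate satisfied rule" and "logistic regression over the rules a record satisfies", and is not the Amdocs implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    condition: Callable[[dict], bool]  # does the record satisfy the rule?
    accuracy: float                    # accuracy measured on training data
    prediction: str                    # e.g. "continue" or "leave"

def best_rule_predict(record, rules, default=None):
    """Best Rule: use the most accurate rule that the record satisfies."""
    satisfied = [r for r in rules if r.condition(record)]
    if not satisfied:
        return default
    return max(satisfied, key=lambda r: r.accuracy).prediction

def merged_rules_score(record, rules, coefficients, intercept=0.0):
    """Merged Rules: logistic score over indicators of the satisfied rules."""
    z = intercept + sum(c for r, c in zip(rules, coefficients)
                        if r.condition(record))
    return 1.0 / (1.0 + math.exp(-z))
```

A hybrid variant would add raw field values as extra terms in the same logistic sum.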

  25. Software (Salford) • CART – a decision tree tool that automatically searches for and isolates significant patterns and relationships • MARS – a multivariate non-parametric regression procedure • HotSpotDetector • TreeNet

  26. CART • Binary recursive partitioning. • Key elements: • Splitting rules • Brute-force search of all possible splits for all variables • Rank each splitting rule on the basis of a quality-of-split criterion (default: Gini) • Recursion – split until further splitting is impossible or stopped. • Class assignment • Plurality rule • Assign a class to every node, whether it is terminal or not. • Pruning trees – grow the tree fully and prune it back rather than stopping in the middle • Testing – the best sub-tree is the one with the lowest error
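
The splitting step above can be illustrated with a brute-force search that ranks every candidate binary split by weighted Gini impurity. This is a simplified sketch for numeric variables only; the function names are made up for illustration, and recursion, class assignment, and pruning are omitted.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Try every variable and threshold; return (feature, threshold, impurity).

    A split sends rows with X[row][feature] <= threshold to the left child.
    Returns None when no split separates the rows.
    """
    best = None
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for t in values[:-1]:  # the largest value would leave the right child empty
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (j, t, impurity)
    return best
```

Applying `best_split` recursively to the left and right subsets until no split is possible gives the fully grown tree that pruning then cuts back.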

  27. MARS • Automatic variable search • Automatic variable transformation • Automatic limited interaction searches • Variable nesting • Built-in testing regimens and model selection parameters

  28. Insights (Heavy Spenders) • Some of the good insights • Referrers – establish ad policy based on conversion rates, not click-throughs • Not an AOL user – browser window too small for layout • Referring site traffic changed dramatically over time • Came to site from print ad or news, not friends and family • Very high and very low income • Geographic: Northeast U.S. states • Repeat visitors

  29. Insights (Who leaves?) • Some of the good insights • Crawlers and bots accounted for 16% of sessions • Long processing time (> 12 seconds) implies high abandonment • Referring sites: visitors from mycoupons have long sessions, visitors from shopnow.com are prone to exit quickly • Returning visitors' probability of continuing is double • Views of specific products (Oroblue, Levante) cause abandonment • Probability of leaving decreases with page views • Free Gift and Welcome templates on the first three pages encouraged visitors to stay at the site

  30. Insights (Brand view) • Some good insights • Referrer URL is a great predictor: • Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan • mycoupons.com, tripod, deal-finder are referrers for American Essentials • Previous views of a product imply later views

  31. Summary • Data mining requires background knowledge and access to business users • Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight • Web mining is challenging: crawlers/bots, frequent site changes, etc. • KDD Cup is an excellent source for learning state-of-the-art KDD techniques • KDD Cup data is available for research and education

  32. References • Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD. • Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/quadstone/index.html • Urban Science (1998). Urban Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html • Georges, J. & Milley, A. (1999). KDD’99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf • Rosset, S. & Inger A. (1999). KDD-Cup 99: Knowledge Discovery In a Charitable Organization’s Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc

  33. References (Cont.) • Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf • Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt • Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt • Salford Systems white papers: http://www.salford-systems.com/whitepaper.html • Summary talk presented at KDD (2000): http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt
