
The role of Domain Knowledge in a large scale Data Mining Project



Presentation Transcript


  1. The role of Domain Knowledge in a large scale Data Mining Project
     Kopanas I., Avouris N., Daskalaki S.
     University of Patras

  2. Outline of the talk
     • Knowledge in a DM process
     • Case study in a large DM project: Prediction of customer insolvency in Telecommunications business
     • The role of domain expertise (and domain experts) in the process
     • Summary and conclusions
     University of Patras, HCI Group - SETN02

  3. Data Mining
     • Evolution of knowledge-based systems
     • Key partners in Data Mining
        • Data analyst / statistician
        • Knowledge Engineer
        • Domain Expert
     • Role of domain knowledge in Data Mining

  4. DM phases
     (a) Problem definition
     (b) Creating target data set
     (c) Data pre-processing and transformation
     (d) Feature and algorithm selection
     (e) Data Mining
     (f) Evaluation of learned knowledge
     (g) Fielding the knowledge base

  5. Case study: Prediction of Customer Insolvency in Telecommunications business
     Predict which customers will become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly avertive) measures.
     • Problem Objectives
        • Detect as many insolvent customers as possible
        • Minimize false alarms (solvent customers classified as insolvent)

  6. Case study: problem characteristics
     • Significant loss of revenue for the company
     • Human behavior is (generally) unpredictable
     • Insolvency cases are rare compared to non-insolvencies
     • Information can be retrieved only after processing huge amounts of data from several sources

  7. The billing process (domain knowledge)
     [Figure: timeline from June to April marking the billing period, issue of bill, due date, service interruption and nullification]

  8. Target data set definition (semantic value of data)
     • Data from 3 different cities (combination of rural, urban and touristic areas)
     • Types of data
        • Customer data (coded)
        • Data from billing and payments
        • Call detail records (from switching centers)
     • Time span of data studied
        • Cases of collected and uncollected bills (10/99-2/01)
        • Call records (8/99-12/00)

  9. Data pre-processing (knowledge-based reduction of search space)
     [Figure label: DATA WAREHOUSE]
     • Eliminating inexpensive calls (< 0.3 €)
     • Synchronizing data
     • Removing noise
     • Handling missing values
     • Data aggregation by period
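A minimal Python sketch of two of the pre-processing steps above: dropping calls below the 0.3 € threshold and aggregating the remainder by period. The record fields (`customer`, `day`, `cost`, `seconds`), the function name and the 14-day period length are assumptions made for illustration; the talk does not describe the actual warehouse schema.

```python
from collections import defaultdict

MIN_COST_EUR = 0.3   # threshold from the slide: cheaper calls are dropped
PERIOD_DAYS = 14     # assumed period length (2-week periods)

def aggregate_calls(calls):
    """Drop inexpensive calls, then total the rest per (customer, period).

    `calls` is an iterable of dicts with hypothetical keys:
    customer (str), day (int, days since start of study),
    cost (float, EUR), seconds (int, call duration).
    """
    totals = defaultdict(lambda: {"count": 0, "cost": 0.0, "seconds": 0})
    for call in calls:
        if call["cost"] < MIN_COST_EUR:
            continue  # knowledge-based reduction of the search space
        period = call["day"] // PERIOD_DAYS
        bucket = totals[(call["customer"], period)]
        bucket["count"] += 1
        bucket["cost"] += call["cost"]
        bucket["seconds"] += call["seconds"]
    return dict(totals)

# Toy records, invented for this example only.
calls = [
    {"customer": "A", "day": 0, "cost": 0.10, "seconds": 20},
    {"customer": "A", "day": 3, "cost": 1.50, "seconds": 300},
    {"customer": "A", "day": 15, "cost": 0.50, "seconds": 90},
    {"customer": "B", "day": 2, "cost": 0.75, "seconds": 120},
]
summary = aggregate_calls(calls)
```

Each key of `summary` is a `(customer, period)` pair, so a per-customer row of period-level variables can be built from it in a later step.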

  10. Dataset for model fitting
     • Stratified sample of solvent customers
     • Class distribution: 90% solvent customers and 10% insolvent customers
     • 2066 cases in total and 46 variables
        • 2 variables describing the phone account
        • 4 variables describing customer attitude towards previous phone bills
        • 40 variables summarizing customer call habits over fifteen 2-week periods
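The rebalancing to a 90% / 10% class distribution can be sketched as follows. The talk mentions a stratified sample of solvent customers but does not say which strata were used, so this sketch swaps in a plain random sample of the majority class; the function name, parameters and toy identifiers are all hypothetical.

```python
import random

def downsample_majority(solvent, insolvent, minority_share=0.10, seed=0):
    """Keep every insolvent case and randomly sample solvent cases so that
    insolvent cases make up roughly `minority_share` of the result.

    Simple random sampling is an assumption of this sketch, not the
    stratification scheme used in the project.
    """
    n_solvent = round(len(insolvent) * (1 - minority_share) / minority_share)
    n_solvent = min(n_solvent, len(solvent))
    rng = random.Random(seed)
    return rng.sample(solvent, n_solvent), list(insolvent)

# Toy population with a 99% / 1% split.
solvent_ids = list(range(9900))
insolvent_ids = list(range(9900, 10000))
kept_solvent, kept_insolvent = downsample_majority(solvent_ids, insolvent_ids)
share = len(kept_insolvent) / (len(kept_solvent) + len(kept_insolvent))
```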

  11. Data mining
     • Classification problem
     • 2 classes: solvent and insolvent customers
     • Distribution among classes in original dataset: 99% solvent customers and 1% insolvent customers
     • Very small number of insolvencies
     • Very different costs of misclassification between the two classes of customers
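A small illustration of why the 99% / 1% split makes overall correctness a poor guide: a trivial rule that labels every customer solvent is right almost always yet detects no insolvency. The toy labels below are invented for the example.

```python
# 100 toy customers: 1 insolvent, 99 solvent (the 1% / 99% split).
labels = ["insolvent"] * 1 + ["solvent"] * 99

# Trivial "classifier": predict solvent for everyone.
predictions = ["solvent"] * len(labels)

# Share of customers labelled correctly, regardless of class.
overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Number of insolvent customers actually caught by the rule.
detected = sum(p == y == "insolvent" for p, y in zip(predictions, labels))
```

Here `overall` is 0.99 while `detected` is 0, which is why per-class criteria are needed for this problem.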

  12. Criteria for evaluation of prediction
     • The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.
     • The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set.
     Precision > 30% & Accuracy > 70%
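In confusion-matrix terms, the two criteria can be written as below. The symbols are not from the slides and are introduced here as an assumption: with "insolvent" as the positive class, $TP$ counts insolvent customers predicted insolvent, $FP$ counts solvent customers predicted insolvent, and $FN$ counts insolvent customers predicted solvent. The quantity the talk calls accuracy is what is more commonly called recall or sensitivity.

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{accuracy (recall)} = \frac{TP}{TP + FN}
\]
```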

  13. Features selected (most popular in 50 classifiers)
     • TrendUnitsMax
     • TrendDif5
     • TrendDif8
     • Average_Dif
     • Type
     • MaxSec
     • TrendUnits5
     • AverageUnits
     • TrendCount5
     • CountInstallments
     • NewCust
     • Latency
     • Count_X_charges
     • CountResiduals
     • StdDif
     • TrendDif11
     • TrendDif10
     • TrendDif7
     • TrendDif6
     • TrendDif3
     TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx

  14. Deployment of the Knowledge-based system
     • The classifiers are combined (voting algorithms have been used)
     • Heuristics are used as applicability criteria
     • Visualization plays an important role in the design of the system
     • The roles of the user and the knowledge-based system have to be carefully defined
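One common way to combine classifiers by voting is a simple majority vote, sketched below. The talk does not say which voting algorithm was used, so the function name and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    Ties are broken in favour of "insolvent", an arbitrary choice made
    for this sketch; the talk does not describe its tie handling.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return "insolvent" if "insolvent" in winners else winners[0]

# Hypothetical outputs of three classifiers for one customer.
votes = ["insolvent", "solvent", "insolvent"]
decision = majority_vote(votes)
```

A weighted vote, or a rule that requires agreement of all classifiers before raising an alarm, would trade detections against false alarms differently.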

  15. Stepwise Discriminant Analysis

  16. Decision Tree

  17. Neural Network

  18. Evaluation of classifiers (example)
     • Performance over 90% in the majority class and over 83% in the minority class.
     • precision = 113/2844 = 3.9%
     • accuracy = 113/136 = 83%
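The arithmetic in this example can be checked with a few lines of Python. The reading of the three counts follows the definitions on the criteria slide (113 correctly flagged insolvent customers, 2844 customers flagged in total, 136 insolvent customers in the data); the variable names are chosen here for illustration.

```python
true_positives = 113        # insolvent customers flagged as insolvent
predicted_insolvent = 2844  # all customers flagged as insolvent
actual_insolvent = 136      # all insolvent customers in the evaluated data

precision = true_positives / predicted_insolvent
accuracy = true_positives / actual_insolvent  # "accuracy" as defined in the talk

# Counts implied by the figures above.
false_alarms = predicted_insolvent - true_positives  # solvent customers flagged
missed = actual_insolvent - true_positives           # insolvent customers not flagged

print(f"precision = {precision:.2%}")
print(f"accuracy  = {accuracy:.2%}")
print(f"false alarms = {false_alarms}, missed insolvencies = {missed}")
```

The precision evaluates to about 3.97%, which the slide reports as 3.9%.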

  19. Stage | DK (domain knowledge) | Type of DK
     (a) Problem definition | HIGH | Business and domain knowledge, requirements; implicit, tacit knowledge
     (b) Creating target data set | MEDIUM | Attribute relations, semantics of corporate DB
     (c) Data pre-processing | HIGH | Tacit and implicit knowledge for inferences
     (d) Feature and algorithm selection | MEDIUM | Interpretation of the selected features
     (e) Data Mining | LOW | Inspection of discovered knowledge
     (f) Evaluation of learned knowledge | MEDIUM | Definition of criteria related to business objectives
     (g) Fielding the knowledge base | HIGH | Supplementary domain knowledge necessary for implementing the system

  20. Selection of DM tool (Elder 98)

  21. Conclusion
     • Data mining is a knowledge-driven process
     • All stages contribute to the success of the process
     • Domain experts play a significant role in most phases of the process
     • Need for selection of algorithms and techniques that support interpretation of mined knowledge
     • Need for integrated tools and adequate techniques to support involvement of domain experts in the process
