
Knowledge discovery & data mining Towards KD Support Environments

Knowledge discovery & data mining Towards KD Support Environments. Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/. A tutorial @ EDBT2000. Module outline. Data analysis and KD Support Environments Data mining technology trends


Presentation Transcript


  1. Knowledge discovery & data mining Towards KD Support Environments Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000

  2. Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE

  3. Vertical applications • We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques • Fraud detection • Market basket analysis • Customer segmentation

  4. Why are these applications challenging? • Require manipulation and reasoning over knowledge and data at different abstraction levels • conceptual • semantic integration of domain knowledge, expert (business) rules and extracted knowledge • semantic integration of different analysis paradigms • logical/physical • interoperability with external components: DBMS’s, data mining tools, desktop tools • querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools

  5. Why are these applications challenging? • The associated KDD process needs to be carefully specified, tuned and controlled [Figure: the KDD process. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]

  6. Why are these applications challenging? • Still not properly supported by available KDD technology • what is offered: horizontal, customizable toolkits/suites of data mining primitives • what is needed: KD support environments for vertical applications

  7. Data mining vs. traditional SW development process
     Traditional
     • Focus on knowledge transfer, design and coding
     • 30% analysis and design; 70% program design, coding and testing
     • Prototyping: expensive
     • Development process has few loops
     • Maintenance requires human analysis
     Data mining
     • Focus on data selection, representation and search
     • 70% data preparation; 30% model generation and testing
     • Prototyping: cheap
     • Development process is inherently iterative
     • Maintenance requires re-learning the model

  8. From R. Agrawal’s invited lecture @ KDD’99 • “The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.” [Figure: technology adoption curve showing the Chasm between the Early Market and the Mainstream Market]

  9. Is data mining in the chasm? • Perceived to be sophisticated technology, usable only by specialists • Long, expensive projects • Stand-alone, loosely-coupled with data infrastructures • Difficult to infuse into existing mission-critical applications

  10. Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges

  11. Generation 1: data mining tools • ~1980: first generation of DM systems • research-driven tools for single tasks, e.g. • build a decision tree - say C4.5 • find clusters - say Autoclass (Cheeseman 88) • … • Difficult to use more than one tool on the same data – lots of data/metadata transformation • Intended user: a specialist, technically sophisticated.

  12. Generation 2: data mining suites • ~1995: second generation of DM systems • toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g. • SPSS Clementine • IBM Intelligent Miner • SAS Enterprise Miner • SFU DBMiner • Intended user: data analyst – suites require significant knowledge of statistics and databases

  13. Growth of DM tools (source: kdnuggets.com) • From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999.

  14. Generation 3: data mining solutions • Beginning at the end of the 1990s • vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g. • detecting credit card fraud • customer retention • … • Address the entire KDD process, and push results into a front-end application • Intended user: business user – the interfaces hide the data mining complexity

  15. Emerging short-term technology trends • Tighter interoperability by means of standards which facilitate the integration of data mining with other applications: • KDD process: e.g., the Cross-Industry Standard Process for Data Mining model, CRISP-DM (www.crisp-dm.org) • representation of mining models: e.g., PMML, the Predictive Model Markup Language (www.dmg.org) • DB interoperability: the Microsoft OLE DB for Data Mining interface

  16. Approaches in data mining suites • Database-oriented approach • IBM Intelligent Miner • OLAP-based mining • DBMiner - Jiawei Han’s group @ SFU • Machine learning • CART, ID3/C4.5/C5.0, Angoss Knowledge Studio • Statistical approaches • The SAS Institute Enterprise Miner. • Visualization approach: • SGI MineSet, VisDB (Keim et al. 94).

  17. Other approaches in data mining suites • Neural network approach: • Cognos 4thoughts, NeuroRule (Lu et al.’95). • Deductive DB integration: • KnowledgeMiner (Shen et al.’96) • Datasift (Pisa KDD Lab - see refs). • Rough sets, fuzzy sets: • Datalogic/R, 49er • Multi-strategy mining: • INLEN, KDW+, Explora

  18. SFU DBMiner: OLAP-centric mining [Screenshot of the DBMiner interface, with callouts: Warehouse, Workplace, Active Object, Active Object Elements]

  19. IBM Intelligent Miner – DB-centric mining [Screenshot of the Intelligent Miner interface, with callouts: Mining Base Container, Contents Container, Work Area]

  20. IBM – IM architecture

  21. Angoss Knowledge Studio: ML-centric mining [Screenshot of the Knowledge Studio interface, with callouts: Project Outline, Work Area, Additional Visualizations]

  22. KS project outline tool • (Limited) support to the KDD process

  23. Support for data consolidation step
      • DBMiner
        • ODBC databases – SQL + SmartDrives
        • Single database – multiple tables
        • Consolidation of heterogeneous sources unsupported
      • Intelligent Miner
        • DB2 and text – SQL without SmartDrives
        • Multiple databases
        • Consolidation of heterogeneous sources supported
      • Knowledge Studio
        • ODBC databases and text
        • Single table
        • Consolidation of heterogeneous sources unsupported

  24. Support for selection and preprocessing
      • DBMiner
        • SQL only
      • Intelligent Miner
        • SQL + standard and advanced statistical functionalities
      • Knowledge Studio
        • descriptive statistics

  25. Support for data mining step
      • Knowledge Studio
        • Decision trees
        • Clustering
        • Prediction
      • DBMiner
        • Association rules
        • Decision trees
        • Prediction
      • Intelligent Miner
        • Association rules
        • Sequential patterns
        • Clustering
        • Classification
        • Prediction
        • Similar time series

  26. Support for interpretation and evaluation • Predefined interestingness measures • Emphasis on visualization • Limited export capability of analysis results • Gain charts for comparison of predictive models (KS and IM) • Limited model combination capabilities (KS)

  27. Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges

  28. Data Mining Query Languages • A DMQL can provide the ability to support ad-hoc and interactive data mining • Hope: achieve the same effect that SQL had on relational databases. • Various proposals: • DMQL (Han et al 96) • MINE operator (Meo et al 96) • M-SQL (Imielinski et al 99) • query flocks (Tsur et al 98)

  29. MINE operator of (Meo et al 96)

  30. References - DMQL
      • J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996.
      • R. Meo, G. Psaila, and S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pp. 122-133, Bombay, India, Sept. 1996.
      • T. Imielinski and A. Virmani. MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
      • S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query Flocks: A Generalization of Association Rule Mining. In Proc. 1998 ACM-SIGMOD, pp. 1-12, 1998.

  31. Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges

  32. DATASIFT - towards a logic-based KDSE • DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules) • LDL++ syntax: Prolog-like deductive rules • LDL++ semantics: SQL extended with recursion (and more) • Integration of deduction and induction • Employed to systematically develop the methodology for MBA and audit planning • See Pisa KDD Lab references

  33. Our position • A suitable integration of • deductive reasoning (logic database languages) • inductive reasoning (association rules & decision trees) • provides a viable solution to high-level problems in knowledge-intensive data analysis applications

  34. Our goal • Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge • data preparation • knowledge extraction • post-processing and knowledge evaluation • business rules • autofocus datamining

  35. With respect to other DMQL’s • extending logic query languages yields extra expressiveness, needed to bridge the gap between • data mining (e.g., association rule mining) • vertical applications (e.g., market basket analysis)

  36. Architecture - client agent • User interface • Access to business rules and visualization of results through • web browser to control interaction • MS Excel objects (sheets and charts) to represent output of analysis (association rules)

  37. Architecture - server agent • A query engine (mediator) • record previous analyses • Metadata/meta knowledge • interaction with other components • LDL++ server • extended with external calls to DBMSs and to … • Inductive modules • Apriori • classifiers (decision trees) • Coupling with DBMS using the Cache-mine approach • Performance comparable with SQL-based approaches on same mining queries (Giannotti et al 2000)

  38. Deductive rules in LDL++
      • A small database of cash register transactions
          basket(1,fish).  basket(2,bread).  basket(3,bread).
          basket(1,bread). basket(2,milk).   basket(3,orange).
                           basket(2,onions). basket(3,milk).
                           basket(2,fish).
      • E.g.: select transactions involving milk
          milk_basket(T,I) <- basket(T,I), basket(T,milk).
      • Querying ?- milk_basket(T,I)
          milk_basket(2,bread).  milk_basket(3,bread).
          milk_basket(2,milk).   milk_basket(3,orange).
          milk_basket(2,onions). milk_basket(3,milk).
          milk_basket(2,fish).
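To make the semantics of the milk_basket rule concrete, here is a minimal Python sketch of the same self-join over the basket facts. It is not LDL++ and not part of DATASIFT; the names BASKET and milk_basket are chosen here only to mirror the slide.

```python
# Facts from the slide: basket(Transaction, Item).
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def milk_basket(basket):
    """Mimic milk_basket(T,I) <- basket(T,I), basket(T,milk)."""
    # Transactions that contain milk (the second goal of the rule body).
    milk_transactions = {t for (t, item) in basket if item == "milk"}
    # Every (T, I) fact whose transaction is one of those (the first goal).
    return {(t, item) for (t, item) in basket if t in milk_transactions}


result = milk_basket(BASKET)
for t, item in sorted(result):
    print(f"milk_basket({t},{item}).")
```

Transaction 1 contains no milk, so none of its items appear in the answer, which matches the seven facts listed on the slide.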

  39. Aggregates in LDL++
      • A small database of cash register transactions
          basket(1,fish).  basket(2,bread).  basket(3,bread).
          basket(1,bread). basket(2,milk).   basket(3,orange).
                           basket(2,onions). basket(3,milk).
                           basket(2,fish).
      • E.g.: count occurrences of pairs of distinct items in all transactions
          pair(I1,I2,count<T>) <- basket(T,I1), basket(T,I2), I1 ~= I2.
        (count<T> is the aggregate)
      • Querying ?- pair(fish,bread,N)
          pair(fish,bread,2)   (i.e., N=2)
      • Aggregates are the logical interface between deductive and inductive environment.
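The count aggregate above groups by the pair (I1,I2) and counts the distinct transactions T that contain both items. A rough Python equivalent, assuming the same basket facts (the function name pair_counts is illustrative only):

```python
from collections import defaultdict

# Facts from the slide: basket(Transaction, Item).
BASKET = [
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
]


def pair_counts(basket):
    """Mimic pair(I1,I2,count<T>) <- basket(T,I1), basket(T,I2), I1 ~= I2."""
    items_by_tx = defaultdict(set)
    for t, item in basket:
        items_by_tx[t].add(item)

    # For each ordered pair of distinct items, collect the transactions
    # in which both occur; the aggregate is the size of that set.
    tx_by_pair = defaultdict(set)
    for t, items in items_by_tx.items():
        for i1 in items:
            for i2 in items:
                if i1 != i2:
                    tx_by_pair[(i1, i2)].add(t)
    return {pair: len(txs) for pair, txs in tx_by_pair.items()}


counts = pair_counts(BASKET)
print(counts[("fish", "bread")])
```

Pairs that never co-occur, such as orange and onions, simply have no entry, just as the rule derives no pair fact for them.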

  40. Association rules in LDL++
      • A small database of cash register transactions
          basket(1,fish).  basket(2,bread).  basket(3,bread).
          basket(1,bread). basket(2,milk).   basket(3,orange).
                           basket(2,onions). basket(3,milk).
                           basket(2,fish).
      • E.g., compute one-to-one association rules with at least 40% support
          rules(patterns<0.4,0,{I1,I2}>) <- basket(T,I1), basket(T,I2).
      • patterns is the aggregate interfacing the computation of association rules
          patterns<min_supp, min_conf, trans_set>

  41. Association rules in LDL++
      • A small database of cash register transactions
          basket(1,fish).  basket(2,bread).  basket(3,bread).
          basket(1,bread). basket(2,milk).   basket(3,orange).
                           basket(2,onions). basket(3,milk).
                           basket(2,fish).
      • Result of the query ?- rules(X,Y,S,C)
          rules({milk},{bread},0.66,1)      i.e. milk ⇒ bread [0.66, 1]
          rules({bread},{milk},0.66,0.66)
          rules({fish},{bread},0.66,1)
          rules({bread},{fish},0.66,0.66)
      • Same status for data and induced rules
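The four rules above can be checked by hand with a brute-force computation of support and confidence. The sketch below is plain Python, not the Apriori module that DATASIFT calls; the function name one_to_one_rules and its parameters are assumptions made for illustration. Here support is the fraction of transactions containing both items and confidence is that count divided by the number of transactions containing the left-hand item.

```python
from collections import defaultdict
from itertools import permutations

# Facts from the slide: basket(Transaction, Item).
BASKET = [
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
]


def one_to_one_rules(basket, min_supp, min_conf):
    """Enumerate rules left => right over single items (brute force)."""
    items_by_tx = defaultdict(set)
    for t, item in basket:
        items_by_tx[t].add(item)
    n = len(items_by_tx)
    all_items = sorted({item for _, item in basket})

    rules = {}
    for left, right in permutations(all_items, 2):
        left_count = sum(1 for items in items_by_tx.values() if left in items)
        both_count = sum(
            1 for items in items_by_tx.values() if left in items and right in items
        )
        support = both_count / n
        confidence = both_count / left_count
        if support >= min_supp and confidence >= min_conf:
            rules[(left, right)] = (support, confidence)
    return rules


rules = one_to_one_rules(BASKET, min_supp=0.4, min_conf=0.0)
for (left, right), (supp, conf) in sorted(rules.items()):
    print(f"{left} => {right} [{supp:.2f}, {conf:.2f}]")
```

The value 2/3 prints as 0.67 here, whereas the slide truncates it to 0.66. Pairs such as milk and fish co-occur in only one of three transactions, so they fall below the 40% support threshold.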

  42. Reasoning on item hierarchies
      • Which rules survive/decay up/down the item hierarchy?
          rules_at_level(I,pattern<S,C,Itemset>) <-
              itemset_abstraction(I,Tid,Itemset).
          preserved_rules(Left,Right) <-
              rules_at_level(I,Left,Right,_,_),
              rules_at_level(I+1,Left,Right,_,_).

  43. Business rules: reasoning on promotions
      • Which rules are established by a promotion?
          interval(before, -, 3/7/1998).
          interval(promotion, 3/8/1998, 3/30/1998).
          interval(after, 3/31/1998, +).
          established_rules(Left,Right) <-
              not rules_partition(before,Left,Right,_,_),
              rules_partition(promotion,Left,Right,_,_),
              rules_partition(after,Left,Right,_,_).
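Read as set operations, established_rules keeps the rules that hold during and after the promotion but did not hold before it. A small Python sketch of that reading follows; the rule sets in RULES_PARTITION are invented purely for illustration and stand in for the output of mining each time interval separately.

```python
# Hypothetical rule sets, one per interval; each rule is a (left, right) pair.
# In DATASIFT these would come from rules_partition; here they are made up.
RULES_PARTITION = {
    "before": {("milk", "bread"), ("fish", "bread")},
    "promotion": {("milk", "bread"), ("pasta", "wine"), ("chips", "cola")},
    "after": {("milk", "bread"), ("pasta", "wine")},
}


def established_rules(rules_partition):
    """Rules absent before the promotion but present during and after it."""
    during_and_after = rules_partition["promotion"] & rules_partition["after"]
    return during_and_after - rules_partition["before"]


established = established_rules(RULES_PARTITION)
print(sorted(established))
```

In this made-up data the chips and cola rule appears only during the promotion and the milk and bread rule already held before it, so only the pasta and wine rule counts as established.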

  44. Business rules: temporal reasoning • How does rule support change along time?

  45. Decision tree construction in DATASIFT
      • Construct training and test set using rules
          training_set(P,Case_list) <- ...
          test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR) <- ...
      • Construct classifier using external call to C5.0
          tree_rules(Tree_name,P,PF,MC,BO,Rule_list) <-
              training_set(P,Case_list),
              tree_induction(Case_list,PF,MC,BO,Rule_list).
        (tree_induction is the external call; Rule_list is the induced classifier)
      • Parameters
        • pruning factor PF
        • misclassification costs MC
        • boosting BO

  46. Putting decision trees at work
      • Prediction of target variable
          prediction(Tree_name,ID,CAR,Predicted_CAR) <-
              tree_rules(Tree_name,_,_,_,_,Rule_list),
              test_subject(ID,F1,…,F20,_,_,CAR),
              classify(Rule_list,[F1,…,F20],Predicted_CAR).
      • Model evaluation: actual recovery of a classifier (= sum of the recovery of the tuples classified as positive)
          actual_recovery(Tree_name,sum<Actual_Recovery>) <-
              prediction(Tree_name,ID,_,pos),
              test_subject(ID,F1,…,F20,_,Actual_Recovery,_).
        (sum<Actual_Recovery> is the aggregate)
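The actual_recovery measure can be illustrated with a few lines of Python. Everything concrete below is an assumption made for the sake of a runnable example: the test subjects, their two features, and the toy classify function standing in for a tree induced by C5.0.

```python
# Hypothetical test subjects: id -> (features, actual_recovery, true_class).
TEST_SUBJECTS = {
    1: ([0.9, 0.1], 1200.0, "pos"),
    2: ([0.2, 0.7], 0.0, "neg"),
    3: ([0.8, 0.3], 300.0, "pos"),
    4: ([0.6, 0.9], 50.0, "neg"),
}


def classify(features):
    """Toy stand-in for an induced tree: pos if the first feature exceeds 0.5."""
    return "pos" if features[0] > 0.5 else "neg"


def predictions(subjects):
    """Mimic prediction(Tree_name, ID, CAR, Predicted_CAR) for one fixed tree."""
    return {sid: classify(feats) for sid, (feats, _rec, _cls) in subjects.items()}


def actual_recovery(subjects):
    """Sum the actual recovery over the subjects predicted as positive."""
    predicted = predictions(subjects)
    return sum(
        rec for sid, (_feats, rec, _cls) in subjects.items() if predicted[sid] == "pos"
    )


total = actual_recovery(TEST_SUBJECTS)
print(total)
```

Subject 4 is a false positive in this toy data, so its small recovery is still included in the total: the measure follows the classifier's predictions, not the true classes.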

  47. Combining decision trees
      • Model conjunction:
          tree_conjunction(T1,T2,ID,CAR,pos) <-
              prediction(T1,ID,CAR,pos),
              prediction(T2,ID,CAR,pos).
          tree_conjunction(T1,T2,ID,CAR,neg) <-
              test_subject(ID,F1,…,F20,_,_,CAR),
              ~tree_conjunction(T1,T2,ID,CAR,pos).
      • More interesting combinations readily expressible:
        • e.g. meta learning (Chan and Stolfo 93)
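Model conjunction labels a subject positive only when both trees agree on positive, and negative otherwise. A minimal Python sketch, assuming each tree's predictions are already available as a dictionary from subject ID to class (the prediction values below are invented):

```python
# Hypothetical predictions of two trees on the same four test subjects.
PRED_T1 = {1: "pos", 2: "neg", 3: "pos", 4: "pos"}
PRED_T2 = {1: "pos", 2: "pos", 3: "neg", 4: "pos"}


def tree_conjunction(pred1, pred2):
    """Positive only if both classifiers say pos; negative in every other case."""
    return {
        sid: "pos" if pred1[sid] == "pos" and pred2[sid] == "pos" else "neg"
        for sid in pred1
    }


combined = tree_conjunction(PRED_T1, PRED_T2)
print(combined)
```

The second branch of the comprehension plays the role of the negated goal in the LDL++ rule: any subject not derived as positive by the conjunction is labeled negative. The sketch assumes both dictionaries cover the same subject IDs.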

  48. We proposed ... • a KDD methodology for audit planning: • define an audit cost model • monitor training- and test-set construction • assess the quality of a classifier • tune classifier construction to specific policies • and its formalization in a prototype logic-based KDSE, supporting: • integration of deduction and induction • integration of domain and induced knowledge • separation of conceptual and implementation level

  49. Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges

  50. A data mining research agenda • Integration with data warehouse and relational DB • Scalable, parallel/distributed and incremental mining • Data mining query language optimization • Multiple, integrated data mining methods • KDSE and methodological support for vertical appl. • Interactive, exploratory data mining environments • Mining on other forms of data: • spatio-temporal databases • text • multimedia • web
