Knowledge discovery & data mining Towards KD Support Environments

Knowledge discovery & data mining Towards KD Support Environments Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000

Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE

Vertical applications • We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques • Fraud detection • Market basket analysis • Customer segmentation EDBT2000 tutorial - KDSE

Why are these applications challenging? • Require manipulation and reasoning over knowledge and data at different abstraction levels • conceptual • semantic integration of domain knowledge, expert (business) rules and extracted knowledge • semantic integration of different analysis paradigms • logical/physical • interoperability with external components: DBMS’s, data mining tools, desktop tools • querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools EDBT2000 tutorial - KDSE

Why are these applications challenging? Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources • The associated KDD processneeds to be carefully specified, tuned and controlled EDBT2000 tutorial - KDSE

Why are these applications challenging? • Still not properly supported by available KDD technology • what is offered: horizontal, customizable toolkits/suites of data mining primitives • what is needed: KD support environments for vertical applications EDBT2000 tutorial - KDSE

Traditional Focus on knowledge transfer, design and coding 30% - analysis and design 70% - program design, coding and testing Prototyping - expensive Development process has few loops Maintenance requires human analysis Data mining Focus on data selection, representation and search 70% - data preparation 30% - model generation and testing Prototyping - cheap Development process is inherently iterative Maintenance requires re-learning model Datamining vs. traditional Swdevelopment process EDBT2000 tutorial - KDSE

From R. Agrawal’s invited lecture @ KDD’99 Chasm Mainstream Market Early Market The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists. EDBT2000 tutorial - KDSE

Is data mining in the chasm? • Perceived to be sophisticated technology, usable only by specialists • Long, expensive projects • Stand-alone, loosely-coupled with data infrastructures • Difficult to infuse into existing mission-critical applications EDBT2000 tutorial - KDSE

Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE

Generation 1: data mining tools • ~1980: first generation of DM systems • research-driven tools for single tasks, e.g. • build a decision tree - say C4.5 • find clusters - say Autoclass (Cheeseman 88) • … • Difficult to use more than one tool on the same data – lots of data/metadata transformation • Intended user: a specialist, technically sophisticated. EDBT2000 tutorial - KDSE

Generation 2: data mining suites • ~1995: second generation of DM systems • toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g. • SPSS Clementine • IBM Intelligent Miner • SAS Enterprise Miner • SFU DBMiner • Intended user: data analyst – suites require significant knowledge of statistics and databases EDBT2000 tutorial - KDSE

Growth of DM tools (source: kdnuggets.com) • From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999. EDBT2000 tutorial - KDSE

Generation 3: data mining solutions • Beginning end of 1990s • vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g. • detecting credit card fraud • customer retention • … • Address entire KDD process, and push result into a front-end application • Intended user: business user – the interfaces hid the data mining complexity EDBT2000 tutorial - KDSE

Emerging short-term technology trends • Tighter interoperability by means of standards which facilitate the integration of data mining with other applications: • KDD process, e.g. the Cross-Industry Standard Process for Data Mining model (www.crisp-dm.org) • representation of mining models: e.g., the PMML - predictive modeling markup language (www.dmg.org) • DB interoperability: the Microsoft OLE DB for data mining interface EDBT2000 tutorial - KDSE

Approaches in data mining suites • Database-oriented approach • IBM Intelligent Miner • OLAP-based mining • DBMiner - Jiawei Han’s group @ SFU • Machine learning • CART, ID3/C4.5/C5.0, Angoss Knowledge Studio • Statistical approaches • The SAS Institute Enterprise Miner. • Visualization approach: • SGI MineSet, VisDB (Keim et al. 94). EDBT2000 tutorial - KDSE

Other approaches in data mining suites • Neural network approach: • Cognos 4thoughts, NeuroRule (Lu et al.’95). • Deductive DB integration: • KnowlegeMiner (Shen et al.’96) • Datasift (Pisa KDD Lab - see refs). • Rough sets, fuzzy sets: • Datalogic/R, 49er • Multi-strategy mining: • INLEN, KDW+, Explora EDBT2000 tutorial - KDSE

SFU DBMiner: OLAP-centric mining Active Object Elements Warehouse Workplace Active Object EDBT2000 tutorial - KDSE

IBM Intelligent Miner – DB-centric mining Contents Container Mining Base Container Work Area EDBT2000 tutorial - KDSE

IBM – IM architecture EDBT2000 tutorial - KDSE

Angoss Knowledge Studio: ML-centric mining Work Area Project Outline Additional Visualizations EDBT2000 tutorial - KDSE

KS project outline tool • (Limited) support to the KDD process EDBT2000 tutorial - KDSE

Support for data consolidation step • DBMiner • ODBC databases – SQL + SmartDrives • Single database – multiple tables • Consolidation of heterogeneous sources unsupported • Intelligent Miner • DB2 and text – SQL without SmartDrives • Multiple databases • Consolidation of heterogeneous sources supported • Knowledge Studio • ODBC databases and text • Single table • Consolidation of heterogeneous sources unsupported EDBT2000 tutorial - KDSE

Support for selection and preprocessing • DBMiner • SQL only • Intelligent Miner • SQL + standard and advanced statistical functionalities • Knowledge Studio • descriptive statistics EDBT2000 tutorial - KDSE

Support for data mining step • Knowledge Studio • Decision trees • Clustering • Prediction • DBMiner • Association rules • Decision trees • Prediction • Intelligent Miner • Associations rules • Sequential patterns • Clustering • Classification • Prediction • Similar time series EDBT2000 tutorial - KDSE

Support for interpretation and evaluation • Predefined interestingness measures • Emphasis on visualization • Limited export capability of analysis results • Gain charts for comparison of predictive models (KS and IM) • Limited model combination capabilities (KS) EDBT2000 tutorial - KDSE

Data Mining Query Languages • A DMQL can provide the ability to support ad-hoc and interactive data mining • Hope: achieve the same effect that SQL had on relational databases. • Various proposals: • DMQL (Han et al 96) • mine operator (Meo et el 96) • M-SQL (Imielinski et al 99) • query flocks (Tsur et al 98) EDBT2000 tutorial - KDSE

MINE operator of (Meo et al 96) EDBT2000 tutorial - KDSE

References - DMQL • J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996. • R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996. • T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999. • S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12, 1998. EDBT2000 tutorial - KDSE

DATASIFT - towards a logic-based KDSE • DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules) • LDL++ syntax: Prolog-like deductive rules • LDL++ semantics: SQL extended with recursion (and more) • Integration of deduction and induction • Employed to systematically develop the methodology for MBA and audit planning • See Pisa KDD Lab references EDBT2000 tutorial - KDSE

Our position • A suitable integration of • deductive reasoning (logic database languages) • inductive reasoning (association rules & decision trees) • provides a viable solution to high-level problems in knowledge-intensive data analysis applications EDBT2000 tutorial - KDSE

Our goal • Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge • data preparation • knowledge extraction • post-processing and knowledge evaluation • business rules • autofocus datamining EDBT2000 tutorial - KDSE

With respect to other DMQL’s • extending logic query languages yields extra expressiveness, needed to bridge the gap between • data mining (e.g., association rule mining) • vertical applications (e.g., market basket analysis) EDBT2000 tutorial - KDSE

Architecture - client agent • User interface • Access to business rules and visualization of results through • web browser to control interaction • MS Excel objects (sheets and charts) to represent output of analysis (association rules) EDBT2000 tutorial - KDSE

Architecture - server agent • A query engine (mediator) • record previous analyses • Metadata/meta knowledge • interaction with other components • LDL++ server • extended with external calls to DBMSs and to … • Inductive modules • Apriori • classifiers (decision trees) • Coupling with DBMS using the Cache-mine approach • Performance comparable with SQL-based approaches on same mining queries (Giannotti at el 2000) EDBT2000 tutorial - KDSE

Deductive rules in LDL++ • A small database of cash register transactions • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g.: select transactions involving milk milk_basket(T,I)  basket(T,I),basket(T,milk). • Querying?- milk_basket(T,I) milk_basket(2,bread). milk_basket(3,bread). milk_ basket(2,milk). milk_basket(3,orange). milk_ basket(2,onions). milk_basket(3,milk). milk_ basket(2,fish). EDBT2000 tutorial - KDSE

Aggregates in LDL++ • A small database of cash register transactions • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g.: count occurrences of pairs of distinct items in all transactions pair(I1,I2,count<T>) basket(T,I1),basket(T,I2),I1I2. aggregate • Querying?- pair(fish,bread,N) pair(fish,bread,2)(i.e., N=2) • Aggregates are the logical interface between deductive and inductive environment. EDBT2000 tutorial - KDSE

Association rules in LDL++ • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g., compute one-to-one association rules with at least 40% support rules(patterns<0.4,0,{I1,I2}>)basket(T,I1),basket(T,I2). patterns • is the aggregate interfacing the computation of association rules • patterns<min_supp, min_conf, trans_set> EDBT2000 tutorial - KDSE

Association rules in LDL++ • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • Result of the query ?- rules(X,Y,S,C) rules({milk},{bread},0.66,1) i.e. milk  bread [0.66,1] rules({bread},{milk},0.66,0.66) rules({fish},{bread},0.66,1) rules({bread},{fish},0.66,0.66) • Same status for data and induced rules EDBT2000 tutorial - KDSE

Reasoning on item hierarchies • Which rules survive/decay up/down the item hierarchy? rules_at_level(I,pattern<S,C,Itemset>)  itemset_abstraction(I,Tid,Itemset). preserved_rules(Left,Right)  rules_at_level(I,Left,Right,_,_), rules_at_level(I+1,Left,Right,_,_). EDBT2000 tutorial - KDSE

Business rules: reasoning on promotions • Which rules are established by a promotion? interval(before, -, 3/7/1998). interval(promotion, 3/8/1998, 3/30/1998). interval(after, 3/31/1998, +). established_rules(Left, Right)  not rules_partition(before, Left, Right, _, _), rules_partition(promotion, Left, Right, _, _), rules_partition(after, Left, Right, _, _). EDBT2000 tutorial - KDSE

Business rules: temporal reasoning • How does rule support change along time? EDBT2000 tutorial - KDSE

Decision tree construction in DATASIFT • construct training and test set using rules training_set(P,Case_list)  ... test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR)  ... • construct classifier using external call to C5.0 tree_rules(Tree_name,P,PF,MC,BO,Rule_list) training_set(P,Case_list),tree_induction(Case_list,PF,MC,BO,Rule_list). • parameters • pruning factor PF • misclassification costs MC • boosting BO external call induced classifier EDBT2000 tutorial - KDSE

Putting decision trees at work • prediction of target variable prediction(Tree_name,ID,CAR,Predicted_CAR) tree_rules(Tree_name, _ ,_ , _ , Rule_list),test_subject(ID, F1, …, F20, _, _, CAR),classify(Rule_list ,[F1, …, F20], Predicted_CAR). • Model evaluation: actual recovery of a classifier (=sum recovery of tuples classified as positive) actual_recovery(Tree_name,sum<Actual_Recovery>) prediction(Tree_name, ID, _ , pos),test_subject(ID, F1, …, F20, _,Actual_Recovery, _). aggregate EDBT2000 tutorial - KDSE

Combining decision trees • Model conjunction: tree_conjunction(T1,T2,ID,CAR,pos) prediction(T1, ID, CAR, pos),prediction(T2, ID, CAR, pos). tree_conjunction (T1, T2, ID, CAR, neg)  test_subject(ID, F1, …, F20, _, _, CAR), ~ tree_conjunction(T1, T2, ID, CAR, pos). • More interesting combinations readily expressible: • e.g. meta learning (Chan and Stolfo 93) EDBT2000 tutorial - KDSE

We proposed ... • a KDD methodology for audit planning: • define an audit cost model • monitor training- and test-set construction • assess the quality of a classifier • tune classifier construction to specific policies • and its formalization in a prototype logic-based KDSE, supporting: • integration of deduction and induction • integration of domain and induced knowledge • separation of conceptual and implementation level EDBT2000 tutorial - KDSE

A data mining research agenda • Integration with data warehouse and relational DB • Scalable, parallel/distributed and incremental mining • Data mining query language optimization • Multiple, integrated data mining methods • KDSE and methodological support for vertical appl. • Interactive, exploratory data mining environments • Mining on other forms of data: • spatio-temporal databases • text • multimedia • web EDBT2000 tutorial - KDSE

Knowledge discovery & data mining Towards KD Support Environments