630 likes | 644 Vues
Tutorial on challenges and trends in developing data mining solutions for vertical applications like fraud detection, market analysis, and customer segmentation. Covers data consolidation, process optimization, and evolving technology trends.
E N D
Knowledge discovery & data mining Towards KD Support Environments Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000
Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE
Vertical applications • We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques • Fraud detection • Market basket analysis • Customer segmentation EDBT2000 tutorial - KDSE
Why are these applications challenging? • Require manipulation and reasoning over knowledge and data at different abstraction levels • conceptual • semantic integration of domain knowledge, expert (business) rules and extracted knowledge • semantic integration of different analysis paradigms • logical/physical • interoperability with external components: DBMS’s, data mining tools, desktop tools • querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools EDBT2000 tutorial - KDSE
Why are these applications challenging? Interpretation and Evaluation Data Mining Knowledge Selection and Preprocessing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources • The associated KDD processneeds to be carefully specified, tuned and controlled EDBT2000 tutorial - KDSE
Why are these applications challenging? • Still not properly supported by available KDD technology • what is offered: horizontal, customizable toolkits/suites of data mining primitives • what is needed: KD support environments for vertical applications EDBT2000 tutorial - KDSE
Traditional Focus on knowledge transfer, design and coding 30% - analysis and design 70% - program design, coding and testing Prototyping - expensive Development process has few loops Maintenance requires human analysis Data mining Focus on data selection, representation and search 70% - data preparation 30% - model generation and testing Prototyping - cheap Development process is inherently iterative Maintenance requires re-learning model Datamining vs. traditional Swdevelopment process EDBT2000 tutorial - KDSE
From R. Agrawal’s invited lecture @ KDD’99 Chasm Mainstream Market Early Market The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists. EDBT2000 tutorial - KDSE
Is data mining in the chasm? • Perceived to be sophisticated technology, usable only by specialists • Long, expensive projects • Stand-alone, loosely-coupled with data infrastructures • Difficult to infuse into existing mission-critical applications EDBT2000 tutorial - KDSE
Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE
Generation 1: data mining tools • ~1980: first generation of DM systems • research-driven tools for single tasks, e.g. • build a decision tree - say C4.5 • find clusters - say Autoclass (Cheeseman 88) • … • Difficult to use more than one tool on the same data – lots of data/metadata transformation • Intended user: a specialist, technically sophisticated. EDBT2000 tutorial - KDSE
Generation 2: data mining suites • ~1995: second generation of DM systems • toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g. • SPSS Clementine • IBM Intelligent Miner • SAS Enterprise Miner • SFU DBMiner • Intended user: data analyst – suites require significant knowledge of statistics and databases EDBT2000 tutorial - KDSE
Growth of DM tools (source: kdnuggets.com) • From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999. EDBT2000 tutorial - KDSE
Generation 3: data mining solutions • Beginning end of 1990s • vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g. • detecting credit card fraud • customer retention • … • Address entire KDD process, and push result into a front-end application • Intended user: business user – the interfaces hid the data mining complexity EDBT2000 tutorial - KDSE
Emerging short-term technology trends • Tighter interoperability by means of standards which facilitate the integration of data mining with other applications: • KDD process, e.g. the Cross-Industry Standard Process for Data Mining model (www.crisp-dm.org) • representation of mining models: e.g., the PMML - predictive modeling markup language (www.dmg.org) • DB interoperability: the Microsoft OLE DB for data mining interface EDBT2000 tutorial - KDSE
Approaches in data mining suites • Database-oriented approach • IBM Intelligent Miner • OLAP-based mining • DBMiner - Jiawei Han’s group @ SFU • Machine learning • CART, ID3/C4.5/C5.0, Angoss Knowledge Studio • Statistical approaches • The SAS Institute Enterprise Miner. • Visualization approach: • SGI MineSet, VisDB (Keim et al. 94). EDBT2000 tutorial - KDSE
Other approaches in data mining suites • Neural network approach: • Cognos 4thoughts, NeuroRule (Lu et al.’95). • Deductive DB integration: • KnowlegeMiner (Shen et al.’96) • Datasift (Pisa KDD Lab - see refs). • Rough sets, fuzzy sets: • Datalogic/R, 49er • Multi-strategy mining: • INLEN, KDW+, Explora EDBT2000 tutorial - KDSE
SFU DBMiner: OLAP-centric mining Active Object Elements Warehouse Workplace Active Object EDBT2000 tutorial - KDSE
IBM Intelligent Miner – DB-centric mining Contents Container Mining Base Container Work Area EDBT2000 tutorial - KDSE
IBM – IM architecture EDBT2000 tutorial - KDSE
Angoss Knowledge Studio: ML-centric mining Work Area Project Outline Additional Visualizations EDBT2000 tutorial - KDSE
KS project outline tool • (Limited) support to the KDD process EDBT2000 tutorial - KDSE
Support for data consolidation step • DBMiner • ODBC databases – SQL + SmartDrives • Single database – multiple tables • Consolidation of heterogeneous sources unsupported • Intelligent Miner • DB2 and text – SQL without SmartDrives • Multiple databases • Consolidation of heterogeneous sources supported • Knowledge Studio • ODBC databases and text • Single table • Consolidation of heterogeneous sources unsupported EDBT2000 tutorial - KDSE
Support for selection and preprocessing • DBMiner • SQL only • Intelligent Miner • SQL + standard and advanced statistical functionalities • Knowledge Studio • descriptive statistics EDBT2000 tutorial - KDSE
Support for data mining step • Knowledge Studio • Decision trees • Clustering • Prediction • DBMiner • Association rules • Decision trees • Prediction • Intelligent Miner • Associations rules • Sequential patterns • Clustering • Classification • Prediction • Similar time series EDBT2000 tutorial - KDSE
Support for interpretation and evaluation • Predefined interestingness measures • Emphasis on visualization • Limited export capability of analysis results • Gain charts for comparison of predictive models (KS and IM) • Limited model combination capabilities (KS) EDBT2000 tutorial - KDSE
Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE
Data Mining Query Languages • A DMQL can provide the ability to support ad-hoc and interactive data mining • Hope: achieve the same effect that SQL had on relational databases. • Various proposals: • DMQL (Han et al 96) • mine operator (Meo et el 96) • M-SQL (Imielinski et al 99) • query flocks (Tsur et al 98) EDBT2000 tutorial - KDSE
MINE operator of (Meo et al 96) EDBT2000 tutorial - KDSE
References - DMQL • J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996. • R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996. • T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999. • S. Tsur, J. Ulman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, p. 1-12, 1998. EDBT2000 tutorial - KDSE
Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE
DATASIFT - towards a logic-based KDSE • DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules) • LDL++ syntax: Prolog-like deductive rules • LDL++ semantics: SQL extended with recursion (and more) • Integration of deduction and induction • Employed to systematically develop the methodology for MBA and audit planning • See Pisa KDD Lab references EDBT2000 tutorial - KDSE
Our position • A suitable integration of • deductive reasoning (logic database languages) • inductive reasoning (association rules & decision trees) • provides a viable solution to high-level problems in knowledge-intensive data analysis applications EDBT2000 tutorial - KDSE
Our goal • Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge • data preparation • knowledge extraction • post-processing and knowledge evaluation • business rules • autofocus datamining EDBT2000 tutorial - KDSE
With respect to other DMQL’s • extending logic query languages yields extra expressiveness, needed to bridge the gap between • data mining (e.g., association rule mining) • vertical applications (e.g., market basket analysis) EDBT2000 tutorial - KDSE
Architecture - client agent • User interface • Access to business rules and visualization of results through • web browser to control interaction • MS Excel objects (sheets and charts) to represent output of analysis (association rules) EDBT2000 tutorial - KDSE
Architecture - server agent • A query engine (mediator) • record previous analyses • Metadata/meta knowledge • interaction with other components • LDL++ server • extended with external calls to DBMSs and to … • Inductive modules • Apriori • classifiers (decision trees) • Coupling with DBMS using the Cache-mine approach • Performance comparable with SQL-based approaches on same mining queries (Giannotti at el 2000) EDBT2000 tutorial - KDSE
Deductive rules in LDL++ • A small database of cash register transactions • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g.: select transactions involving milk milk_basket(T,I) basket(T,I),basket(T,milk). • Querying?- milk_basket(T,I) milk_basket(2,bread). milk_basket(3,bread). milk_ basket(2,milk). milk_basket(3,orange). milk_ basket(2,onions). milk_basket(3,milk). milk_ basket(2,fish). EDBT2000 tutorial - KDSE
Aggregates in LDL++ • A small database of cash register transactions • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g.: count occurrences of pairs of distinct items in all transactions pair(I1,I2,count<T>) basket(T,I1),basket(T,I2),I1I2. aggregate • Querying?- pair(fish,bread,N) pair(fish,bread,2)(i.e., N=2) • Aggregates are the logical interface between deductive and inductive environment. EDBT2000 tutorial - KDSE
Association rules in LDL++ • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • E.g., compute one-to-one association rules with at least 40% support rules(patterns<0.4,0,{I1,I2}>)basket(T,I1),basket(T,I2). patterns • is the aggregate interfacing the computation of association rules • patterns<min_supp, min_conf, trans_set> EDBT2000 tutorial - KDSE
Association rules in LDL++ • basket(1,fish). basket(2,bread). basket(3,bread). • basket(1,bread). basket(2,milk).basket(3,orange). • basket(2,onions). basket(3,milk). • basket(2,fish). • Result of the query ?- rules(X,Y,S,C) rules({milk},{bread},0.66,1) i.e. milk bread [0.66,1] rules({bread},{milk},0.66,0.66) rules({fish},{bread},0.66,1) rules({bread},{fish},0.66,0.66) • Same status for data and induced rules EDBT2000 tutorial - KDSE
Reasoning on item hierarchies • Which rules survive/decay up/down the item hierarchy? rules_at_level(I,pattern<S,C,Itemset>) itemset_abstraction(I,Tid,Itemset). preserved_rules(Left,Right) rules_at_level(I,Left,Right,_,_), rules_at_level(I+1,Left,Right,_,_). EDBT2000 tutorial - KDSE
Business rules: reasoning on promotions • Which rules are established by a promotion? interval(before, -, 3/7/1998). interval(promotion, 3/8/1998, 3/30/1998). interval(after, 3/31/1998, +). established_rules(Left, Right) not rules_partition(before, Left, Right, _, _), rules_partition(promotion, Left, Right, _, _), rules_partition(after, Left, Right, _, _). EDBT2000 tutorial - KDSE
Business rules: temporal reasoning • How does rule support change along time? EDBT2000 tutorial - KDSE
Decision tree construction in DATASIFT • construct training and test set using rules training_set(P,Case_list) ... test_tuple(ID,F1,...,F20,Rec,Act_rec,CAR) ... • construct classifier using external call to C5.0 tree_rules(Tree_name,P,PF,MC,BO,Rule_list) training_set(P,Case_list),tree_induction(Case_list,PF,MC,BO,Rule_list). • parameters • pruning factor PF • misclassification costs MC • boosting BO external call induced classifier EDBT2000 tutorial - KDSE
Putting decision trees at work • prediction of target variable prediction(Tree_name,ID,CAR,Predicted_CAR) tree_rules(Tree_name, _ ,_ , _ , Rule_list),test_subject(ID, F1, …, F20, _, _, CAR),classify(Rule_list ,[F1, …, F20], Predicted_CAR). • Model evaluation: actual recovery of a classifier (=sum recovery of tuples classified as positive) actual_recovery(Tree_name,sum<Actual_Recovery>) prediction(Tree_name, ID, _ , pos),test_subject(ID, F1, …, F20, _,Actual_Recovery, _). aggregate EDBT2000 tutorial - KDSE
Combining decision trees • Model conjunction: tree_conjunction(T1,T2,ID,CAR,pos) prediction(T1, ID, CAR, pos),prediction(T2, ID, CAR, pos). tree_conjunction (T1, T2, ID, CAR, neg) test_subject(ID, F1, …, F20, _, _, CAR), ~ tree_conjunction(T1, T2, ID, CAR, pos). • More interesting combinations readily expressible: • e.g. meta learning (Chan and Stolfo 93) EDBT2000 tutorial - KDSE
We proposed ... • a KDD methodology for audit planning: • define an audit cost model • monitor training- and test-set construction • assess the quality of a classifier • tune classifier construction to specific policies • and its formalization in a prototype logic-based KDSE, supporting: • integration of deduction and induction • integration of domain and induced knowledge • separation of conceptual and implementation level EDBT2000 tutorial - KDSE
Module outline • Data analysis and KD Support Environments • Data mining technology trends • from tools … • … to suites … • … to solutions • Towards data mining query languages • DATASIFT: a logic-based KDSE • Future research challenges EDBT2000 tutorial - KDSE
A data mining research agenda • Integration with data warehouse and relational DB • Scalable, parallel/distributed and incremental mining • Data mining query language optimization • Multiple, integrated data mining methods • KDSE and methodological support for vertical appl. • Interactive, exploratory data mining environments • Mining on other forms of data: • spatio-temporal databases • text • multimedia • web EDBT2000 tutorial - KDSE