
Issues in Data Mining Infrastructure


Presentation Transcript


  1. Issues in Data Mining Infrastructure Authors: Nemanja Jovanovic, nemko@acm.org Valentina Milenkovic, tina@eunet.yu Voislav Galic, vgalic@bitsyu.net Dusan Zecevic, zdusan@titan.etf.bg.ac.yu Sonja Tica, ticaz@eunet.yu Prof. Dr. Dusan Tosic, dtosic@matf.bg.ac.yu Prof. Dr. Veljko Milutinovic, vm@etf.bg.ac.yu

  2. Data Mining in a Nutshell • Uncovering the hidden knowledge • Huge NP-complete search space • Multidimensional interface NOTICE: All trademarks and service marks mentioned in this document are marks of their respective owners. Furthermore, the CRISP-DM consortium (NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands)) has permitted presentation of its process model.

  3. A Problem … You are a marketing manager for a cellular phone company • Problem: churn is too high • Turnover (after the contract expires) is 40% • Customers receive a free phone (cost: $125) • You pay a sales commission of $250 per contract • Giving a new telephone to everyone whose contract is expiring is expensive • Bringing back a customer after quitting is both difficult and expensive

  4. … A Solution • Three months before a contract expires, predict which customers will leave • If you want to keep a customer that is predicted to churn, offer them a new phone • The ones that are not predicted to churn need no attention • If you don’t want to keep the customer, do nothing • How can you predict future behavior? • Tarot Cards? • Magic Ball? • Data Mining?

  5. Still Skeptical?

  6. The Definition The automated extraction of predictive information from (large) databases • Automated • Extraction • Predictive • Databases

  7. History of Data Mining

  8. Repetition in Solar Activity • 1613 – Galileo Galilei • 1859 – Heinrich Schwabe

  9. The Return of Halley's Comet • Edmund Halley (1656–1742) [Timeline: recorded returns in 239 BC, 1531, 1607, 1682, 1910 and 1986; next predicted return 2061 (???)]

  10. Data Mining is Not • Data warehousing • Ad-hoc query/reporting • Online Analytical Processing (OLAP) • Data visualization

  11. Data Mining is • Automated extraction of predictive information from various data sources • Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

  12. Data Mining can • Answer questions that were too time-consuming to resolve in the past • Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions

  13. Data Mining Models

  14. Neural Networks • Characterize processed data with a single numeric value • Efficient modeling of large and complex problems • Based on biological structures: neurons • A network consists of neurons grouped into layers

  15. Neuron Functionality [Diagram: inputs I1 … In with weights W1 … Wn feed the activation function f] Output = f(W1·I1, W2·I2, …, Wn·In)
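To make the formula concrete, here is a minimal Python sketch of a single neuron. The sigmoid activation and all values are assumptions for illustration; the slide leaves f unspecified.

    import math

    def neuron_output(inputs, weights):
        # Weighted sum of the inputs, passed through the activation function f.
        # Sigmoid is an assumed choice of f here, not prescribed by the slide.
        s = sum(w * i for w, i in zip(weights, inputs))
        return 1.0 / (1.0 + math.exp(-s))

    # Three inputs I1..I3 with weights W1..W3 (illustrative values):
    print(neuron_output([0.5, 0.1, 0.9], [0.4, -0.2, 0.7]))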

  16. Training Neural Networks

  17. Neural Networks • Once trained, neural networks can efficiently estimate the value of an output variable for a given input • Neurons and network topology are essential • Usually used for prediction or regression problem types • Difficult to understand • Data pre-processing often required

  18. Decision Trees • A way of representing a series of rules that lead to a class or value • Iterative splitting of data into discrete groups, maximizing the distance between them at each split • Classification trees and regression trees • Univariate splits and multivariate splits • Unlimited growth and stopping rules • CHAID, CART, QUEST, C5.0

  19. Decision Trees [Tree diagram: the root splits on Balance > 10 vs. Balance <= 10; the branches split further on Age <= 32 vs. Age > 32 and on Married = NO vs. Married = YES]
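Such a tree reads directly as nested if/else rules. A sketch in Python follows; the branch layout and the leaf class labels are hypothetical, since the original diagram's leaves did not survive extraction.

    def classify(balance, age, married):
        # Splits mirror the diagram above; the leaf labels ("low risk" etc.)
        # are hypothetical placeholders for the original classes.
        if balance > 10:
            if age <= 32:
                return "low risk"
            return "medium risk"
        if married == "NO":
            return "high risk"
        return "medium risk"

    print(classify(balance=12, age=30, married="YES"))  # -> "low risk"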

  20. Decision Trees

  21. Rule Induction • Method of deriving a set of rules to classify cases • Creates independent rules that are unlikely to form a tree • Rules may not cover all possible situations • Rules may sometimes conflict in a prediction

  22. Rule Induction • If balance > 100.000 then confidence = HIGH & weight = 1.7 • If balance > 25.000 and status = married then confidence = HIGH & weight = 2.3 • If balance < 40.000 then confidence = LOW & weight = 1.9
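Note that these rules can conflict: a married customer with a balance of 30.000 (i.e., 30,000; the slide uses European thousands separators) fires both a HIGH and a LOW rule. A minimal Python sketch below resolves such conflicts by summing the rule weights per predicted class; the weighted-vote scheme is an assumption, not something the slide prescribes.

    rules = [
        # (predicate, predicted confidence class, rule weight)
        (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
        (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
        (lambda c: c["balance"] < 40_000, "LOW", 1.9),
    ]

    def predict(case):
        # Fire every matching rule; resolve conflicts by summed weight.
        votes = {}
        for test, label, weight in rules:
            if test(case):
                votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get) if votes else None

    # Both a HIGH rule (weight 2.3) and a LOW rule (weight 1.9) fire here:
    print(predict({"balance": 30_000, "status": "married"}))  # -> "HIGH"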

  23. K-nearest Neighbor and Memory-Based Reasoning (MBR) • Uses knowledge of previously solved, similar problems to solve the new problem • Assigns the new case to the class to which most of its k "neighbors" belong • First step: finding a suitable measure for the distance between attributes in the data • + Easy handling of non-standard data types • − Huge models
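A minimal k-nearest-neighbor classifier in Python, assuming numeric feature vectors and Euclidean distance as the distance measure (the slide leaves the choice of measure open):

    import math
    from collections import Counter

    def knn_classify(query, training, k=3):
        # training: list of (feature_vector, class_label) pairs.
        # Assign the majority class among the k nearest training examples.
        nearest = sorted(training, key=lambda ex: math.dist(query, ex[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    data = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
    print(knn_classify((2, 2), data))  # -> "A"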

  24. K-nearest Neighbor and Memory-Based Reasoning (MBR)

  25. Data Mining Algorithms • Many other available models and algorithms • Logistic regression • Discriminant analysis • Generalized Additive Models (GAM) • Genetic algorithms • The Apriori algorithm • Etc. • Many application-specific variations of known models • The final implementation usually involves several techniques

  26. The Apriori Algorithm • The task – mining association rules by finding large itemsets and translating them to the corresponding association rules • A ⇒ B, or A1 ∧ A2 ∧ … ∧ Am ⇒ B1 ∧ B2 ∧ … ∧ Bn, where A ∩ B = ∅ • The terminology • Confidence • Support • k-itemset – a set of k items • Large itemsets – the large itemset {A, B} corresponds to the following rules (implications): A ⇒ B and B ⇒ A
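The two measures translate directly into code. A short sketch, using the four-transaction example database that the slides below work through (function names are illustrative):

    def support(itemset, transactions):
        # Fraction of transactions containing every item in `itemset`.
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(A, B, transactions):
        # confidence(A => B) = support(A united with B) / support(A)
        return support(A | B, transactions) / support(A, transactions)

    D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(support({"B", "E"}, D))       # 0.75
    print(confidence({"B"}, {"E"}, D))  # 1.0 -- every transaction with B also has E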

  27. The Apriori Algorithm • The ⊗ operator definition • n = 1: S2 = S1 ⊗ S1 = {{A}, {B}, {C}} ⊗ {{A}, {B}, {C}} = {{A, B}, {A, C}, {B, C}} • n = k: Sk+1 = Sk ⊗ Sk = {X ∪ Y | X, Y ∈ Sk, |X ∩ Y| = k − 1} • X and Y must have the same number of elements, and must have exactly k − 1 identical elements • Every k-element subset of any resulting set element (an element is actually a (k+1)-element set) has to belong to the original set of itemsets
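A direct Python sketch of the ⊗ operator, implementing both the join (union of sets sharing k − 1 items) and the subset prune described in the last bullet; `join` is an illustrative name:

    from itertools import combinations

    def join(itemsets, k):
        # Sk (x) Sk: union pairs of k-itemsets that share exactly k-1 items,
        # then keep only results whose every k-subset is in the original set.
        result = set()
        for x, y in combinations(itemsets, 2):
            u = x | y
            if len(u) == k + 1 and all(frozenset(s) in itemsets
                                       for s in combinations(u, k)):
                result.add(frozenset(u))
        return result

    S1 = {frozenset({i}) for i in "ABC"}
    print(join(S1, 1))  # the three 2-itemsets {A,B}, {A,C}, {B,C}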

  28. The Apriori Algorithm • Example – the transaction database D: T1 = {A, C, D}, T2 = {B, C, E}, T3 = {A, B, C, E}, T4 = {B, E}

  29. The Apriori Algorithm • Step 1 – generate a candidate set of 1-itemsets, C1 • Every possible 1-element set from the database is potentially a large itemset, because we don't know the number of its appearances in the database in advance (a priori) • The task amounts to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set • C1 = {{A}, {B}, {C}, {D}, {E}} • Now we scan the entire database to count the number of appearances of each of these elements (i.e., one-element sets)

  30. The Apriori Algorithm • Scanning the entire database gives the number of appearances of each one-element set: A appears 2 times, B 3 times, C 3 times, D 1 time, E 3 times

  31. The Apriori Algorithm • Step 2 – generate a set of large 1-itemsets, L1 • Each element in C1 whose support exceeds some adopted minimum support (for example, 50%) becomes a member of L1 • L1 = {{A}, {B}, {C}, {E}}, and we can omit D in further steps (if D doesn't have enough support alone, there is no way it could satisfy the requested support in combination with some other element(s))

  32. The Apriori Algorithm • Step 3 – generate a candidate set of large 2-itemsets, C2 • C2 = L1 ⊗ L1 = {{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}} • Count the corresponding appearances • Step 4 – generate a set of large 2-itemsets, L2 • Eliminate the candidates without minimum support • L2 = {{A, C}, {B, C}, {B, E}, {C, E}}

  33. The Apriori Algorithm • Step 5 (C3) • C3 = L2 ⊗ L2 = {{B, C, E}} • Why not {A, B, C} and {A, C, E}? Because their 2-element subsets {A, B} and {A, E} are not elements of the large 2-itemset set L2 (the calculation is made according to the ⊗ operator definition) • Step 6 (L3) • L3 = {{B, C, E}}, since {B, C, E} satisfies the required support of 50% (two appearances) • There can be no further steps in this particular case, because L3 ⊗ L3 = ∅ • Answer = L1 ∪ L2 ∪ L3

  34. The Apriori Algorithm

    L1 = {large 1-itemsets};
    for (k = 2; Lk−1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk−1);
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);
            forall candidates c ∈ Ct do
                c.count++;
        end;
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end;
    Answer = ∪k Lk;
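The pseudocode translates almost line for line into Python. A sketch, with apriori-gen inlined as join-plus-prune and `minsup` taken as an absolute count (both choices are assumptions), run on the example database from slide 28:

    from itertools import combinations

    def apriori(transactions, minsup):
        # L1: one database scan counts every single item.
        items = {i for t in transactions for i in t}
        L = {1: {frozenset({i}) for i in items
                 if sum(1 for t in transactions if i in t) >= minsup}}
        k = 1
        while L[k]:
            # apriori-gen: join Lk with itself, then prune by k-subsets.
            Ck = set()
            for x, y in combinations(L[k], 2):
                u = x | y
                if len(u) == k + 1 and all(frozenset(s) in L[k]
                                           for s in combinations(u, k)):
                    Ck.add(frozenset(u))
            # One scan of D counts all candidates; keep the frequent ones.
            L[k + 1] = {c for c in Ck
                        if sum(1 for t in transactions if c <= t) >= minsup}
            k += 1
        del L[k]          # the last generated level is empty
        return L          # Answer = union of all Lk

    D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for size, sets in sorted(apriori(D, minsup=2).items()):
        print(size, sorted(sorted(s) for s in sets))

Running this reproduces the worked example: L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, L3 = {BCE}.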

  35. The Apriori Algorithm • Enhancements to the basic algorithm • Scan reduction • The most time-consuming operation in the Apriori algorithm is the database scan; it is originally performed after each candidate-set generation, to determine the frequency of each candidate in the database • Scan-number reduction – counting candidates of multiple sizes in one pass • Rather than counting only candidates of size k in the kth pass, we can also calculate the candidates C′k+1, where C′k+1 is generated from Ck (instead of Lk), using the ⊗ operator

  36. The Apriori Algorithm • Compare: C′k+1 = Ck ⊗ Ck vs. Ck+1 = Lk ⊗ Lk • Note that C′k+1 ⊇ Ck+1 • This variation can pay off in later passes, when the cost of counting and keeping in memory the additional C′k+1 − Ck+1 candidates becomes less than the cost of scanning the database • There has to be enough space in main memory for both Ck and C′k+1 • Following this idea, we can make a further scan reduction: • C′k+1 is calculated from Ck for k > 1 • There must be enough memory space for all Ck's (k > 1) • Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk's)
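A small sketch of the scan-reduction idea: C′3 is computable from C2 alone, so no second database scan is needed to produce it. The `join` helper repeats the slide-27 sketch so this snippet runs on its own.

    from itertools import combinations

    def join(itemsets, k):
        # The (x) operator: join k-itemsets sharing k-1 items, prune by k-subsets.
        result = set()
        for x, y in combinations(itemsets, 2):
            u = x | y
            if len(u) == k + 1 and all(frozenset(s) in itemsets
                                       for s in combinations(u, k)):
                result.add(frozenset(u))
        return result

    L1 = {frozenset({i}) for i in "ABCE"}
    C2 = join(L1, 1)        # from L1, i.e. after the first database scan
    C3_prime = join(C2, 2)  # C'3 = C2 (x) C2 -- derived without knowing L2
    print(len(C2), len(C3_prime))  # 6 4: C'3 is a superset of C3 = L2 (x) L2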

  37. The Apriori Algorithm • Abstraction levels • Higher level associations are stronger (more powerful), but also less certain; • A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction)

  38. References • Devedzic, V., “Inteligentni informacioni sistemi,” Digit, FON, Beograd, 2003. • http://www.marconi.com • http://www.blueyed.com • http://www.fipa.org • http://www.rpi.edu • http://research.microsoft.com • http://imatch.lcs.mit.edu

  39. DM Process Model • 5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate) • SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess) • CRISP – tends to become a standard

  40. CRISP-DM • CRoss-Industry Standard Process for Data Mining • Conceived in 1996 by three companies: DaimlerChrysler, SPSS and NCR

  41. CRISP-DM Methodology Four-level breakdown of the CRISP-DM methodology: • Phases • Generic Tasks • Specialized Tasks • Process Instances

  42. Mapping generic models to specialized models • Analyze the specific context • Remove any details not applicable to the context • Add any details specific to the context • Specialize generic contents according to concrete characteristics of the context • Possibly rename generic contents to provide more explicit meanings

  43. CRISP-DM Model [Cycle diagram: the six phases form an iterative cycle] • Business understanding • Data understanding • Data preparation • Modeling • Evaluation • Deployment

  44. Business Understanding • Determine business objectives • Assess situation • Determine data mining goals • Produce project plan

  45. Data Understanding • Collect initial data • Describe data • Explore data • Verify data quality

  46. Data Preparation • Select data • Clean data • Construct data • Integrate data • Format data

  47. Modeling • Select modeling technique • Generate test design • Build model • Assess model

  48. Evaluation (results = models + findings) • Evaluate results • Review process • Determine next steps

  49. Deployment • Plan deployment • Plan monitoring and maintenance • Produce final report • Review project

  50. At Last…
