1 / 54

Data Mining and Data Warehousing By Dr. NK SAKTHIVEL, Professor/ SoC SASTRA

Data Mining and Data Warehousing By Dr. NK SAKTHIVEL, Professor/ SoC SASTRA. Evolution of Database Technology. 1960s and Earlier : Hierarch Model Data collection and database creation, and Primitive File Processing 1970s – Early 1980s: Data-Base Management Systems Network DBMS

fdiaz
Télécharger la présentation

Data Mining and Data Warehousing By Dr. NK SAKTHIVEL, Professor/ SoC SASTRA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Mining and Data Warehousing ByDr. NK SAKTHIVEL,Professor/ SoCSASTRA Data Mining: Concepts and Techniques

  2. Evolution of Database Technology • 1960s and Earlier : Hierarch Model • Data collection and database creation, and Primitive File Processing • 1970s – Early 1980s: Data-Base Management Systems • Network DBMS • Relational data model, relational DBMS implementation • Data Modeling Tool • Indexing, B++ Tree and Hashing • Query Languages • Transaction Management : Recovery, Concurrency Control • On-Line Transaction Processing OLTP Data Mining: Concepts and Techniques

  3. Evolution of Database Technology • Mid 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s—2000s: Data Analysis and Understanding • Data mining and data warehousing, multimedia databases, and Web databases Data Mining: Concepts and Techniques

  4. Data Mining: Concepts and Techniques

  5. Query Well defined SQL Query Poorly defined No precise query language Database Processing vs. Data Mining Processing • Data • Operational data • Data • Not operational data • Output • Precise • Subset of database • Output • Fuzzy • Not a subset of database Data Mining: Concepts and Techniques

  6. Query Examples • Database • Data Mining • Find all credit applicants with last name of Smith. • Identify customers who have purchased more than $10,000 in the last month. • Find all customers who have purchased milk • Find all credit applicants who are poor credit risks. (classification) • Identify customers with similar buying habits. (Clustering) • Find all items which are frequently purchased with milk. (association rules) Data Mining: Concepts and Techniques

  7. What Is Data Mining? • Data mining - knowledge discovery in databases • Extraction of interesting (implicit, previously unknown and potentially useful)information or patterns from data in large databases • Alternative names and their “inside stories”: • Data mining: a misnomer? • Knowledge Discovery(mining) in Databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? • (Deductive) query processing. • Expert systems or small ML/statistical programs Data Mining: Concepts and Techniques

  8. What Is Data Mining? • Gold Mining / Rock Mining / Sand Mining • Data Mining / Knowledge Mining Data Mining: Concepts and Techniques

  9. Why Data Mining? — Potential Applications • Database analysis and decision support • Market analysis and management • target marketing, customer relation management, market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and management • Other Applications • Text mining (news group, email, documents) and Web analysis. • Intelligent query answering Data Mining: Concepts and Techniques

  10. Market Analysis and Management (1) • Where are the data sources for analysis? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Conversion of single to a joint bank account: marriage, etc. • Cross-market analysis • Associations/co-relations between product sales • Prediction based on the association information Data Mining: Concepts and Techniques

  11. Market Analysis and Management (2) • Customer profiling • data mining can tell you what types of customers buy what products (clustering or classification) • Identifying customer requirements • identifying the best products for different customers • use prediction to find what factors will attract new customers • Provides summary information • various multidimensional summary reports • statistical summary information (data central tendency and variation) Data Mining: Concepts and Techniques

  12. Corporate Analysis and Risk Management • Finance planning and asset evaluation • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning: • summarize and compare the resources and spending • Competition: • monitor competitors and market directions • group customers into classes and a class-based pricing procedure • set pricing strategy in a highly competitive market Data Mining: Concepts and Techniques

  13. Fraud Detection and Management (1) • Applications • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. • Approach • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances • Examples • auto insurance: detect a group of people who stage accidents to collect on insurance • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) • medical insurance: detect professional patients and ring of doctors and ring of references Data Mining: Concepts and Techniques

  14. Fraud Detection and Management (2) • Detecting inappropriate medical treatment • Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). • Detecting telephone fraud • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. • Retail • Analysts estimate that 38% of retail shrink is due to dishonest employees. Data Mining: Concepts and Techniques

  15. Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. Data Mining: Concepts and Techniques

  16. Data Mining: A KDD Process Knowledge Pattern Evaluation • Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration Databases Data Mining: Concepts and Techniques

  17. Steps of a KDD Process • Learning the application domain: • relevant prior knowledge and goals of application • Creating a target data set: data selection • Data cleaning and preprocessing: (may take 60% of effort!) • Data reduction and transformation: • Find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Pattern evaluation and knowledge presentation • visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge Data Mining: Concepts and Techniques

  18. Data Mining and Business Intelligence End User Increasing potential to support business decisions Making Decisions Business Analyst Data Presentation Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA DBA Data Sources Paper, Files, Information Providers, Database Systems, OLTP Data Mining: Concepts and Techniques

  19. Architecture of a Typical Data Mining System Graphical user interface Pattern evaluation Data mining engine Knowledge-base Database or data warehouse server Filtering Data cleaning & data integration Data Warehouse Databases Data Mining: Concepts and Techniques

  20. Data Mining: On What Kind of Data? • Relational databases – Interrelated Operations • Data warehouses - Collecting information from different Dbases • Transactional databases – File / Table Transaction • Advanced DB Systems and their applications • Object-oriented and object-relational databases • Spatial databases - from satellite • Time-series data-base and – Sequence of time ( Market) • Temporal data-base - Information with Timestamp • Text databases and multimedia databases – Objects and Images • Heterogeneous and legacy databases – network databases • WWW Data Mining: Concepts and Techniques

  21. Data Mining Functionalities Perform Inference for Prediction General Properties Data Mining: Concepts and Techniques

  22. Data Mining Functionalities (1) • Concept description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions ( further precise/accurate is possible ) • Association (correlation and causality) • Link Analysis or Association • Multi-dimensional vs. single-dimensional association • X is a Customer and T is a Transaction • age(X, “20..29”) ^ income(X, “20..29K”) à buys(X, “PC”) [support = 2%, confidence = 60%] • Support means % of transaction and • Confidence is degree of certainty of the detected association • contains(T, “computer”) à contains(T, “software”) [1%, 50%] • Bread with Pretzels is 60% and Bread with Jelly is 70% Data Mining: Concepts and Techniques

  23. Data Mining Functionalities (2) • Classification and Prediction(Supervised ) • Finding Groups / classes / models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on climate, or classify cars based on mileage • Presentation: decision-tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values • Pattern Matching • Credit card company must determine to authorize the credit card purchase • i. Authorize ii. Ask for further identification before authorize • iii. Do not authorize iv. Do not authorize and contact police Data Mining: Concepts and Techniques

  24. Data Mining Functionalities (2) Cluster analysis(Unsupervised ) • Pattern Analysis, Data Analysis, Image Processing and Market Research • By clustering, one can identify dense and discover overall distribution patterns • Early in childhood, one learns how to distinguish between cats and dogs or animals and plants • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity Data Mining: Concepts and Techniques

  25. Data Mining Functionalities (3) • Outlier analysis(used to identify error) • Outlier: a data object that does not comply (fulfill) with the general behavior of the data • For example, the program output as • Age : -18 • Salary of the chief executive officer of a company • Data Mining algorithms try to minimize the influence of outliers or eliminate them all together. • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis Data Mining: Concepts and Techniques

  26. Data Mining Functionalities (3) • Evolution analysis • It describes models regularities or rends for objects, whose behavior changes over time • It includes Association, Classification or clustering of Time-Related data. However, distinct features are • Time-series data analysis (Stock Market Analysis for last several years to invest shares of high-tech industrial company) • Sequence or periodicity pattern matching • Similarity-based data analysis Data Mining: Concepts and Techniques

  27. Are All of the Patterns Interesting • Data Mining system can generate thousands of Patterns or rules • What make patterns interesting? • Can a data mining system generate all of the interesting patterns? • Can a data mining system generate only interesting patterns? Data Mining: Concepts and Techniques

  28. Are All of the Patterns Interesting • What make patterns interesting? • Interesting means knowledge • Hypothesis become confirmed • Support and confidence • Can a data mining system generate all of the interesting patterns? • Need completeness of the DMAlgorithm • Depends on association Rule Data Mining: Concepts and Techniques

  29. Mining Association Rules—An Example Min. support 50% Min. confidence 50% For rule A  C: support = support({A C}) = 50% confidence = support({A C})/support({A}) = 66.6% Data Mining: Concepts and Techniques

  30. Data Mining: Confluence of Multiple Disciplines Database Technology Statistics Data Mining Machine Learning Visualization Information Science Other Disciplines Fuzzy/Neural/GA Data Mining: Concepts and Techniques

  31. Data Mining Development • Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines • Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques • Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis • Algorithm Design Techniques • Algorithm Analysis • Data Structures • Neural Networks • Decision Tree Algorithms Data Mining: Concepts and Techniques

  32. Categorized of Data Mining • Classification according to the kinds of Database mined • Classification according to the kinds of Knowledge Mined • Classification according to the kinds of Technique Utilized • Classification according to the kinds of Application Adapted Data Mining: Concepts and Techniques

  33. Data Mining: Classification Schemes • General functionality • Descriptive data mining • Predictive data mining • Different views, different classifications • Kinds of databases to be mined • Kinds of knowledge to be discovered • Kinds of techniques utilized • Kinds of applications adapted Data Mining: Concepts and Techniques

  34. Ex: Time Series Analysis • Example: Stock Market • Predict future values • Determine similar patterns over time • Classify behavior Data Mining: Concepts and Techniques

  35. Related Concepts Outline Goal: Examine some areas which are related to data mining. • Database/On Line Transaction Processing Systems • Fuzzy Sets and Logic • Information Retrieval(Web Search Engines) • On Line Analytic Processing /DSS • Statistics • Machine Learning • Pattern Matching Data Mining: Concepts and Techniques

  36. Major Issues in Data Mining • Regarding • Mining Methodology and User Interaction • Performance and • Diverse Data Types Data Mining: Concepts and Techniques

  37. Major Issues in Data Mining Mining Methodology and User Interaction • This is the kinds of Knowledge Mined • Different users have different knowledge • Hence, spectrum of data analysis and knowledge discovery tasks with Dmining functionalities • These tasks may use same database with different techniques • Which leads different discovery • The ability to mine knowledge • Interactive Mining with refining knowledge with OLAP • Incorporation of backward knowledge with deduction rule • The use of domain knowledge and Ad hoc mining • Which is used for efficient data mining • Knowledge Visualization like graphs, charts, curves Data Mining: Concepts and Techniques

  38. Major Issues in Data Mining Performance Issues • It includes Efficiency, Scalability and Parallelism of DM Algorithms • Efficiency and scalability of Data Mining Algorithm • DM Algorithms should be efficient and scalable • The running time to be predictable for large databases • Parallel, Distributed and Incremental DM Algorithms • Computational Complexity should be analyzed • Algorithm has to divide the data into partitions and then can proceed in parallel • The results can merge later • If need, the dbase can update (incremental) before computation Data Mining: Concepts and Techniques

  39. Major Issues in Data Mining Diversity of Database Types • Relational Database • Data Warehouse • Complex Data Objects • Hypertext and Multimedia Data • Spatial Data • Temporal Data and etc and • Internet having Heterogeneous DataBases • Hence, we need an efficient and effective data mining systems for different dbases, for different goals Data Mining: Concepts and Techniques

  40. Summary • Database Technology for DBMSystems • Data Mining for discovering interesting patterns • Knowledge Discovery Process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation • Data Warehouse organized way for decision making • Data Mining Functionalities Description, association, Classification, Prediction, clustering, trend analysis, deviation analysis, similarity analysis • Data Mining Systems Data Mining: Concepts and Techniques

  41. Exercises • What is data mining? In your answer, address the following: • Is it another hype? • Is it a simple transformation of technology developed from databases, statistics, and machine learning? • Explain how the evolution of database technology led to data mining. • Describe the steps involved in data mining when viewed as a process of knowledge discovery. Data Mining: Concepts and Techniques

  42. Exercises • What is data mining? In your answer, address the following: • Data mining refers the process or method that extracts or "mines" interesting knowledge or patterns from large amounts of data. • Is it another hype? • Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data, into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology. Data Mining: Concepts and Techniques

  43. Exercises • Is it a simple transformation of technology developed from databases, statistics, and machine learning? • No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. • Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple, disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. Data Mining: Concepts and Techniques

  44. Exercises • Explain how the evolution of database technology led to data mining. • Database technology began with the development of data collection and database creation mechanisms that, led to the development of effective mechanisms for data management including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity. Data Mining: Concepts and Techniques

  45. Exercises • Describe the steps involved in data mining when viewed as a process of knowledge discovery. • The steps involved in data mining when viewed as a process of knowledge discovery arc as follows: • Data cleaning, a process that removes or transforms noise and inconsistent data • Data integration, where multiple data sources may be combined • Data selection, where data relevant to the analysis task are retrieved from the database • Data transformation, where data are transformed or consolidated into forms appropriate for mining • Data mining, an essential process where intelligent and efficient methods are applied in order to extract, patterns • Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge leased on some interestingness measures • Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user Data Mining: Concepts and Techniques

  46. Exercises • Distinguish between data warehouse and database • Applications of Data Mining • Architecture and KDD Process of Data Mining • Functionalities • Challenges and Issues Data Mining: Concepts and Techniques

  47. DATA WAREHOUSE • Distinguish between data warehouse and database • Applications of Data Mining • Architecture and KDD Process of Data Mining • Functionalities • Challenges and Issues Data Mining: Concepts and Techniques

  48. Data Mining: Concepts and Techniques

  49. Data Warehousing and OLAP Technology for Data Mining • What is a data warehouse? • A multi-dimensional data model • Data warehouse architecture • Data warehouse implementation • Further development of data cube technology • From data warehousing to data mining Data Mining: Concepts and Techniques

  50. What is Data Warehouse? • Defined in many different ways • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon Data Mining: Concepts and Techniques

More Related