Overview of Data Mining and the KDD Process

Overview of Data Mining and the KDD Process Bamshad Mobasher DePaul University

From Data to Wisdom • Data • The raw material of information • Information • Data organized and presented by someone • Knowledge • Information read, heard or seen and understood and integrated • Wisdom • Distilled knowledge and understanding which can lead to decisions Wisdom Knowledge Information Data The Information Hierarchy

Why Data Mining? • The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, images, video, documents • Internet …

Source: Intel

How much data? 640Kought to be enough for anybody. Google: ~20-30 PB a day Wayback Machine has ~4 PB + 100-200 TB/month Facebook: ~3 PB of user data + 25 TB/day eBay: ~7 PB of user data + 50 TB/day CERN’s Large Hydron Collider generates 15 PB a year In 2010, enterprises stored 7 Exabytes = 7,000,000,000 GB

Big Data Growing IDC predicts: From 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes or 5,200 GB / person in 2020. The Untapped Data Gap: Most of the useful data will not be tagged or analyzed – partly due to skill shortage

What Is Data Mining? • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets • Data Mining: A Definition The non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data in large data repositories • Non-trivial: obvious knowledge is not useful • implicit: hidden difficult to observe knowledge • previously unknown • potentially useful: actionable; easy to understand

Data Mining: Confluence of Multiple Disciplines Machine Learning Pattern Recognition Statistics Data Mining Visualization Applications Algorithm Database Technology High-Performance Computing

Data Mining’s Virtuous Cycle Identifying the problem Mining data to transform it into actionable information Acting on the information Measuring the results 9

The Knowledge Discovery Process • Data Mining v. Knowledge Discovery in Databases (KDD) • DM and KDD are often used interchangeably • actually, DM is only part of the KDD process - The KDD Process

KDD Process in More Detail • Learning the application domain • Translate the business problem into a data mining problem • Gathering and integrating of data • Cleaning and preprocessing data • may be the most resource intensive part • Reducing and selecting data • find useful features, dimensionality reduction, etc. • Choosing functions of data mining • summarization, classification, regression, association, clustering, etc. • Choosing the mining algorithm(s) • Data mining: discover patterns of interest; construct models • Evaluate patterns and models • Interpretation: analysis of results • visualization, alteration, removing redundant patterns, querying • Use of discovered knowledge

Data Mining in Business Intelligence Increasing potential to support business decisions End User DecisionMaking Business Analyst Data Presentation Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses DBA Data Sources paper, files, Web documents, customer data, experiments, databases

What Can Data Mining Do • Two kinds of knowledge discovery: directed and undirected • Directed Knowledge Discovery (Supervised Learning) • Purpose: Explain value of some field in terms of all the others (goal-oriented) • Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances • Examples: • what products show increased sale when cream cheese is discounted • which banner ad to use on a web page for a given user coming to the site • Undirected Knowledge Discovery (Unsupervised Learning) • Purpose: Find patterns in the data that may be interesting (no target field) • Method: clustering, affinity grouping • Examples: • which products in the catalog often sell together • market segmentation (groups of customers/users with similar characteristics)

What Can Data Mining Do • Many Data Mining Tasks • often inter-related • often need to try different techniques for each task • each tasks may require different types of knowledge discovery • What are some of data mining tasks • Classification • Estimation or Prediction • Characterization • Discrimination • Affinity Grouping (Association Discovery) • Clustering • Sequence Analysis • Description

Some Applications of Data mining • Business data analysis and decision support • Marketing focalization • Recognizing specific market segments that respond to particular characteristics • Return on mailing campaign (target marketing) • Targeting banner ads • Customer Profiling • Segmentation of customer for marketing strategies and/or product offerings • Customer behavior understanding • Customer retention and loyalty • Mass customization / personalization / recommendation

Some Applications of Data mining • Business data analysis and decision support (cont.) • Market analysis and management • Provide summary information for decision-making • Market basket analysis, cross selling, market segmentation. • Resource planning • Risk analysis and management • "What if" analysis • Forecasting • Pricing analysis, competitive analysis • Time-series analysis (Ex. stock market)

Some Applications of Data mining • Fraud detection • Detecting telephone fraud: • Telephone call model: destination of the call, duration, time of day or week • Analyze patterns that deviate from an expected norm • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud scheme • Detection of credit-card fraud • Detecting suspicious money transactions (money laundering) • Text mining: • Message filtering (e-mail, newsgroups, etc.) • Newspaper articles analysis • Text and document categorization • Web Mining . . .

Types of Web Mining Web Mining Web Usage Mining Web Structure Mining Web Content Mining

Types of Web Mining Web Mining Web Usage Mining Web Structure Mining Web Content Mining • Applications: • document clustering or categorization • topic identification / tracking • concept discovery • focused crawling • content-based personalization • intelligent search tools

Types of Web Mining Web Mining Web Usage Mining Web Structure Mining Web Content Mining • Applications: • user and customer behavior modeling • Web site optimization • e-customer relationship management • Web marketing • targeted advertising • recommender systems

Types of Web Mining Web Mining Web Usage Mining Web Structure Mining Web Content Mining • Applications: • document retrieval and ranking (e.g., Google) • discovery of “hubs” and “authorities” • discovery of Web communities • social network analysis

Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational database, data warehouse, transactional database • Object-relational databases, Heterogeneous databases and legacy databases • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bio-sequences) • Structure data, graphs, social networks and information networks • Spatial data and spatiotemporal data • Multimedia database • Text databases • The World-Wide Web

Data Mining: What Kind of Data? • Structured Databases • relational, object-relational, etc. • can use SQL to perform parts of the process e.g., SELECT count(*) FROM Items WHERE type=video GROUP BY category

Data Mining: What Kind of Data? • Flat Files • most common data source • can be text (or HTML) or binary • may contain transactions, statistical data, measurements, etc. • Transactional databases • set of records each with a transaction id, time stamp, and a set of items • may have an associated “description” file for the items • typical source of data used in market basket analysis

Data Mining: What Kind of Data? • Other Types of Databases • legacy databases • multimedia databases (usually very high-dimensional) • spatial databases (containing geographical information, such as maps, or satellite imaging data, etc.) • Time Series Temporal Data (time dependent information such as stock market data; usually very dynamic) • World Wide Web • basically a large, heterogeneous, distributed database • need for new or additional tools and techniques • information retrieval, filtering and extraction • agents to assist in browsing and filtering • Web content, usage, and structure (linkage) mining tools • The “social Web” • User generated meta-data, social networks, shared resources, etc.

DM Tasks A Closer Look: Classification • Classifying observations/instances into different “given” classes (i.e., classification is “supervised”) • example, classifying credit applicants as low, medium, or high risk • normally use a training set where all objects are already associated with known class labels • classification algorithm learns from the training set and builds a model • the model is used to classify new objects • Suitable data mining tools • Decision Tree algorithms (CHAD, C 5.0, and CART) • Memory-based reasoning • Neural Network • Example (Hypothetical Video Store) • build a model of users based their rental history (returning on-time, payments, etc.) • observe the current customer’s rental and payment history • decide whether should charge a deposit to current customer

DM Tasks A Closer Look: Estimation or Prediction • Same as classification • except classify according to some predicted or estimated future value • In estimation, historical data is used to build a (predictive) model that explains the current observed behavior • Model can then be applied to new instances to predict future behavior or forecast the future value of some missing attribute • Examples • predicting the size of balance transfer if the prospect accepts the offer • predicting the load on a Web server in a particular time period • Suitable data mining tools • Memory-based reasoning • Regression models • Decision Trees • Neural Networks

DM Tasks A Closer Look: Affinity Grouping • Determine what items often go together (usually in transactional databases) • Often Referred to as Market Basket Analysis • used in retail for planning arrangement on shelves • used for identifying cross-selling opportunities • “should” be used to determine best link structure for a Web site • Examples • people who buy milk and beer also tend to buy diapers • people who access pages A and B are likely to place an online order • Suitable data mining tools • association rule discovery • clustering • Nearest Neighbor analysis (memory-based reasoning)

DM Tasks A Closer Look: Clustering • Like classification, clustering is the organization of data into classes • however, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes • also called unsupervised classification, because the classification is not dictated by given class labels • There are many clustering approaches • all based on the principle of maximizing the similarity between objects in a same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity) • Examples: • grouping similar users based on rating preferences on items • doing market segmentation of customers based on buying patterns and/or demographic attributes • grouping users of social network based on common relationships • identifying concepts by clustering news stories or blog posts

Characterization & Discrimination • Data characterization is a summarization of general features of objects in a target class • The data relevant to a target class are retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction • example: characterize the video store’s customers who regularly rent more than 30 movies a year • Data discrimination is used to compare of the general features of objects between two classes • comparison relevant features of objects between a target class and a contrasting class • example: compare the general characteristics of the customers who rented more than 30 movies in the last year with those who rented less than 5 • The techniques used for data discrimination are very similar to the techniques used for data characterization with the exception that data discrimination results include comparative measures

Example: Moviegoer Database

moviegoers.name sex age source movies.name Amy f 27 Oberlin Independence Day Andrew m 25 Oberlin 12 Monkeys Andy m 34 Oberlin The Birdcage Anne f 30 Oberlin Trainspotting Ansje f 25 Oberlin I Shot Andy Warhol Beth f 30 Oberlin Chain Reaction Bob m 51 Pinewoods Schindler's List Brian m 23 Oberlin Super Cop Candy f 29 Oberlin Eddie Cara f 25 Oberlin Phenomenon Cathy f 39 Mt. Auburn The Birdcage Charles m 25 Oberlin Kingpin Curt m 30 MRJ T2 Judgment Day David m 40 MRJ Independence Day Erica f 23 Mt. Auburn Trainspotting Example: Moviegoer Database SELECT moviegoers.name, moviegoers.sex, moviegoers.age, sources.source, movies.name FROM movies, sources, moviegoers WHERE sources.source_ID = moviegoers.source_ID AND movies.movie_ID = moviegoers.movie_ID ORDER BY moviegoers.name;

Example: Moviegoer Database • Classification • determine sex based on age, source, and movies seen • determine source based on sex, age, and movies seen • determine most recent movie based on past movies, age, sex, and source • Prediction or Estimation • for predict, need a continuous variable (e.g., “age”) • predict age as a function of source, sex, and past movies • if we had a “rating” field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc. • Clustering • find groupings of movies that are often seen by the same people • find groupings of people that tend to see the same movies • clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster that is dominated by people with young children; or a cluster of movies that correspond to a particular genre)

name TID Transaction Amy 001 {Independence Day, Trainspotting} Andrew 002 {12 Monkeys, The Birdcage, Trainspotting, Phenomenon} Andy 003 {Super Cop, Independence Day, Kingpin} Anne 004 {Trainspotting, Schindler's List} … … ... Example: Moviegoer Database • Affinity Grouping • market basket analysis (MBA): “which movies go together?” • need to create “transactions” for each moviegoer containing movies seen by that moviegoer: • may result in association rules such as: {“Phenomenon”, “The Birdcage”} ==> {“Trainspotting”} {“Trainspotting”, “The Birdcage”} ==> {sex = “f”} • Sequence Analysis • similar to MBA, but order in which items appear in the pattern is important • e.g., people who rent “The Birdcage” during a visit tend to rent “Trainspotting” in the next visit.

Major Issues in Data Mining (1) • Mining Methodology • Mining various and new kinds of knowledge • Mining knowledge in multi-dimensional space • Data mining: An interdisciplinary effort • Boosting the power of discovery in a networked environment • Handling noise, uncertainty, and incompleteness of data • Pattern evaluation and pattern- or constraint-guided mining • User Interaction • Interactive mining • Incorporation of background knowledge • Presentation and visualization of data mining results

Major Issues in Data Mining (2) • Efficiency and Scalability • Efficiency and scalability of data mining algorithms • Parallel, distributed, stream, and incremental mining methods • Diversity of data types • Handling complex types of data • Mining dynamic, networked, and global data repositories • Data mining and society • Social impacts of data mining • Privacy-preserving data mining • Invisible data mining

Overview of Data Mining and the KDD Process