Chapter I: Introduction BIS 541 2013/2014 Summer

Chapter 1. Introduction • Motivation: Why data mining? • Methodology of Knowledge Discovery in Databases • Data mining functionalities • Are all the patterns interesting? • Business applications of data mining

Motivation: “Necessity is the Mother of Invention” • Data explosion problem • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • Need to convert such data into knowledge and information • Applications • Business management • Production control • Market analysis • Engineering design • Science exploration

Evolution of Database Technology (1) • Data collection, database creation • Data management • data storage and retrieval • database transaction processing • Data analysis and understanding • Data mining and data warehousing

Evolution of Database Technology (2) • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: • Data mining, data warehousing, multimedia databases, and Web databases • 2000s • Stream data management and mining • Data mining and its applications • Web technology (XML, data integration) and global information systems

The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets

Developments in computer hardware • Powerful and affordable computers • Data collection equipment • Storage media • Communication and networking

Data Warehouse • Data cleaning • Data integration • OLAP: On-Line Analytical Processing • summarization • consolidation • aggregation • view information from different angles • but additional data analysis tools are needed for • classification • clustering • characterization of data changing over time

Data rich information poor situation • Abundance of data • need for powerful data analysis tools • “data tombs” - data archives • seldom visited • Important decisions are made • not on the information rich data stored in databases • but on a decision maker’s intuition • no tool to extract knowledge embedded in vast amounts of data • Expert system technology • domain experts to input knowledge • time consuming and costly

What Is Data Mining? • Data mining (knowledge discovery in databases): • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases • Alternative names and their “inside stories”: • Data mining: a misnomer? • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • What is not data mining? • Query processing • Expert systems or small ML/statistical programs

Data Mining vs. Data Query • Data query, e.g.: • A list of all customers who use a credit card to buy a PC • A list of all MIS students who have a GPA of 3.5 or higher and have studied for four or fewer semesters • Data mining problems, e.g.: • What is the likelihood of a customer purchasing a PC with a credit card? • Given the characteristics of an MIS student, predict her SPA in the coming term • What are the characteristics of MIS undergraduate students?

Why Data Mining? • Four questions to be answered • Can the problem be clearly defined? • Does potentially meaningful data exist? • Does the data contain hidden knowledge, or is it useful only for reporting purposes? • Will the cost of processing the data be less than the likely increase in profit from the knowledge gained through the data mining project?

Steps of a KDD Process (1) • 1. Goal identification: • Define problem • relevant prior knowledge and goals of application • 2. Creating a target data set: data selection • 3. Data preprocessing: (may take 60%-80% of effort!) • removal of noise or outliers • strategies for handling missing data fields • accounting for time sequence information • 4. Data reduction and transformation: • Find useful features, dimensionality/variable reduction, invariant representation

Steps of a KDD Process (2) • 5. Data Mining: • Choosing functions of data mining: • summarization, classification, regression, association, clustering • Choosing the mining algorithm(s): • which models or parameters • Search for patterns of interest • 6. Presentation and Evaluation: • visualization, transformation, removing redundant patterns, etc. • 7. Taking action: • incorporating into the performance system • documenting • reporting to interested parties

An example: Customer Segmentation • 1. Marketing department wants to perform a segmentation study on the customers of AE Company • 2. Decide on relevant variables from a data warehouse on customers, sales, promotions • Customers: name, ID, income, age, education, ... • Sales: history of sales • Promotion: promotion types, durations, ... • 3. Handle missing income, addresses, ... • determine outliers, if any • 4. Generate new index variables representing wealth of customers • Wealth = a*income + b*#houses + c*#cars + ... • Make necessary transformations, such as z-scores, so that some data mining algorithms work more efficiently
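
Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own implementation: the customer records, the choice of mean imputation for missing income, and the weights a, b, c are all made up for the example.

```python
from statistics import mean, pstdev

# Hypothetical customer records; None marks a missing income (step 3).
customers = [
    {"id": 1, "income": 40.0, "houses": 1, "cars": 1},
    {"id": 2, "income": None, "houses": 0, "cars": 1},
    {"id": 3, "income": 90.0, "houses": 2, "cars": 2},
    {"id": 4, "income": 50.0, "houses": 1, "cars": 0},
]

def impute_income(rows):
    """Replace missing incomes with the mean of the known incomes."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fill = mean(known)
    return [{**r, "income": fill if r["income"] is None else r["income"]} for r in rows]

def wealth(row, a=1.0, b=20.0, c=5.0):
    """Wealth = a*income + b*#houses + c*#cars; the weights are illustrative only."""
    return a * row["income"] + b * row["houses"] + c * row["cars"]

def z_scores(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

filled = impute_income(customers)
wealth_index = [wealth(r) for r in filled]
wealth_z = z_scores(wealth_index)
```

Standardizing puts every variable on the same scale, which matters for distance-based algorithms where a variable with large raw values would otherwise dominate.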

Example: Customer Segmentation cont. • 5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study, so as to find groups of customers with similar characteristics • 5.b: Choose a clustering algorithm • k-means or k-medoids or any one suitable for the problem • 5.c: Apply the algorithm • Find clusters or segments • 6. Make reverse transformations, visualize the customer segments • 7. Present the results in the form of a report to the marketing department • Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive • Develop marketing strategies for each segment
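
A minimal sketch of steps 5.b and 5.c, assuming k-means is the chosen algorithm. The two standardized features per customer, the sample values, and the decision to seed the centroids with the first k points are illustrative assumptions; practical implementations use more careful initialization.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Plain Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical standardized features per customer: (wealth z-score, age z-score).
customers = [(-1.0, -0.9), (1.1, 0.9), (-0.8, -1.1), (0.9, 1.0), (-0.9, -1.0), (1.0, 1.1)]
labels, centroids = k_means(customers, k=2)
```

Each label is the segment assigned to the corresponding customer. The centroids describe a typical member of each segment, and applying the reverse transformation of step 6 would express them in the original units.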

Data Mining: A KDD Process • Data mining: the core of the knowledge discovery process • [Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

Data Mining in Business Intelligence • [Figure: pyramid; the potential to support business decisions increases from bottom to top] • Decision Making (End User) • Data Presentation: visualization techniques (Business Analyst) • Data Mining: information discovery (Data Analyst) • Data Exploration: statistical summary, querying, and reporting • Data Preprocessing/Integration, Data Warehouses (DBA) • Data Sources: paper, files, Web documents, scientific experiments, database systems • September 3, 2014, Data Mining: Concepts and Techniques

Architecture of a Typical Data Mining System • [Figure: layers from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning, data integration, and filtering; Databases and Data Warehouse; with a Knowledge base alongside the upper layers]

Architecture of a Typical Data Mining System • Database, data warehouse • Database or data warehouse server • Knowledge base • concept hierarchies • user beliefs • to assess a pattern’s interestingness • other thresholds • Data mining engine • functional modules • characterization, association, classification, cluster analysis, evolution and deviation analysis • Pattern evaluation module • Graphical user interface

Data Mining: Confluence of Multiple Disciplines • [Figure: Data Mining at the center, drawing on the following] • Database Technology • Statistics • Machine Learning • Visualization • Information Science • Other Disciplines

Why Confluence of Multiple Disciplines? • Tremendous amount of data • Algorithms must be highly scalable to handle terabytes of data • High dimensionality of data • Micro-array data may have tens of thousands of dimensions • High complexity of data • Data streams and sensor data • Time-series data, temporal data, sequence data • Structured data, graphs, social networks and multi-linked data • Heterogeneous databases and legacy databases • Spatial, spatiotemporal, multimedia, text and Web data • Software programs, scientific simulations • New and sophisticated applications

Efficient and Scalable Techniques • For an algorithm to be efficient and scalable • its running time should be predictable and acceptable • How • Parallel and distributed algorithms • Sampling from databases
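
One possible way to sample from a table too large to hold in memory is reservoir sampling, sketched below. The slide does not name a specific sampling technique, so this choice, the function name reservoir_sample, and the stand-in rows are assumptions made for illustration.

```python
import random

def reservoir_sample(rows, n, seed=0):
    """Return n rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)       # fill the reservoir with the first n rows
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < n:
                sample[j] = row      # replace a random slot with decreasing probability
    return sample

# Stand-in for a large table that is scanned once, row by row.
rows = (("customer", i) for i in range(100_000))
sample = reservoir_sample(rows, n=1000)
```

The table is read in a single pass and only n rows are kept in memory, so the running time grows linearly and predictably with the table size.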

Two Styles of Data Mining • Descriptive data mining • characterize the general properties of the data in the database • finds patterns in data and • the user determines which ones are important • Predictive data mining • perform inference on the current data to make predictions • we know what to predict • Not mutually exclusive • used together • Descriptive → predictive • E.g., customer segmentation: descriptive, by clustering • Followed by a risk assignment model: predictive, by ANN

Supervised vs. Unsupervised Learning • Supervised learning (classification, prediction) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (summarization, association, clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Descriptive Data Mining (1) • Discovering new patterns inside the data • Used during the data exploration steps • Typical questions answered by descriptive data mining • what is in the data • what does it look like • are there any unusual patterns • what does the data suggest for customer segmentation • users may have no idea • which kind of patterns may be interesting

Descriptive Data Mining (2) • patterns at various granularities • geography • country - city - region - street • student • university - faculty - department - minor • Functionalities of descriptive data mining • Clustering • Ex: customer segmentation • Summarization • Visualization • Association • Ex: market basket analysis

A model is a black box • X: vector of independent variables or inputs • Y = f(X): an unknown function • Y: dependent variables or output • a single variable or a vector • [Figure: inputs X1, X2 → Model → output Y] • The user does not care what the model is doing • it is a black box • the user is interested in the accuracy of its predictions

Predictive Data Mining (1) • Using known examples the model is trained • the unknown function is learned from data • the more data with known outcomes is available • the better the predictive power of the model • Used to predict outcomes whose inputs are known but whose output values are not realized yet • Never 100% accurate

Predictive Data Mining (2) • The performance of a model on past data, where it predicts outcomes that are already known, is not important • Its performance on unknown data is much more important
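
A hold-out evaluation makes this point concrete: part of the labeled data is set aside and used only for measuring performance. The sketch below is an illustration under assumptions; the 1-nearest-neighbour model, the synthetic data with 10% label noise, and the 70/30 split are not from the slides.

```python
import random
from math import dist

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into a training part and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """Label x with the class of its nearest training example."""
    return min(train, key=lambda row: dist(row[0], x))[1]

def accuracy(train, rows):
    """Fraction of rows whose label the 1-NN model reproduces."""
    return sum(predict_1nn(train, x) == y for x, y in rows) / len(rows)

# Hypothetical labeled data: class 1 when x0 + x1 > 1, with 10% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = (rng.random(), rng.random())
    y = int(x[0] + x[1] > 1)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, test = holdout_split(data)
train_acc = accuracy(train, train)  # the model has memorized these rows
test_acc = accuracy(train, test)    # estimate of performance on unseen rows
```

A 1-nearest-neighbour model reproduces every training label, so its accuracy on past data is perfect by construction and says nothing about quality. The accuracy on the held-out rows is the number to report, and with noisy labels it will typically be lower.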

Typical questions answered by predictive models • Who is likely to respond to our next offer • based on history of previous marketing campaigns • Which customers are likely to leave in the next six months • What transactions are likely to be fraudulent • based on known examples of fraud • What is the total spending of a customer likely to be in the next month

Data Mining Functionalities (1) • Concept description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders • Association (correlation and causality) • Multi-dimensional vs. single-dimensional association • age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%] • contains(T, “computer”) → contains(T, “software”) [1%, 75%]

Data Mining Functionalities (2) • Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify people as healthy or sick, or classify transactions as fraudulent or not • Methods: decision-tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values • Cluster analysis • Class label is unknown: Group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3) • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: regression analysis • Sequential pattern mining: click stream analysis • Similarity-based analysis • Other pattern-directed or statistical analyses
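
As a small illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation. The z-score rule, the threshold of 2, and the spending figures are assumptions chosen for the example; the slide does not prescribe a particular detection method.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily card spending for one customer; 95 is the odd one out.
spending = [10, 12, 11, 13, 12, 11, 95]
suspicious = z_score_outliers(spending)
```

Whether a flagged value is noise to be removed or an exception worth investigating, as in fraud detection, depends on the application.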

Concept Description • Characterization • Discrimination • Data • classes or • concepts • classes of items for sale • computers, printers • concepts of customers: • bigSpenders • budgetSpenders

Data Characterization • Summarizing the data of the class under study (target class) • Methods • SQL queries • OLAP roll-up operation • user-controlled data summarization • along a specified dimension • attribute-oriented induction • without step-by-step user interaction • the output of characterization • pie charts, bar charts, curves, multidimensional data cubes, or cross tabs • in rule form as characteristic rules

Characterization example • Description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics • age, employment, income • drill down on any dimension • e.g., on occupation, to view these customers according to their type of employment

Data Discrimination • Comparing the target class with one or a set of comparative classes (contrasting classes) • these classes can be specified by the user • through database queries • methods and output • similar to those used for characterization • include comparative measures to distinguish between the target and contrasting classes

Discrimination examples • Example 1: Compare the general features of software products • whose sales increased by 10% in the last year (target class) • whose sales decreased by at least 30% during the same period (contrasting class) • Example 2: Compare two groups of AE customers • I) who shop for computer products regularly (target class) • more than two times a month • II) who rarely shop for such products (contrasting class) • less than three times a year • The resulting description: • 80% of group I customers • university education • ages 20-40 • 60% of group II customers • seniors or young • no university degree

Multidimensional Data • Sales according to region, month and product type • Dimensions: Product, Location, Time • Hierarchical summarization paths • Product: Industry → Category → Product • Location: Region → Country → City → Office • Time: Year → Quarter → Month or Week → Day • [Figure: data cube with axes Region, Product, Month]
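
Summarizing along a hierarchy (an OLAP roll-up) amounts to regrouping the same facts under a coarser key. The sketch below shows one level of the Location hierarchy, City to Country; the sales facts, the city-to-country mapping, and the helper name roll_up are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, month, product, amount).
sales = [
    ("Istanbul", "2014-01", "PC", 120),
    ("Istanbul", "2014-02", "Printer", 30),
    ("Ankara", "2014-01", "PC", 80),
    ("Berlin", "2014-01", "PC", 100),
    ("Berlin", "2014-02", "PC", 60),
]

# One level of the Location hierarchy: City -> Country.
city_to_country = {"Istanbul": "Turkey", "Ankara": "Turkey", "Berlin": "Germany"}

def roll_up(facts, key):
    """Sum the amount over all facts that share the same key(fact)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)

by_city = roll_up(sales, key=lambda f: f[0])
by_country = roll_up(sales, key=lambda f: city_to_country[f[0]])
by_country_month = roll_up(sales, key=lambda f: (city_to_country[f[0]], f[1]))
```

Drilling down is the reverse move: going from by_country back to by_city exposes the finer level of the same dimension.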

Association Analysis • Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data • widely used for • market basket analysis • transaction data analysis • more formally • X → Y, that is • A1 ^ A2 ^ ... ^ Ak → B1 ^ B2 ^ ... ^ Bl • Ai, Bj are attribute-value pairs or predicates

Example: association analysis • From the AllElectronics (AE) database • age(X, “20..29”) ^ income(X, “1,000...2,000”) → buy(X, “Notebook computer”) • (support = 2%, confidence = 60%) • X is a variable representing a customer • 2% of the AE customers are • between 20 and 29 years of age • with incomes ranging from 1 to 2 billion TL • and buy a notebook • there is a 60% probability that customers in those age and income groups will buy a notebook • a multidimensional association rule • contains more than one attribute or predicate

Market basket analysis • Customers' buying behaviour is investigated • Based only on the transaction data • no information about customer properties such as age or income • Managers • are interested in which products or product groups are sold together

Transactional Database

Example: basket analysis rule • buy(notebook) → buy(printer) • (support = 1%, confidence = 60%) • 1% of all transactions contain • both a notebook and a printer • if a transaction contains a notebook • there is a 60% chance that it contains a printer as well • a single-dimensional association rule • contains a single predicate • an association rule is interesting if • its support exceeds a minimum threshold and • its confidence exceeds a minimum threshold • These minimum values are set by specialists
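
Support and confidence can be computed directly from a list of transactions, as in the minimal sketch below. The baskets, the function names, and the threshold values are hypothetical; only the two definitions (share of all transactions containing both sides of the rule, and share of antecedent transactions that also contain the consequent) come from the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical baskets; each transaction is a set of purchased items.
baskets = [
    {"notebook", "printer"},
    {"notebook", "printer", "paper"},
    {"notebook"},
    {"printer", "paper"},
    {"paper"},
]

# Rule: buy(notebook) -> buy(printer)
rule_support = support(baskets, {"notebook", "printer"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"notebook"}, {"printer"})  # 2 of 3 notebook baskets

min_support, min_confidence = 0.3, 0.6  # thresholds chosen for illustration
is_interesting = rule_support > min_support and rule_confidence > min_confidence
```

With realistic data the thresholds would be far lower for support, as the 1% figure in the example rule suggests, since any single item combination appears in only a small share of transactions.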

Classification • Learning is supervised • Dependent variable is categorical • Build a model able to assign new instances to one of a set of well-defined classes

Typical Classification Problems • Given characteristics of individuals, differentiate those who have suffered a heart attack from those who have not • Determine if a credit card purchase is fraudulent • Classify a car loan applicant as a good or a poor credit risk

Methods of Classification • Decision Trees • Artificial Neural Networks • Bayesian Classification • Naïve • Belief Networks • k-nearest neighbor • Regression • Logistic (logit), probit • Predicts probability of each class • when the dependent variable is categorical • e.g., good customer vs. bad customer, or employed vs. unemployed
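
To show how a logistic (logit) model turns inputs into a class probability, here is a minimal sketch. The two features, the coefficients, and the 0.5 cut-off are made-up assumptions; in practice the coefficients would be estimated from labeled training data.

```python
from math import exp

def logistic(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(features, weights, bias):
    """P(class = 1 | features) under a logistic (logit) model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Made-up coefficients for a "good customer" model; not fitted to any data.
weights = [0.8, -1.5]   # standardized income, number of late payments
bias = 0.2

p_good = predict_proba([1.0, 0.0], weights, bias)
label = "good customer" if p_good >= 0.5 else "bad customer"
```

The model outputs a probability rather than a hard label, so the cut-off can be moved to reflect the relative cost of the two kinds of misclassification.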
