M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, 2009

COMP527: Data Mining
Introduction to Data Mining
January 28, 2009

COMP527: Data Mining
• Introduction to the Course
• Introduction to Data Mining
• Introduction to Text Mining
• General Data Mining Issues
• Data Warehousing
• Classification: Challenges, Basics
• Classification: Rules
• Classification: Trees
• Classification: Trees 2
• Classification: Bayes
• Classification: Neural Networks
• Classification: SVM
• Classification: Evaluation
• Classification: Evaluation 2
• Regression, Prediction
• Input Preprocessing
• Attribute Selection
• Association Rule Mining
• ARM: A Priori and Data Structures
• ARM: Improvements
• ARM: Advanced Techniques
• Clustering: Challenges, Basics
• Clustering: Improvements
• Clustering: Advanced Algorithms
• Hybrid Approaches
• Graph Mining, Web Mining
• Text Mining: Challenges, Basics
• Text Mining: Text-as-Data
• Text Mining: Text-as-Language
• Revision for Exam

Today's Topics
• What is Data Mining?
  - Definitions
  - KDD: Knowledge Discovery in Databases
  - KDD Process
  - Differences with Statistics
  - Views on the Process
  - Basic Functions
• Why would you do this?
  - Motivations
  - Applications
• Summary

What is Data Mining?
Some definitions:
• “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Piatetsky-Shapiro)
• “...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams.” (Han, pg xxi)
• “...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)
• “...finding hidden information in a database.” (Dunham, pg 3)
• “...the process of employing one or more computer learning techniques to automatically analyse and extract knowledge from data contained within a database.” (Roiger, pg 4)

What is Data Mining?
Keywords from each definition:
• “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Piatetsky-Shapiro)
• “...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams.” (Han, pg xxi)
• “...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)
• “...finding hidden information in a database.” (Dunham, pg 3)
• “...the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database.” (Roiger, pg 4)

KDD: Knowledge Discovery in Databases
Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.
Dunham:
• KDD is the process of finding useful information and patterns in data.
• Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.
For this course, we will discuss the entire process (KDD) but focus mostly on the algorithms used for discovery.

KDD: Knowledge Discovery in Databases
KDD (Knowledge Discovery in Databases) is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
• Or, KDD: the nontrivial extraction of implicit, previously unknown and potentially useful information
• Data mining is just one part of the KDD process
• Data mining applies algorithms to large data sets to produce models or patterns that are interesting to the user

The Data Mining (KDD) Process
[Figure: the data mining (KDD) process]

KDD Process Components
• Operational Data
  - Day-to-day data used to run the business
• Clean, collect and summarise
  - Most data is not suitable for data mining
  - Errors or noise, missing data, invalid formats
• Data warehouse
  - Mega store of clean (analysis) data
• Data Preparation
  - Validating the data for mining (e.g. removing noise, formatting, running validation routines)
• Training Data: the data that the mining algorithms learn from
• Data Mining: the process of applying mining algorithms to data to produce interesting patterns
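
The data preparation step can be illustrated with a small validation routine. This is only a minimal sketch under stated assumptions: the records, the field names `age` and `income`, the plausibility range for age, and the decision to drop bad rows rather than repair them are all hypothetical choices made for illustration.

```python
from typing import Optional

# Hypothetical raw operational records; names and values are invented.
RAW_RECORDS = [
    {"age": "34", "income": "28000"},
    {"age": "", "income": "31000"},      # missing data
    {"age": "abc", "income": "19000"},   # invalid format
    {"age": "230", "income": "45000"},   # noise: implausible value
    {"age": "51", "income": "52000"},
]


def clean_record(record: dict) -> Optional[dict]:
    """Return a validated record with numeric fields, or None if unusable."""
    try:
        age = int(record["age"])
        income = float(record["income"])
    except (KeyError, ValueError):
        return None  # missing or badly formatted field
    if not 0 <= age <= 120:  # assumed plausibility range
        return None
    return {"age": age, "income": income}


def prepare(records: list) -> list:
    """Keep only the records that pass validation."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


if __name__ == "__main__":
    for row in prepare(RAW_RECORDS):
        print(row)
```

Dropping a row is the simplest policy; a real preparation step might instead fill in missing values or flag the row for review.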

Differences with Statistics
Data Mining:
• Algorithms scale to large data
• Data is used secondarily for data mining (it was collected for another purpose)
• DM tools use background knowledge for the end user
• Strategy: exploratory, cyclic
Statistics:
• Many algorithms have quadratic run time
• Data is collected primarily for the statistical analysis
• Statistical background is often required
• Strategy: confirmatory, verifying, few loops

Piatetsky-Shapiro View (as tweaked by Dunham)
Initial Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Data Model → (Interpretation) → Knowledge
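
The chain of stages in this view can be read as a sequence of functions, each consuming the output of the one before it. The sketch below is illustrative only: the stage bodies are toy stand-ins (the "model" is just a mean), and the `temp` field and sample rows are hypothetical. It also assumes at least two distinct non-missing values, so that the rescaling does not divide by zero.

```python
from functools import reduce


def selection(initial):
    """Initial data -> target data: keep only the attribute of interest."""
    return [row.get("temp") for row in initial]


def preprocessing(target):
    """Target data -> preprocessed data: drop missing values."""
    return [v for v in target if v is not None]


def transformation(preprocessed):
    """Preprocessed data -> transformed data: rescale to the range [0, 1]."""
    lo, hi = min(preprocessed), max(preprocessed)
    return [(v - lo) / (hi - lo) for v in preprocessed]


def data_mining(transformed):
    """Transformed data -> model: the mean, as a stand-in for a real algorithm."""
    return {"mean": sum(transformed) / len(transformed)}


def interpretation(model):
    """Model -> knowledge: a human-readable statement."""
    return f"typical scaled value is {model['mean']:.2f}"


STAGES = [selection, preprocessing, transformation, data_mining, interpretation]


def run_kdd(initial):
    """Feed the initial data through every stage in order."""
    return reduce(lambda data, stage: stage(data), STAGES, initial)


if __name__ == "__main__":
    rows = [{"temp": 10.0}, {"temp": None}, {"temp": 20.0}, {"temp": 30.0}]
    print(run_kdd(rows))
```

In practice the process is cyclic rather than a single pass: results from interpretation often send the analyst back to an earlier stage.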

CRISP-DM View
[Figure: the CRISP-DM process]

Data Mining Functions
All Data Mining functions can be thought of as attempting to find a model to fit the data.
• Each function needs criteria for preferring one model over another.
• Each function needs a technique for comparing the data.
Two types of model:
• Predictive models predict unknown values based on known data
• Descriptive models identify patterns in data
Each type has several sub-categories, each of which has many algorithms. We won't have time to look at ALL of them in detail.

Data Mining Functions
Predictive (Supervised Learning):
• Classification: maps data into predefined classes
• Regression: maps data into a function
• Prediction: predicts future data states
• Time Series Analysis: analyses data over time
Descriptive (Unsupervised Learning):
• Clustering: finds groups of similar items
• Association Rules: finds relationships between items
• Characterisation: derives representative information
• Sequence Discovery: finds sequential patterns

Classification
The aim of classification is to create a model that can predict the 'type' or category of a data instance that doesn't have one.
Two phases:
1. Training: given labelled data instances, learn a model that predicts their class labels.
2. Prediction: given an unlabelled, unseen instance, use the model to predict its class label.
Some algorithms predict only a binary split (yes/no), some can predict 1 of N classes, and some give probabilities for each of N classes.
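
The two phases can be sketched with a 1-nearest-neighbour classifier, chosen here only because it is short; it is not a method the slides single out. The feature values and the "good"/"bad" labels are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math


def train(instances, labels):
    """Training phase: 1-NN simply memorises the labelled instances."""
    return list(zip(instances, labels))


def predict(model, instance):
    """Prediction phase: return the label of the closest stored instance."""
    nearest = min(model, key=lambda pair: math.dist(pair[0], instance))
    return nearest[1]


if __name__ == "__main__":
    # Hypothetical two-feature instances with known class labels.
    instances = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
    labels = ["bad", "bad", "good", "good"]
    model = train(instances, labels)
    print(predict(model, (0.9, 1.1)))
    print(predict(model, (5.2, 5.1)))
```

This sketch predicts 1 of N classes. An algorithm that gives probabilities would return a score per class instead of a single label.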

Clustering
The aim of clustering is similar to classification, but without predefined classes. Clustering attempts to find clusters of data instances which are more similar to each other than to instances outside of the cluster.
Unsupervised Learning: learning by observation, rather than by example.
Some algorithms must be told how many clusters to find; others try to find an 'appropriate' number of clusters.
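
As an illustration of an algorithm that must be told how many clusters to find, here is a minimal k-means sketch. The choice of k-means, the naive initialisation from the first `k` points, the fixed iteration count and the sample points are all assumptions made for brevity.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters; return (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


if __name__ == "__main__":
    points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
    centroids, clusters = kmeans(points, k=2)
    for centroid, members in zip(centroids, clusters):
        print(centroid, members)
```

No labels are supplied anywhere: the groups emerge from distances between instances alone, which is what "learning by observation" means here.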

Association Rule Mining
The aim of association rule mining is to find patterns that occur in the data set frequently enough to be interesting. It looks for associations or correlations between data attributes within instances, rather than between instances.
These correlations are then expressed as rules: if X and Y appear in an instance, then Z also appears.
Most algorithms are extensions of a single base algorithm known as 'A Priori', although a few others also exist.
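
One common way to judge whether a rule such as "if X and Y then Z" occurs frequently enough to be interesting is to compute its support and confidence. These two measures are standard in the field but are not named on this slide, so treat them as an assumption here. The sketch evaluates a single rule and does not implement the 'A Priori' algorithm itself; the baskets are hypothetical.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


if __name__ == "__main__":
    # Hypothetical shopping baskets, one set of items per transaction.
    baskets = [
        {"bread", "milk", "butter"},
        {"bread", "milk", "butter"},
        {"bread", "milk"},
        {"milk", "eggs"},
        {"bread", "butter"},
    ]
    rule_from, rule_to = {"bread", "milk"}, {"butter"}
    print("support:", support(rule_from | rule_to, baskets))
    print("confidence:", confidence(rule_from, rule_to, baskets))
```

Note that the rule relates items within the same basket, matching the point above that associations are found within instances rather than between them.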

Why?
That all sounds ... complicated. Why should I learn about Data Mining?
What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps?
Isn't it expensive? It sounds like it takes a lot of skill, programming, computational time and storage space. Where's the benefit?
Data Mining isn't just a cute academic exercise; it has very profitable real world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis.

Why Data Mining? Some general reasons
• We are data rich but knowledge poor
• Computing is affordable
  - Storage, CPU, networking
• Data is too large to analyse: Very Large Databases (VLDB)
  - Dimensionality (size)
  - Distributed (location spread)
  - Heterogeneous (different types of data)
• Traditional techniques are infeasible
  - Statistics, databases
• Competitive pressure in business enterprises
  - Customer profiling (need to know who is a good customer)
  - Business to Business (B2B: being “old” is not profitable)

Data is Everywhere!
• Relational databases: a commodity of every enterprise
• Huge data warehouses are under construction
• POS (Point of Sales): transactional DBs in terabytes
• Object, relational, distributed, heterogeneous and legacy databases
• Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases (genetic data etc.)
• Time-series data (e.g., stock trading) and temporal data
• Text (documents, emails) and multimedia databases
• WWW: a huge, hyper-linked, dynamic, global information system (XML, Web content and Web usage data)
• Crime data, terrorist data: more recent applications

The Data Explosion
The rate of data creation is accelerating each year. In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.
Mega < Giga < Tera < Peta < Exa ...
All the data in all the books in the US Library of Congress is ~136 Terabytes. So 2002 produced the equivalent of about 37,000 new Libraries of Congress.
• VLBI telescopes produce 16 Gigabytes of data every second.
• Each engine of each plane of each company produces ~1 Gigabyte of data on every transatlantic-length journey.
• Google searches 18 billion+ accessible web pages.
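
The Library of Congress comparison is a simple division, worked through below. The sketch assumes decimal (SI) prefixes, so 1 terabyte is 10^12 bytes and 1 exabyte is 10^18 bytes; the slide does not say which convention it uses.

```python
# Decimal (SI) prefixes are an assumption; the slide does not specify.
TERABYTE = 10**12
EXABYTE = 10**18

data_created_2002 = 5 * EXABYTE          # UC Berkeley estimate quoted above
library_of_congress = 136 * TERABYTE     # all the books, as quoted above

libraries = data_created_2002 / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")  # about 37,000 after rounding
```

With binary prefixes (2^40 and 2^60 bytes) the ratio comes out a little higher, around 38,500, so the rounded figure depends on the convention.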

Data Explosion Implications
As the amount of data increases, the proportion of information decreases.
As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information.
Companies need to turn stored data into profit ... otherwise why are they storing it?
Let's look at some real world examples.

Classification
The data generated by airplane engines can be used to determine when an engine needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).
Loan companies can “give you results in minutes” by classifying you as a good or a bad credit risk, based on your personal information and a large supply of previous, similar customers.
Cell phone companies can classify customers into those likely to leave, who hence need enticement, and those likely to stay regardless.

Clustering
Discover previously unknown groups of customers/items.
By finding clusters of customers, companies can then determine how best to handle each particular cluster. For example, this could be used for targeted advertising, special offers, transferring information gathered by association rule mining to other members of the cluster, and so forth.
The concept of 'similarity' is often used to determine other items that you might be interested in, e.g. 'More Like This' links.
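
A 'More Like This' feature needs some numeric notion of similarity between items. The sketch below uses Jaccard similarity over sets of keywords, which is one common choice among many and not one the slides specify. The catalogue, item names and keywords are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def more_like_this(target, catalogue, top_n=2):
    """Return the top_n other items most similar to target."""
    others = [name for name in catalogue if name != target]
    ranked = sorted(
        others,
        key=lambda name: jaccard(catalogue[target], catalogue[name]),
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical items described by keyword sets.
    catalogue = {
        "book_a": {"data", "mining", "statistics"},
        "book_b": {"data", "mining", "databases"},
        "book_c": {"cooking", "baking"},
        "book_d": {"data", "warehousing"},
    }
    print(more_like_this("book_a", catalogue))
```

The same similarity function could also serve as the distance measure inside a clustering algorithm.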

Association Rule Mining
By finding association rules from shopping baskets, supermarkets can use this information for many things, including:
• Product placement in the store
• What to put on sale
• What to create as 'joint special offers'
• What to offer the customer in terms of coupons
• What to advertise together
It shouldn't be surprising that your Tesco coupons are for things that you sometimes buy, rather than things you always or never buy.
Wal-Mart in the US records every transaction at every store: petabytes of information to sift through. (TeraData)

Data/Information/Knowledge/Wisdom
Note well that data mining applications have no wisdom. They cannot apply the knowledge that they discover appropriately.
For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart, as people will buy them anyway and thus stay in the store longer?
Data mining can help managers plan strategies for a company; it does not give them the strategies.

Summary
• What is data mining?
• KDD (knowledge discovery in databases): nontrivial extraction of implicit, previously unknown and potentially useful information
• Why do we need data mining?
  - Very large data (the data explosion)
  - Dimensionality of data
  - Heterogeneity of data
  - Technology rich
  - Traditional techniques infeasible
