DATA MINING Prof. Navneet Goyal BITS, Pilani


Evolution of Database Technology
• 1960s & earlier: Data Collection & Database Creation
  • Primitive file processing
• 1970s-early 1980s: DBMSs
  • Hierarchical & network DBS
  • RDBMS
  • Data modeling tools (ER model)
  • Indexing techniques
  • Query languages: SQL
  • User interfaces: forms & reports
  • Query processing & optimization
  • Transaction management: concurrency & recovery
  • OLTP
• Mid 1980s-present: Advanced DBS
  • Advanced data models: extended relational, object-oriented, object-relational, deductive
  • Application-oriented: spatial, temporal, multimedia
• Late 1980s-present: Data Warehousing & Data Mining
  • DW & OLAP technology
  • DM & KDD
• 1990s-present: Web-based DBS
  • XML-based databases
  • Web mining
• 2000- ...: New Generation of Integrated Information Systems

Motivation • Why study Data Mining?

Tsunami of Data

“There is a tsunami of data that is crashing onto the beaches of the civilized world. This is a tidal wave of unrelated, growing data formed in bits and bytes, coming in an unorganized, uncontrolled, incoherent cacophony of foam. It's filled with flotsam and jetsam. It's filled with the sticks and bones and shells of inanimate and animate life. None of it is easily related, none of it comes with any organizational methodology. ...The tsunami is a wall of data -- data produced at greater and greater speed, greater and greater amounts to store in memory, amounts that double, it seems, with each sunset. On tape, on disks, on paper, sent by streams of light. Faster and faster, more and more and more.”
Richard Saul Wurman, Information Architects

Tsunami of Data
• In 2005, mankind created 150 exabytes of data
• In 2010, it will create 1200 exabytes*
* 2008 study by International Data Corp. (IDC)

Tsunamis of Data
• 30 TB/night: Large Synoptic Survey (LSS) Telescope (2014)
• 15 PB/year: CERN’s LHC (May 2008)
• 1 PB over 3 years: EOS (Earth Observing System) data (2001)
• Global Cloud Resolving Model (GCRM) @CSU
  • 2 km, 100 levels, hourly data: ~4 TB / simulated hour, ~100 TB / simulated day, ~35 PB / simulated year
  • 4 km, 100 levels, hourly data: ~1 TB / simulated hour, ~24 TB / simulated day, ~9 PB / simulated year

Telecom data ( 4.6 bn mobile subscribers) There are 3 Billion Telephone Calls in US each day, 30 Billion emails daily, 1 Billion SMS, IMs. IP Network Traffic: up to 1 Billion packets per hour per router. Each ISP has many (hundreds) routers! Weblog data (160 mn websites) Tsunami of Data

Tsunami of Data
• No. of pics on Facebook: 15 bn unique photos, 60 bn photos stored (4 sizes)
• Imageshack (20 bn)
• Photobucket (7.2 bn)
• Flickr (3.4 bn)
• Multiply (3 bn)

Recent Articles
• The Data Deluge, The Economist, 25th Feb. 2010
• The Data Singularity is here!, Dataspora Blog, 8th Mar. 2010
• The Data Singularity Part II: Human-sizing big data, Dataspora Blog, 27th May 2010

My definition of Data Mining
“Data Mining is a family of techniques that transforms raw data into actionable information/knowledge”


Motivation
• Data Mining has two perspectives:
  • Data (algorithms)
  • Domain (applications)
• One person having both these perspectives: very unlikely!
• Domain experts should know what is possible with Data Mining
• Data miners seek problems from domain experts
• Modeling perspective: requires involvement of both data mining & domain experts

Possibilities
• Intrusion Detection Systems
• Spam mail filtering
• Data Recovery
• Web personalization
• Adaptive Websites
• Information Retrieval
• Data Cleaning

Possibilities
• Agriculture
  • Precision farming
  • Predicting crop yield
• Terrorism Prevention
• Retail CRM
• Fraud detection
  • Tax cheats
  • Credit card abuse
• Predicting TRPs
• Bioinformatics
• Health Care
• Civil Engineering*

What is NOT Data Mining?
• “Data mining” was originally a statistician's term for overusing data to draw invalid inferences
• Bonferroni's theorem warns us that if there are too many possible conclusions to draw, some will appear true for purely statistical reasons, with no physical validity
• Famous example: J. B. Rhine, a “parapsychologist” at Duke in the 1950s, tested students for “extrasensory perception” by asking them to guess 10 cards, red or black. He found that about 1/1000 of them guessed all 10 and, instead of realizing that this is what you'd expect from random guessing (the chance of 10 correct guesses is 1/2^10 = 1/1024), declared them to have ESP. When he retested them, he found they did no better than average. His conclusion: “telling people they have ESP causes them to lose it”
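The arithmetic behind the ESP example can be checked directly. The sketch below assumes each guess is an independent 50/50 event; the group size of 1,000 students is a hypothetical figure chosen for illustration, not one reported in the slides.

```python
from fractions import Fraction

# Probability of guessing all 10 red/black cards correctly by pure chance,
# assuming each guess is an independent 50/50 event.
p_all_correct = Fraction(1, 2) ** 10

# Hypothetical group size (an assumption for illustration only).
n_students = 1000

# Expected number of "perfect" guessers under random guessing.
expected_perfect = n_students * p_all_correct

# Probability that at least one student in the group gets a perfect score.
p_at_least_one = 1 - (1 - p_all_correct) ** n_students

print(p_all_correct)            # 1/1024
print(float(expected_perfect))  # roughly 0.98
print(float(p_at_least_one))    # roughly 0.62
```

Under these assumptions, a perfect score somewhere in a large enough group is the expected outcome of chance, which is the point of Bonferroni's warning.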

What is Data Mining?
• Discovery of useful summaries of data (Ullman)
• Extracting or “mining” knowledge from large amounts of data
• The efficient discovery of previously unknown patterns in large databases
• Technology which predicts future trends based on historical data
• It helps businesses make proactive and knowledge-driven decisions
• Data Mining vs. KDD
• Is the name “Data Mining” a misnomer?

Data Mining
Data mining is ready for application in the business & scientific community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Data Mining Applications
Some examples of “successes”:
1. Decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. “Diapers and beer.” The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.
4. Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining has become much more important now that the human genome has been fully decoded.

Data Mining Communities
Several different communities have laid claim to DM:
1. Statistics.
2. AI, where it is called “machine learning.”
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of course, concentrating on the challenges that appear when the data is large and the computations complex. In a sense, data mining can be thought of as algorithms for executing very complex queries on non-main-memory data.


Stages of Data Mining Process
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
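As a rough illustration of the data cleansing stage, the sketch below filters out implausible readings such as the "fever = 125" example. The record layout, the field name fever_f, the Fahrenheit unit, and the plausible range are all assumptions made for this example; the slides do not specify them.

```python
# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"patient_id": 1, "fever_f": 98.6},
    {"patient_id": 2, "fever_f": 125.0},  # bogus reading, like the slide's example
    {"patient_id": 3, "fever_f": 101.2},
]

# Assumed plausible range for body temperature in degrees Fahrenheit.
PLAUSIBLE_RANGE = (90.0, 110.0)

def is_plausible(record):
    """Return True if the temperature lies inside the assumed plausible range."""
    low, high = PLAUSIBLE_RANGE
    return low <= record["fever_f"] <= high

# Keep plausible records; set the rest aside for review rather than silently dropping them.
clean = [r for r in records if is_plausible(r)]
rejected = [r for r in records if not is_plausible(r)]

print([r["patient_id"] for r in clean])     # [1, 3]
print([r["patient_id"] for r in rejected])  # [2]
```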

Data Mining
• Many different algorithms for performing many different tasks
• DM algorithms can be characterized as consisting of 3 parts:
  • Model
  • Preference
  • Search
• Model could be:
  • Predictive
  • Descriptive


Predictive Model
• Makes predictions about values of data using known results from different data
• Example: credit card company
  • Every purchase is placed in 1 of 4 classes:
    • Authorize
    • Ask for further identification before authorizing
    • Do not authorize
    • Do not authorize but contact police
• Two functions of Data Mining:
  • Examine historical data to determine how the data fit into the 4 classes
  • Apply the model to each new purchase

Descriptive Model
• Identifies patterns or relationships in data
• Example: later

Two Important Terms
• Supervised Learning
  • Training data set
  • Model is told to which class each training sample belongs
  • Learning by example
  • Example: CLASSIFICATION
  • Similar to Discriminant Analysis in statistics
• Unsupervised Learning
  • Class label of the training set is not known
  • No. of classes also may not be known
  • Learning by observation
  • Example: CLUSTERING

Examples of Discovered Patterns
• Association rules
  • 98% of people who purchase diapers also buy beer
• Classification
  • People with age less than 25 and salary > 40k drive sports cars
• Similar time sequences
  • Stocks of companies A and B perform similarly
• Outlier detection
  • Residential customers of a telecom company with businesses at home

Association Rules & Frequent Itemsets
• Market-Basket Analysis
  • Grocery store: large no. of ITEMS
  • Customers fill their market baskets with a subset of the items
  • 98% of people who purchase diapers also buy beer
  • Used for shelf management
  • Used for deciding whether an item should be put on sale
• Other interesting applications
  • Basket = documents, Items = words: words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.
  • Basket = sentences, Items = documents: two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

Classification
• Customer's name, age, income_level and credit_rating are known
• Training set
• Use a classification algorithm to come up with classification rules
  • If age is between 31 & 40 and income_level = ‘High’, then credit_rating = ‘Excellent’
• New data (customer): Sachin, age = 31, income_level = ‘High’ implies credit_rating = ‘Excellent’
• Classifier accuracy?
  • Hold-out, k-fold cross validation
• Prediction vs. Classification
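The rule on this slide and a hold-out accuracy check can be sketched as follows. Only the rule itself and the customer Sachin come from the slide; the fallback label "Other" and the hand-labeled hold-out set are assumptions made for illustration.

```python
def classify(customer):
    """Apply the single classification rule from the slide.

    Assumption: the slide gives no rule for other cases, so everything
    else is labeled 'Other' here.
    """
    if 31 <= customer["age"] <= 40 and customer["income_level"] == "High":
        return "Excellent"
    return "Other"

# New data from the slide.
new_customer = {"name": "Sachin", "age": 31, "income_level": "High"}
print(classify(new_customer))  # Excellent

# Hypothetical hold-out set: (customer, true credit_rating) pairs that were
# not used to derive the rule. Labels are invented for illustration.
holdout = [
    ({"age": 35, "income_level": "High"}, "Excellent"),
    ({"age": 38, "income_level": "High"}, "Excellent"),
    ({"age": 33, "income_level": "High"}, "Other"),
    ({"age": 52, "income_level": "Low"}, "Other"),
]

# Hold-out accuracy = fraction of held-out samples the rule labels correctly.
correct = sum(classify(x) == label for x, label in holdout)
accuracy = correct / len(holdout)
print(accuracy)  # 0.75
```

k-fold cross validation repeats the same measurement k times, each time holding out a different one of k equal-sized partitions, and averages the resulting accuracies.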

Clustering
• Given points in some space, often a high-dimensional space
• Group the points into a small number of clusters
• Each cluster consists of points that are “near” in some sense
• Points in the same cluster are “similar” to each other and “dissimilar” to points in other clusters
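One way to make "near" concrete is Euclidean distance combined with k-means. The slides do not name a particular clustering algorithm, so k-means is swapped in here purely as an illustration; the points and initial centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # assumed starting centroids

clusters, centroids = kmeans(points, initial)
print(clusters[0])  # the three points near (1, 1.5)
print(clusters[1])  # the three points near (8.5, 8.5)
```

Fixed starting centroids keep the sketch deterministic; practical implementations usually pick them at random and may rerun several times.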

Clustering: Examples
• Cholera outbreak in London
• Skycat clustered 2 x 10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum.
• The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe

Association Rules
• Purchasing of one product when another product is purchased represents an AR
• Used mainly in retail stores to:
  • Assist in marketing
  • Shelf management
  • Inventory control
• Faults in telecommunication networks
• Transaction database
• Item-sets; frequent or large item-sets
• Support & confidence of an AR

Association Rules
• A rule must have some minimum user-specified confidence
  • 1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of cases
• A rule must have some minimum user-specified support
  • 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
• AR X => Y holds with confidence T if T% of transactions in the DB that support X also support Y

Types of Association Rules
• Boolean/Quantitative ARs: based on the type of values handled
  • Bread => Butter
  • age(X, “30...39”) & income(X, “42K...48K”) => buys(X, Projection TV)
• Single/Multi-Dimensional ARs: based on the dimensions of data involved
  • buys(X, Bread) => buys(X, Butter)
• Single/Multi-Level ARs: based on the levels of abstraction involved
  • age(X, “30...39”) => buys(X, laptop)
  • age(X, “30...39”) => buys(X, computer)

Example
• Transaction database
• For minimum support = 50% and minimum confidence = 50%, we have the following rules:
  • 1 => 3 with 50% support and 66% confidence
  • 3 => 1 with 50% support and 100% confidence

Support & Confidence
• I = set of all items
• D = transaction database
• AR A => B has support s if s is the percentage of transactions in D that contain A ∪ B (that is, every item of both A and B):
  s(A => B) = P(A ∪ B)
• AR A => B has confidence c in D if c is the percentage of transactions in D containing A that also contain B:
  c(A => B) = P(B | A) = P(A ∪ B) / P(A)
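These two definitions translate almost directly into code. The sketch below uses a hypothetical four-transaction database, chosen because it reproduces the figures quoted in the Example slide (50% support for both rules, about 66% and 100% confidence); the slide's own transaction table is not available in this text.

```python
# Hypothetical transaction database (an assumption: the slide's own table is
# not reproduced in the text). Items are integer IDs.
transactions = [
    {1, 2, 3},
    {1, 3},
    {1, 4},
    {2, 5, 6},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({1, 3}, transactions))       # 0.5
print(confidence({1}, {3}, transactions))  # 2/3, about 0.667
print(confidence({3}, {1}, transactions))  # 1.0
```

Both rules share the same support because support depends only on the combined itemset {1, 3}; confidence differs because the antecedents {1} and {3} occur in different numbers of transactions.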

Mining Association Rules
2-step process:
• Find all frequent itemsets, i.e., all itemsets satisfying min_sup
• Generate strong ARs from the frequent itemsets, i.e., ARs satisfying min_sup & min_conf
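The two steps can be sketched with a deliberately naive implementation that enumerates every possible itemset. This brute-force search is an assumption made to keep the example short and is only feasible for tiny item sets; the transaction database is hypothetical.

```python
from itertools import combinations

# Hypothetical transaction database, for illustration only.
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
MIN_SUP, MIN_CONF = 0.5, 0.5  # assumed thresholds (50% each)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Step 1: find all frequent itemsets (brute force over every candidate).
items = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUP
]

# Step 2: generate strong rules A => B from each frequent itemset with 2+ items.
# The itemset already satisfies min_sup, so only confidence is checked here.
rules = []
for itemset in frequent:
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            a = frozenset(antecedent)
            conf = support(itemset) / support(a)
            if conf >= MIN_CONF:
                rules.append((set(a), set(itemset - a), support(itemset), conf))

for a, b, sup, conf in rules:
    print(f"{a} => {b}: support={sup:.2f}, confidence={conf:.2f}")
```

Step 1 dominates the cost because the number of candidate itemsets grows exponentially with the number of items, which is why dedicated frequent-itemset algorithms exist.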

Frequent Itemsets (FIs)
Algorithms for finding FIs:
• Apriori
• Sampling
• Partitioning

Apriori Algorithm (Boolean ARs)
• Candidate generation: level-wise search
  • Frequent 1-itemsets (L1) are found
  • Frequent 2-itemsets (L2) are found, and so on...
  • Until no more frequent k-itemsets (Lk) can be found
  • Finding each Lk requires one pass over the database
• Apriori property
  • “All nonempty subsets of a FI must also be frequent”
  • P(I) < min_sup => P(I ∪ A) < min_sup, where A is any item
  • “Any subset of a FI must be frequent”
• Anti-monotone property
  • “If a set cannot pass a test, all its supersets will fail the test as well”
  • The property is monotonic in the context of failing a test
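The level-wise search and the pruning that the Apriori property allows can be sketched as below. This is a minimal illustration rather than an optimized implementation: the function name apriori, its signature, and the sample database are assumptions, and support is recounted per candidate instead of in a single database pass.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search for frequent itemsets, returning {itemset: support}."""
    n = len(transactions)

    def frequent_among(candidates):
        # Count the support of every candidate at this level. A real
        # implementation would do this in one pass over the database;
        # this sketch rescans per candidate for brevity.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}

    # L1: frequent 1-itemsets.
    level = frequent_among({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate that has an
        # infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_among(candidates)
        result.update(level)
        k += 1
    return result

# Hypothetical transaction database, for illustration only.
db = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
frequent = apriori(db, min_sup=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With this database the search stops after level 2: the only frequent 2-itemset is {1, 3}, so no 3-item candidate can be formed.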
