
COP5725 Advanced Database Systems


Presentation Transcript


  1. COP5725 Advanced Database Systems: Data Mining (Spring 2014)

  2. Why Data Mining? • The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube • We are drowning in data, but starving for knowledge!

  3. What is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data • Data mining: a misnomer? • Alternative names • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. • Watch out: is everything “data mining”? • Simple search and query processing • (Deductive) expert systems

  4. Knowledge Discovery (KDD) Process • This is a view from typical database systems and data warehousing communities • Data mining plays an essential role in the knowledge discovery process • Pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

  5. KDD Process: A Typical View from ML and Statistics • This is a view from typical machine learning and statistics communities • Pipeline: Input Data → Data Pre-Processing → Data Mining → Post-Processing → Pattern / Information / Knowledge • Data Pre-Processing: data integration, normalization, feature selection, dimension reduction • Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, … • Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

  6. Data Mining in Business Intelligence • Layers, bottom to top, with increasing potential to support business decisions: • Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA) • Data Preprocessing / Integration, Data Warehouses (DBA) • Data Exploration: statistical summary, querying, and reporting (Data Analyst) • Data Mining: information discovery (Data Analyst) • Data Presentation: visualization techniques (Business Analyst) • Decision Making (End User)

  7. Data Mining: Confluence of Multiple Disciplines • Database technology, machine learning, pattern recognition, statistics, visualization, algorithms, high-performance computing, and applications all feed into data mining

  8. Data Mining: On What Kinds of Data? • Database-oriented data sets and applications • Relational databases, data warehouses, transactional databases • Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data • Structured data, graphs and social networks • Spatial data and spatiotemporal data • Multimedia databases • Text databases • The World-Wide Web

  9. Association and Correlation Analysis • Frequent pattern (or frequent itemset) mining • What items are frequently purchased together in your Walmart shopping cart? • Association, correlation vs. causality • E.g., Diaper → Beer [0.5%, 75%] (support, confidence) • Are strongly associated items also strongly correlated? • How to mine such patterns and rules efficiently in large datasets? • How to use such patterns for classification, clustering, and other applications?

  10. Example

  11. Classification • Classification and label prediction • Construct models (functions) based on some training examples • Describe and distinguish classes or concepts for future prediction • E.g., classify countries based on (climate), or classify cars based on (gas mileage) • Predict some unknown class labels • Typical methods • Decision trees, naïve Bayesian classification, SVM, neural networks, rule-based classification, pattern-based classification, logistic regression, … • Typical applications: • Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

  12. Example

  13. Clustering & Outlier Analysis • Unsupervised learning (i.e., Class label is unknown) • Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns • Principle: Maximizing intra-class similarity & minimizing interclass similarity • Outlier Analysis • A data object that does not comply with the general behavior of the data • Noise or exception? ― One person’s garbage could be another person’s treasure

  14. Example

  15. Structure and Network Analysis • Graph mining • Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments) • Network analysis • Social networks: actors (objects, nodes) and relationships (edges) • e.g., author networks in CS, terrorist networks • Multiple heterogeneous networks • A person could be in multiple information networks: friends, family, classmates, … • Links carry a lot of semantic information: link mining • Web mining • The Web is a big information network: from PageRank to Google • Web community discovery, opinion mining, usage mining, …

  16. Example

  17. Top 10 Data Mining Algorithms • C4.5: Decision-tree based classification • K-Means: Clustering • SVM: Classification and regression • Apriori: Frequent pattern mining • EM: MLE/MAP estimation, parameter estimation • PageRank: Link analysis and ranking • AdaBoost: Classification • kNN: Classification and regression • Naive Bayes: Classification • CART: Classification and regression

  18. What Is Frequent Pattern Analysis? • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • Motivation: Finding inherent regularities in data • What products were often purchased together?— Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

  19. Basic Concepts: Frequent Patterns • Itemset: a set of one or more items • k-itemset: X = {x1, …, xk} • (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X • (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X’s support is no less than a minsup threshold • (Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)

  20. Basic Concepts: Association Rules • Find all the rules X → Y with minimum support and confidence • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction having X also contains Y • Example (let minsup = 50%, minconf = 50%):

      Tid   Items bought
      10    Beer, Nuts, Diaper
      20    Beer, Coffee, Diaper
      30    Beer, Diaper, Eggs
      40    Nuts, Eggs, Milk
      50    Nuts, Coffee, Diaper, Eggs, Milk

  • Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 • Association rules (many more exist): Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%)
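
  The support and confidence figures above can be checked directly. Below is a minimal Python sketch (not from the slides; the helper names support() and confidence() are our own) that recomputes them for the five transactions:

      # Recompute support/confidence for the slide's five transactions (illustrative sketch).
      transactions = [
          {"Beer", "Nuts", "Diaper"},
          {"Beer", "Coffee", "Diaper"},
          {"Beer", "Diaper", "Eggs"},
          {"Nuts", "Eggs", "Milk"},
          {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
      ]

      def support(itemset):
          # Relative support: fraction of transactions containing every item of the itemset.
          return sum(itemset <= t for t in transactions) / len(transactions)

      def confidence(lhs, rhs):
          # Conditional probability that a transaction containing lhs also contains rhs.
          return support(lhs | rhs) / support(lhs)

      print(support({"Beer", "Diaper"}))       # 0.6  -> 60% support
      print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer -> Diaper has 100% confidence
      print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper -> Beer has 75% confidence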

  21. Closed Patterns and Max-Patterns • A long pattern contains a combinatorial number of sub-patterns • e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! • Solution: mine closed patterns and max-patterns instead • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X • Closed patterns are a lossless compression of frequent patterns • Reduces the number of patterns and rules

  22. Closed Patterns and Max-Patterns • Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1 • What is the set of closed itemsets? • <a1, …, a100>: 1 • <a1, …, a50>: 2 • What is the set of max-patterns? • <a1, …, a100>: 1 • What is the set of all frequent patterns? • Every non-empty subset of {a1, …, a100}: 2^100 − 1 patterns, far too many to enumerate!
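
  A scaled-down version of this exercise can be checked by brute force. The sketch below (illustrative only; a 4-item/2-item database stands in for the 100-item/50-item one) enumerates all frequent itemsets and filters the closed and max patterns:

      # Toy version of the exercise: DB = {<a,b,c,d>, <a,b>}, min_sup = 1.
      from itertools import combinations

      db = [frozenset("abcd"), frozenset("ab")]     # stand-ins for <a1..a100> and <a1..a50>
      items = sorted(set().union(*db))

      # Support of every non-empty itemset (feasible only for tiny examples).
      freq = {frozenset(c): sum(frozenset(c) <= t for t in db)
              for r in range(1, len(items) + 1) for c in combinations(items, r)}
      freq = {x: s for x, s in freq.items() if s >= 1}          # keep itemsets with sup >= 1

      closed = [x for x in freq if not any(x < y and freq[y] == freq[x] for y in freq)]
      maximal = [x for x in freq if not any(x < y for y in freq)]

      print(sorted(map(sorted, closed)))    # [['a', 'b'], ['a', 'b', 'c', 'd']]
      print(sorted(map(sorted, maximal)))   # [['a', 'b', 'c', 'd']]
      print(len(freq))                      # 15 = 2^4 - 1 frequent patterns in total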

  23. Mining Frequent Itemsets • The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! • A Candidate Generation & Test Approach • Initially, scan DB once to get frequent 1-itemset • Generate length (k+1) candidate itemsets from length k frequent itemsets • Test the candidates against DB • Terminate when no frequent or candidate set can be generated

  24. The Apriori Algorithm: An Example • supmin = 2 • (Figure) The example steps through the database: the 1st scan produces candidate 1-itemsets C1 and frequent 1-itemsets L1; C2 is generated and counted in the 2nd scan to give L2; C3 is counted in the 3rd scan to give L3

  25. The Apriori Algorithm • Ck: candidate itemsets of size k • Lk: frequent itemsets of size k

      L1 = {frequent 1-itemsets};
      for (k = 1; Lk ≠ ∅; k++) do begin
          Ck+1 = candidates generated from Lk;        // candidate generation
          for each transaction t in database do       // frequency counting
              increment the count of all candidates in Ck+1 that are contained in t
          Lk+1 = candidates in Ck+1 with ≥ min_support
      end
      return ∪k Lk;
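
  The pseudocode above maps almost line-for-line onto Python. The following is a rough, illustrative implementation (function and variable names are our own, not from the course materials):

      # Apriori sketch: level-wise candidate generation and support counting.
      from itertools import combinations

      def apriori(transactions, min_support):
          """Return {frozenset: count} for every itemset with absolute support >= min_support."""
          transactions = [frozenset(t) for t in transactions]

          def count(candidates):
              counts = {c: 0 for c in candidates}
              for t in transactions:                    # frequency counting pass over the DB
                  for c in candidates:
                      if c <= t:
                          counts[c] += 1
              return {c: n for c, n in counts.items() if n >= min_support}

          level = count({frozenset([i]) for t in transactions for i in t})   # L1
          frequent = dict(level)
          k = 1
          while level:
              prev = set(level)
              # Join Lk with itself (keep only (k+1)-sets), then prune candidates
              # that contain an infrequent k-subset (downward closure).
              candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
              candidates = {c for c in candidates
                            if all(frozenset(s) in prev for s in combinations(c, k))}
              level = count(candidates)
              frequent.update(level)
              k += 1
          return frequent

      db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
            {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
            {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
      print(apriori(db, min_support=3))
      # frequent itemsets: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3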

  26. Implementation of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of candidate generation • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3 * L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}
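
  The self-join and pruning steps on this slide can be reproduced in a few lines (illustrative sketch; items are kept as sorted tuples so that two 3-itemsets are joined only when they share their first two items):

      # Candidate generation for the slide's example: L3 -> C4.
      from itertools import combinations

      L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
      L3set = set(L3)

      # Self-join: merge two 3-itemsets that agree on their first 2 items.
      joined = {a + b[-1:] for a in L3 for b in L3 if a[:2] == b[:2] and a[2] < b[2]}
      # joined == {('a','b','c','d'), ('a','c','d','e')}, i.e. abcd and acde

      # Prune: drop any candidate that has a 3-subset not in L3.
      C4 = [c for c in joined if all(s in L3set for s in combinations(c, 3))]
      print(C4)   # [('a', 'b', 'c', 'd')]  -- acde is removed because ade is not in L3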

  27. Further Improvement of the Apriori Method • Major computational challenges • Multiple scans of transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink number of candidates • Facilitate support counting of candidates

  28. Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Scan 1: partition the database and find local frequent patterns in each partition • Scan 2: consolidate global frequent patterns • Partitions: DB1 + DB2 + … + DBk = DB • If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk| (i.e., i is infrequent in every partition), then sup(i) < σ|DB|, so i cannot be globally frequent
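
  A rough two-scan sketch of this idea (illustrative only; the local miner is a brute-force helper rather than a full Apriori run):

      # Partition sketch: local frequent itemsets form a candidate superset, verified in scan 2.
      from itertools import combinations

      def all_frequent(transactions, min_count, max_size=3):
          """Brute-force local mining; good enough for a toy partition."""
          items = sorted(set().union(*transactions))
          out = set()
          for r in range(1, max_size + 1):
              for c in combinations(items, r):
                  if sum(set(c) <= t for t in transactions) >= min_count:
                      out.add(frozenset(c))
          return out

      db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
            {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
            {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
      sigma = 0.6                                   # relative minimum support

      # Scan 1: mine each partition with the threshold scaled to the partition's size.
      candidates = set()
      for part in [db[:3], db[3:]]:
          candidates |= all_frequent(part, min_count=sigma * len(part))

      # Scan 2: verify every candidate against the whole database.
      globally_frequent = {c for c in candidates if sum(c <= t for t in db) >= sigma * len(db)}
      print(globally_frequent)   # Beer, Nuts, Diaper, Eggs, and {Beer, Diaper}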

  29. Finding Similar Patterns • Many data mining problems can be expressed as finding “similar” patterns: • Web pages with similar words, e.g., for classification by topic • NetFlix users with similar tastes in movies, for recommendation systems. • Dual: movies with similar sets of fans • Images of related things • The best techniques depend on whether you are looking for items that are somewhat similar • Shingling & Minhashing • Or very similar • Locality sensitive hashing

  30. Example • Problem: comparing documents • Goal: common text, not common topic • Special case: identical documents, or one document contained character-by-character in another • General case: many small pieces of one doc appear out of order in another • Applications: • Mirror sites, or approximate mirrors: Don’t want to show both in a search • Plagiarism, including large quotations • Similar news articles at many news sites: cluster articles by “same story.”

  31. Three Essential Techniques for Similar Pattern Discovery • Shingling: convert documents, emails, etc., to sets • Minhashing: convert large sets to short signatures, while preserving similarity • Locality-sensitive hashing: focus on pairs of signatures likely to be similar • Pipeline: Document → Shingling → the set of strings of length k that appear in the document → Minhashing → signatures (short integer vectors that represent the sets and reflect their similarity) → Locality-sensitive hashing → candidate pairs (the pairs of signatures that we need to test for similarity)

  32. Shingles • A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document • Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca} • Option: regard shingles as a bag, and count ab twice • Represent a doc by its set of k-shingles • Assumption • Documents that have lots of shingles in common have similar text, even if the text appears in a different order • You must pick k large enough, or most documents will have most shingles • k = 5 is OK for short documents; k = 10 is better for long documents
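
  A one-function sketch of character shingling for the example above (illustrative):

      # k-shingling: all length-k character substrings of a document.
      def shingles(doc, k):
          return {doc[i:i + k] for i in range(len(doc) - k + 1)}

      print(shingles("abcab", 2))   # {'ab', 'bc', 'ca'}  (a set, so 'ab' is kept only once)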

  33. Jaccard Similarity • The Jaccard similarity of two sets is the size of their intersection divided by the size of their union • Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| • Boolean matrices • Rows = elements of the universal set • Columns = sets • 1 in row e and column S if and only if e is a member of S • Column similarity is the Jaccard similarity of the sets of rows in which the two columns have a 1

  34. Example

      Row   C1   C2
      A     0    1
      B     1    0
      C     1    1
      D     0    0
      E     1    1
      F     0    1

  Sim(C1, C2) = 2/5 = 0.4 (rows C and E appear in both sets; rows A, B, C, E, F appear in at least one)
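
  Treating each column as the set of rows in which it has a 1, the similarity above is a two-line computation (illustrative sketch):

      # Jaccard similarity of the two columns in the matrix above.
      C1 = {"B", "C", "E"}          # rows where column C1 has a 1
      C2 = {"A", "C", "E", "F"}     # rows where column C2 has a 1
      print(len(C1 & C2) / len(C1 | C2))   # 2 / 5 = 0.4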

  35. Signatures • Problem: • When the sets are so large or so many that they cannot fit in main memory • When there are so many sets that comparing all pairs of sets takes too much time • Key idea: “hash” each column C to a small signature Sig(C), such that: • 1. Sig(C) is small enough that we can fit a signature in main memory for each column • 2. Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2)

  36. MinHashing • Imagine the rows permuted randomly • Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1 • Use several (e.g., 100) independent hash functions to create a signature • Surprising Property • The probability (over ALL permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2)
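
  A minimal minhashing sketch (illustrative only): instead of materializing random row permutations, each of the n hash functions orders the rows by a salted hash, and the signature entry is the first row of the column under that order. The fraction of positions on which two signatures agree then approximates their Jaccard similarity.

      # Minhash signatures via salted hash orderings (stand-ins for row permutations).
      import random

      def minhash_signature(column, n=100, seed=42):
          """column: set of row ids that have a 1. Returns a length-n minhash signature."""
          rng = random.Random(seed)
          salts = [rng.random() for _ in range(n)]
          # h_i(C) = the row of C that comes first under the i-th pseudo-random row order.
          return [min(column, key=lambda row: hash((salt, row))) for salt in salts]

      def signature_similarity(sig1, sig2):
          # Fraction of hash functions on which the two columns agree.
          return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

      C1 = {"B", "C", "E"}
      C2 = {"A", "C", "E", "F"}
      print(signature_similarity(minhash_signature(C1), minhash_signature(C2)))
      # close to the true Jaccard similarity of 0.4 (the estimate varies with n and the hashes)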

  37. Example (figure): an input matrix, its minhash signature matrix, and a comparison of column similarities with signature similarities

  38. Locality Sensitive Hashing • Checking all pairs is hard • While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns • General idea • Use a function f(x, y) that tells whether or not x and y are a candidate pair • For minhash matrices: hash columns to many buckets, and make elements of the same bucket candidate pairs

  39. Partition Into Bands • (Figure) The signature matrix M is divided into b bands of r rows each; each column is one signature, split across the bands

  40. Partition into Bands • Divide matrix M into b bands of r rows • For each band, hash its portion of each column to a hash table with k buckets • Make k as large as possible • Candidate column pairs are those that hash to the same bucket for ≥ 1 band • Tune b and r to catch most similar pairs, but few non-similar pairs
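
  A compact sketch of the banding step (illustrative; the column ids, the toy signatures, and the helper name candidate_pairs() are made up for the example):

      # LSH banding: split each signature into b bands of r rows, hash each band,
      # and report column pairs that collide in at least one band.
      from collections import defaultdict
      from itertools import combinations

      def candidate_pairs(signatures, b, r):
          """signatures: {column_id: list of b*r signature integers} -> set of candidate pairs."""
          candidates = set()
          for band in range(b):
              buckets = defaultdict(list)
              for col, sig in signatures.items():
                  chunk = tuple(sig[band * r:(band + 1) * r])   # this column's portion of the band
                  buckets[hash(chunk)].append(col)              # hash the band portion to a bucket
              for cols in buckets.values():                     # same bucket in >= 1 band -> candidate
                  candidates.update(combinations(sorted(cols), 2))
          return candidates

      sigs = {"c1": [1, 2, 3, 4, 5, 6], "c2": [1, 2, 3, 9, 9, 9], "c3": [7, 8, 7, 8, 7, 8]}
      print(candidate_pairs(sigs, b=2, r=3))   # {('c1', 'c2')}: they agree on the whole first band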

  41. Example • (Figure: matrix M divided into b bands of r rows, with each band hashed into buckets) • Columns 2 and 6 are probably identical • Columns 6 and 7 are surely different

  42. Example • Suppose 100,000 columns and signatures of 100 integers; the signatures then take 40 MB. We want all 80%-similar pairs • 5,000,000,000 pairs of signatures can take a while to compare • Choose 20 bands of 5 integers/band, and suppose C1, C2 are 80% similar • Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328 • Probability C1, C2 are not similar in any of the 20 bands: (1 − 0.328)^20 = 0.00035 • i.e., about 1/3000th of the 80%-similar column pairs are false negatives

  43. Parameter Setting • Probability of sharing a bucket as a function of the similarity s of two sets: • All r rows of a band are equal: s^r • Some row of a band is unequal: 1 − s^r • No bands identical: (1 − s^r)^b • At least one band identical: 1 − (1 − s^r)^b • The threshold similarity t at which this curve rises sharply is approximately t ≈ (1/b)^(1/r)

  44. Example: b = 20; r = 5 • t = (1/b)^(1/r) = (1/20)^(1/5) ≈ 0.5493
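
  The numbers on slides 42 through 44 follow directly from this formula; a quick numeric check (illustrative):

      # Banding analysis for b = 20 bands of r = 5 rows.
      b, r = 20, 5

      print(round((1 / b) ** (1 / r), 4))      # 0.5493: threshold similarity t = (1/b)^(1/r)

      def p_candidate(s):
          # Probability that two columns of similarity s share a bucket in at least one band.
          return 1 - (1 - s ** r) ** b

      print(round(0.8 ** r, 3))                # 0.328: one particular band agrees completely at s = 0.8
      print(1 - p_candidate(0.8))              # ~0.00036 (slide 42 rounds to 0.00035): an 80%-similar
                                               # pair is missed by every band (false negative)
      print(round(p_candidate(0.4), 3))        # 0.186: a 40%-similar pair can still become a candidate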
