
Mining: A Database Perspective






Presentation Transcript


  1. Mining: A Database Perspective Raghu Ramakrishnan Univ. of Wisconsin-Madison

  2. Data Mining
[Diagram: Data Mining at the intersection of ML/AI, DB, Stats, and Optimization]
• Classification: decision trees, regression, SVMs, Naïve Bayes, meta-learners, ensembles • Clustering: k-means, hierarchical methods, EM • MRDM/ILP pattern discovery: Horn rules; PRMs • Frequent item analysis: associations, sequential patterns • Time-series analysis: linear and nonlinear dynamics • Collaborative filtering • Text, multimedia mining

  3. Mining at a Crossroads • Data Mining has drawn upon ideas and people from many disciplines, and has grown rapidly. • As yet, no unifying vision of how these disciplines leverage each other. • Stats folks still do stats, ML folks still do ML, DB folks still think about large datasets, and they rarely talk to one another. • What are the applications that will pay the piper?

  4. About this Talk
• A database perspective on data mining and its relationship to data management • How can database-oriented thinking influence research and practice in data mining? • What are the difficult problems with big payoffs?
• The EDAM project at Wisconsin • Analyzing streams of mass spectra and other spatio-temporal data • Joint work with researchers in atmospheric aerosols and climatology at UW-Madison and Carleton College, funded by an NSF ITR

  5. Outline • A Database perspective • Recent extensions to relational systems • OLAP: Cube, sequence queries • Data mining support • Relational approaches to mining • Relational clustering • MRDM/ILP • The EDAM project

  6. A Database Perspective

  7. All the World’s a Table • All data is in a database. • If not, it’s not important  • Data mining is a class of analysis techniques that complements current SQL data analysis capabilities. • Data is in a DBMS for reasons that go well beyond the analysis capabilities of the DBMS, even if these are often inadequate. • And if the past is any indication, the DB vendors will try to expand SQL to support whatever DM capabilities the market will pay for—and it’s not clear that this is the right architecture.

  8. Scalability • Scalability is widely recognized as a characteristic DB concern, and the DB field provides useful techniques for dealing with scale. • BIRCH—Scalable pre-clustering that borrows ideas from B+ trees • Rainforest—Framework for scaling decision tree construction that borrows from hash joins • (There are also scalable algorithms based on EM and Bootstrapping) • However, the focus has been on one aspect of scale: • Size of training data • We also need scalability with respect to other problem dimensions: • Size of hypothesis space • Rate of data capture and analysis

  9. Queries vs. Mining • From the point of view of the user, SQL queries are one way to explore and understand the data. • But is it “data mining”? • The various data mining techniques are no more (or less) than alternatives with different capabilities. • The query framework has some ideas worth borrowing and generalizing: • Compositionality—more flexibility, more automation • Usability—domain analysts, not tool experts • Query Optimization

  10. A Different Mindset … • Sometimes, just looking at the problem from a different perspective may lead to useful reformulations: • Frequent itemsets • Relational clustering • Stream analysis • Labeling spectra • Subset mining • “What does a query mean?” vs. “How do I characterize my data?” • Hopefully, not mutually exclusive! • Can raise very different concerns • E.g., Coverage, accuracy (ML), confidence bounds (Stats) vs. query equivalence, compositionality (DB) • Combining multiple sources of information (e.g., multiple tables)

  11. Query Optimization • Driven by user’s query • Goal is to find answers to this query efficiently • Search space for optimization • Defined through equivalences to given query • Exploits compositionality! • “Goodness” metric is estimated plan cost • Contrast this with the search spaces typical in, e.g., rule discovery or attribute selection • These are data-driven, not query-driven • Search space based on hypothesis refinement • “Goodness” metric based on coverage of training set

  12. Data Management • Management • Data storage and archival • Privacy, sharing, collaboration • Focus has been on managing data; however: • Queries can be stored in the DBMS • Views, or tables defined by queries • (Ownership, access control, re-optimization, caching) • We need more support for managing analyses: • Managing analyses external to the DBMS • Provenance of data and analysis • Versioning and collaboration support • Support for ongoing analyses: Impact of data changes on analyses; monitoring; trend analysis over warehouses; deploying results into operational system

  13. Data Co-Processor Architecture
[Architecture diagram with components: Miner, Indexer, Files/Logs, DBMS, Warehouse, RAID storage; annotations: Queries/Searches, Periodic offline activity, Large R/W, Small reads]

  14. [Diagram: customized asynchronous replicas, kept in sync, serving SQL queries, updates, OLAP queries, and text queries]

  15. Recent Extensions of Relational Queries

  16. Star Schema
Fact table: Transactions(timekey, storekey, pkey, promkey, ckey, units, price)
Dimension tables: Time, Store, Customers, Products, Promotions

  17. Multidimensional Analysis
[Diagram: dimension hierarchies Year > Quarter > Month > Week > Day; Industry > Category > Product; Country > State > City; with the slice Country=“USA”]

  18. Slice and Drill-Down
[Diagram: the same dimension hierarchies, now sliced on Industry=“Industry3” and drilled down to State=“CA”]

  19. Comparison with SQL
SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.city

SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.city

SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year
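As a concrete check, the roll-up queries above can be run on a toy in-memory database. The schema and data here are invented for illustration; sqlite3 is Python's standard-library SQL interface.

```python
# Run the slide's roll-up queries against a tiny in-memory star schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE Times(timeid INTEGER PRIMARY KEY, year INTEGER);
  CREATE TABLE Locations(locid INTEGER PRIMARY KEY, city TEXT);
  CREATE TABLE Sales(timeid INTEGER, locid INTEGER, sales INTEGER);
  INSERT INTO Times VALUES (1, 2002), (2, 2003);
  INSERT INTO Locations VALUES (1, 'Madison'), (2, 'Chicago');
  INSERT INTO Sales VALUES (1, 1, 10), (1, 2, 20), (2, 1, 30);
""")

# Finest grouping: by year and city.
by_year_city = con.execute("""
  SELECT T.year, L.city, SUM(S.sales) FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.city
""").fetchall()

# Roll up away the location dimension: by year only.
by_year = con.execute("""
  SELECT T.year, SUM(S.sales) FROM Sales S, Times T
  WHERE S.timeid = T.timeid GROUP BY T.year
""").fetchall()

assert (2002, 'Madison', 10) in by_year_city
assert dict(by_year) == {2002: 30, 2003: 30}
```

Each coarser GROUP BY is a sum over groups of the finer one, which is the observation the CUBE operator exploits.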

  20. Visual Intuition: Cube roll-up to category roll-up to state Location SH SF LA Product1 Product2Product3 Product4Product5Product6 20 30 20 15 10 50 Product roll-up to week M T W Th F S S Time 50 Units of Product6 sold on Monday in LA

  21. CUBE Operator • For k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions. • CUBE pid, locid, timeid BY SUM Sales • Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form: SELECT SUM(S.sales) FROM Sales S GROUP BY grouping-list
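The 2^k roll-ups that CUBE denotes can be sketched in plain Python; the column names follow the slide, while the data values are invented for illustration.

```python
# Sketch: enumerate every GROUP BY subset of the dimensions (the CUBE).
from itertools import combinations
from collections import defaultdict

sales = [  # (pid, locid, timeid, sales)
    ("p1", "LA", "mon", 20),
    ("p1", "SF", "mon", 30),
    ("p2", "LA", "tue", 15),
]
dims = ("pid", "locid", "timeid")

def cube(rows):
    """Return {grouping-list: {group-key: SUM(sales)}} for each subset of dims."""
    result = {}
    for k in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), k):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                groups[key] += row[3]           # SUM(sales)
            result[tuple(dims[i] for i in subset)] = dict(groups)
    return result

rollups = cube(sales)
assert len(rollups) == 8          # 2^3 grouping lists for k = 3 dimensions
assert rollups[()][()] == 65      # empty grouping list = grand total
assert rollups[("pid",)][("p1",)] == 50
```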

  22. Observation • When you need to consider several related or overlapping computations • Think of how to expose this space to the user, and to get user input on what part of the space might be interesting • Marketing specialists can use OLAP interfaces to do very complex queries easily • Think of how to optimize by exploiting commonality across computations

  23. Querying Sequences • SQL-92 supports queries over relations. • A relation is a (multi) set of records. • No ordering of records in a relation! • Queries involving order are hard or impossible to express, and are typically evaluated inefficiently. • Find the weekly moving average of the DJIA. • Compute % change of each stock during ‘97, and then find stocks in the top 5% (those that changed most). • SQL:1999 supports the concept of windowing, which effectively orders tuples for query purposes.

  24. SRQL (Ramakrishnan et al., SSDBM 98) • Proposed a sequencing operator as an extension to relational algebra. Applied to a table R, with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.

  25. Example • Find the 2-day moving average of volume sold for each product:
SELECT product, day, AVG(vol) OVER 0 TO 1
FROM Sales
GROUP BY product SEQUENCE BY day
• In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences. • Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
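What such a query computes can be sketched in Python. This reads the OVER 0 TO 1 window as "this row and the next within the product's day-ordered sequence"; the data values are invented for illustration.

```python
# Sketch: GROUP BY product SEQUENCE BY day, with a windowed AVG over each sequence.
from collections import defaultdict

rows = [  # (product, day, vol)
    ("p1", 1, 10), ("p1", 2, 30), ("p1", 3, 50),
    ("p2", 1, 8),  ("p2", 2, 4),
]

def moving_avg(rows, lo=0, hi=1):
    # Group rows into one day-ordered sequence per product.
    by_product = defaultdict(list)
    for product, day, vol in sorted(rows, key=lambda r: (r[0], r[1])):
        by_product[product].append((day, vol))
    # Average each row's volume over the window [row+lo, row+hi].
    out = []
    for product, seq in by_product.items():
        for i, (day, _) in enumerate(seq):
            window = [v for _, v in seq[max(0, i + lo): i + hi + 1]]
            out.append((product, day, sum(window) / len(window)))
    return out

result = {(p, d): a for p, d, a in moving_avg(rows)}
assert result[("p1", 1)] == 20.0   # avg(10, 30)
assert result[("p1", 3)] == 50.0   # window truncated at the sequence end
```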

  26. Variants of Aggregation • We can now introduce “running sum” and other cumulative aggregate functions! • OVER FIRST TO 0: This gives us “running” or “cumulative” aggregates. • RANK() is CUMULATIVE COUNT(*) • PERCENTILE() is (RANK()/COUNT(*))*100 • Elegant way to express concepts like “give me the first few answers”. SQL:1999 does all this and more (different syntax)
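A minimal sketch of these cumulative aggregates, using the RANK and PERCENTILE definitions given above; stock names and prices are invented for illustration.

```python
# Sketch: RANK() as a cumulative count over the ordered sequence, and
# PERCENTILE() as RANK()/COUNT(*) * 100.
prices = [("s1", 10.0), ("s2", 25.0), ("s3", 40.0), ("s4", 55.0)]

ranked = []
total = len(prices)  # COUNT(*)
for rank, (stock, price) in enumerate(sorted(prices, key=lambda r: r[1]), start=1):
    percentile = rank / total * 100   # PERCENTILE() = (RANK()/COUNT(*)) * 100
    ranked.append((stock, rank, percentile))

assert ranked[0] == ("s1", 1, 25.0)
assert ranked[-1] == ("s4", 4, 100.0)
```

Taking only rows whose rank is below a cutoff gives the "first few answers" idiom the slide mentions.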

  27. Observation • Still much more limited than time-series analysis and mining techniques available elsewhere • No support for streams

  28. DBMS Support for Managing Mining Models

  29. Why Integrate?
[Diagram: data is extracted and copied out of the DBMS into a separate mining tool that produces models, raising the question: consistency?]

  30. Integration Objectives
For analysts (users): • Avoid isolation of querying from mining • Difficult to do “ad-hoc” mining otherwise • Provide simple programming approach to creating and using DM models
For DM vendors: • Make it possible to add new models • Make it possible to add new, scalable algorithms

  31. DM Concepts to Support • Representation of input (cases) • Representation of models • Specification of training step • Specification of prediction step Should be independent of specific algorithms

  32. Types of Columns • Keys: Columns that uniquely identify a case • Attributes: Columns that describe a case • Value: A state associated with the attribute in a specific case (single case!) • Attribute Property: Columns that describe an attribute • Unique for a specific attribute value (TV is always an appliance) • Attribute Modifier: Columns that represent additional “meta” information for an attribute • Weight of a case, certainty of prediction

  33. Representing a DMM • Specifying a Model • Columns it should predict • Algorithm to use • Special parameters • Model is represented as a nested table • Specification = Create table • Training = Inserting data into the table • Predicting = Querying the table

  34. Training a DMM • Training a DMM requires passing it “known” cases • Use an INSERT INTO in order to “insert” the data to the DMM • The DMM will usually not retain the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model) • INSERT [INTO] <mining model name> [(columns list)] <source data query>
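The insert-to-train interface described above can be sketched with a toy stand-in: inserting cases updates model content (here, a simple co-occurrence counter) and the cases themselves are not retained. The class, column names, and counting "model" are all illustrative, not the actual DMM machinery.

```python
# Hedged sketch of the "model is a table" interface: INSERT trains, SELECT predicts.
from collections import defaultdict

class MiningModel:
    def __init__(self, predict_col):
        self.predict_col = predict_col
        # Model content: for each (column, value), counts of target values seen.
        self.counts = defaultdict(lambda: defaultdict(int))

    def insert(self, cases):
        """INSERT INTO <model>: analyze the cases, keep only model content."""
        for case in cases:
            target = case[self.predict_col]
            for col, val in case.items():
                if col != self.predict_col:
                    self.counts[(col, val)][target] += 1
        # Note: the inserted rows are deliberately not stored.

    def predict(self, case):
        """Query the model: most supported target among matching evidence."""
        votes = defaultdict(int)
        for col, val in case.items():
            for target, n in self.counts[(col, val)].items():
                votes[target] += n
        return max(votes, key=votes.get) if votes else None

model = MiningModel("hair_color")
model.insert([
    {"gender": "F", "age": 30, "hair_color": "brown"},
    {"gender": "F", "age": 30, "hair_color": "brown"},
    {"gender": "M", "age": 40, "hair_color": "black"},
])
assert model.predict({"gender": "F", "age": 30}) == "brown"
```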

  35. Making Predictions
SELECT [Customers].[ID], MyDMM.[Hair Color], PredictProbability(MyDMM.[Hair Color])
FROM MyDMM PREDICTION JOIN [Customers]
ON MyDMM.[Gender] = [Customers].[Gender]
AND MyDMM.[Age] = [Customers].[Age]

  36. Research Directions: MRDM/ILP

  37. MRDM Accomplishments • ILP origins, hypothesis discovery • Classification • Clustering • Frequent itemsets • Equational discovery • Subgroup discovery • Extensions of Bayesian nets to multiple relations via key-foreign key traversals

  38. Issues • Can we indeed capture the semantics exactly for each of these classes of patterns/models? • Taking into account the details of the underlying evaluation algorithm! • Is the performance comparable to specialized algorithms? Is it acceptable for a broad range of applications?

  39. Positives • Impressive! Quite a range of patterns/models are shown to be expressible in this formalism • Importantly, the added expressiveness allows new kinds of patterns to be naturally formulated by a user • There is a (more or less) common computational structure consisting of • Space of patterns to search • Measure of support for a pattern • Enumeration and pruning strategy over search space What tangible benefits can we derive from this generality?

  40. Challenges, Opportunities • If ILP notation is roughly analogous to relational calculus, what is the appropriate algebra? • Equivalences, compositionality • Cost-based optimization to find “optimal” evaluation plans • What kind of user input/domain knowledge can be used to focus computation, or help with optimization?

  41. Research Directions: Relational Clustering

  42. Problem Statement
[Diagram: a table T with attributes A, B, C and values a1–a4, b1–b3, c1–c4]
• Goal: Discover clusters of attribute-values • Data: A table T with attributes drawn from domains D1,…,Dn • Thus, a tuple of T consists of a value from each domain, e.g., (a1,b2,c1) • T could be an arbitrary view over several tables! Note: We expect sizes of D1,…,Dn to be small

  43. STIRR (Gibson, Kleinberg, Raghavan, VLDB 98) • Intuition: Want to detect that “Honda and Toyota are related because unusually high numbers of both were sold in August.” • If we also find that many Hondas and Nissans are sold in Sept, and many dealers sell both Hondas and Acuras, this leads to a cluster best described as “late-summer sales of Japanese cars” • Approach: Techniques for spectral graph partitioning, generalized to hypergraphs. • Attribute values as weighted vertices in a graph; edges based on co-occurrence. Weights propagate along links, leading to a non-linear dynamical system.
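A minimal sketch of STIRR-style weight propagation, assuming a sum combiner and L2 normalization (the paper explores several combiner choices); the data values are invented for illustration.

```python
# Sketch: attribute values are vertices, each tuple is a hyperedge;
# weights propagate along edges until the system stabilizes.
import math

tuples = [  # (make, month) co-occurrences
    ("Honda",  "Aug"), ("Toyota", "Aug"),
    ("Honda",  "Sep"), ("Nissan", "Sep"),
    ("Ford",   "Jan"),
]

# Initialize every attribute value with unit weight.
weights = {v: 1.0 for t in tuples for v in t}

for _ in range(10):
    new = {v: 0.0 for v in weights}
    for t in tuples:
        for v in t:
            # Sum combiner: v receives the weights of the other values in the tuple.
            new[v] += sum(weights[u] for u in t if u != v)
    # L2-normalize so the iteration does not diverge.
    norm = math.sqrt(sum(w * w for w in new.values()))
    weights = {v: w / norm for v, w in new.items()}

# Values in the densely co-occurring region (Japanese makes, late summer)
# end up with higher weight than the isolated Ford/Jan pair.
assert weights["Honda"] > weights["Ford"]
```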

  44. CACTUS (Ganti, Gehrke, Ramakrishnan, KDD 99) • Same motivation, different problem formulation and approach • Precise definition of cluster, deterministic algorithm that computes all clusters • Very efficient, scalable, SQL-based algorithm

  45. Similarity Between Attributes
[Diagram: table with attributes A, B, C and values a1–a4, b1–b4, c1–c4; one value pair is labeled “not strongly connected”]
• “Similarity” between a1 and b1: support(a1,b1) = number of tuples containing (a1,b1) • a1 and b1 are strongly connected if support(a1,b1) is higher than expected • {a1,a2,a3,a4} and {b1,b2} are strongly connected if all pairs (ai,bj) are strongly connected
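The support and strong-connection tests can be sketched as follows. The threshold alpha and the uniform-independence expectation are illustrative simplifications of CACTUS's actual test, and the toy table is invented.

```python
# Sketch: support(a, b) counts co-occurrences; a pair is "strongly connected"
# when its support exceeds alpha times the support expected under independence.
T = [  # tuples over attributes (A, B)
    ("a1", "b1"), ("a1", "b1"), ("a1", "b2"),
    ("a2", "b1"), ("a2", "b2"), ("a3", "b3"),
]

def support(pairs, a, b):
    return sum(1 for x, y in pairs if (x, y) == (a, b))

def strongly_connected(pairs, a, b, alpha=2.0):
    n = len(pairs)
    dom_a = len({x for x, _ in pairs})   # |D_A|
    dom_b = len({y for _, y in pairs})   # |D_B|
    expected = n / (dom_a * dom_b)       # expectation under uniform independence
    return support(pairs, a, b) > alpha * expected

assert support(T, "a1", "b1") == 2
assert strongly_connected(T, "a1", "b1")
assert not strongly_connected(T, "a3", "b1")
```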

  46. Similarity Within an Attribute
[Diagram: table with attributes A, B, C]
• simA(b1,b2): Number of values of A which are strongly connected with both b1 and b2

  47. Cluster Definition • Region: A cross-product of sets of attribute values: C1 x … x Cn • C=C1 x … x Cn is a cluster iff • Ci and Cj are strongly connected, for all i,j • Ci is maximal, for all i • Support(C) > expected Ci: cluster projection of C on Ai

  48. The CACTUS Algorithm • Summarize • Inter-attribute summaries: Scan dataset • Intra-attribute summaries: Query IA summaries • Clustering phase • Compute cluster projections • Level-wise synthesis of cluster projections to form candidate clusters • Validation • Requires a scan of the dataset

  49. Inter-Attribute Summaries
[Diagram: table with attributes A, B, C]
• Supports of all strongly connected attribute value pairs from different attributes • Similar in nature to “frequent” 2-itemsets • So is the computation

  50. Intra-Attribute Summaries
[Diagram: table with attributes A, B, C]
• simA(B): Similarities through A of attribute value pairs of B
