Partitioning – A Uniform Model for Data Mining

Presentation Transcript


  1. Partitioning – A Uniform Model for Data Mining
  Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo

  2. Motivation
  • Databases and data warehouses are currently separate systems. Why?
  • Standard answer: details, details, details …
  • Our answer: a fundamental issue of representation

  3. Relations Revisited
  • R(A1, A2, …, AN): a set of tuples
  • Any choices at a fundamental level? Yes!
  • Duality between:
    • Element-based representation
    • Space-based representation

  4. Duality
  • Element-based representation: the standard representation of tuples with all their attributes
  • Space-based representation: the existence (count?) of a tuple is represented in its attribute space

  5. Similar Dualities in Physics
  • Particle picture: particles are represented by their positions
  • Field picture: particles appear as 1 values in a grid of locations
  • The field picture is the more fundamental level

  6. Space-Based Representation
  • Consider standard tuples as vectors in the space of the attribute domains
  • Represent each possible attribute combination as one bit (see the sketch below):
    • 1 if the data item is present
    • 0 if it isn't
  • Allowing counts could be useful for projections (?)
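
A minimal Python sketch of the space-based representation, assuming a toy relation; the attribute names, domains, and tuples are illustrative and not from the paper:

```python
from itertools import product

# Illustrative domains and tuples (not from the paper).
domains = {"color": ["red", "green", "blue"], "size": ["S", "M", "L"]}
tuples = [("red", "S"), ("blue", "L")]          # element-based representation

# Space-based representation: one bit per point of the attribute space,
# 1 if that attribute combination occurs in the relation, 0 otherwise.
space = {point: 0 for point in product(*domains.values())}
for t in tuples:
    space[t] = 1

# Value-based access needs no separate index: just look up the coordinates.
print(space[("red", "S")])    # 1 -> tuple is present
print(space[("green", "M")])  # 0 -> tuple is absent
```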

  7. Space-Based Representation as a Partition
  • Partitions are mutually exclusive and collectively exhaustive sets of elements
  • The space-based representation partitions the attribute space into two sets:
    • Data items present in the database (1)
    • Data items not present (0)

  8. Usefulness of the Space-Based Representation
  • No indexes needed: instant value-based access
  • Index locking becomes dimensional locking
  • Aggregation is very easy due to the value-based ordering
  • Selections become “and”s
  • What experience do we have with space-based representations?

  9. Data Cube Representation
  • One value (e.g., sales) is given in the space of the key attributes
  • Space-based with respect to the key attributes
  • Element-based with respect to the non-key attributes

  10. Properties of the Domain Space
  • Ideally the space should have a distance, a norm, etc.
  • This is especially important for data mining
  • Does that make sense for all domains?
  • Can any domain be mapped to the integers?

  11. Can All Domains Be Mapped to the Integers?
  • Simplistic answer: yes!
    • All information in a computer is saved as bits
    • Any sequence of bits can be interpreted as an integer
  • Problems:
    • The order may be irrelevant, e.g., hair color
    • The order may be wrong, e.g., the sign bit of an int
    • Even if the order is correct, the spacing may vary, e.g., float (solution in the paper: intervalization; see the sketch below)
    • Domains may be very large, e.g., movies
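
The slide only names intervalization; the sketch below shows one plausible reading of it, equal-width binning of a float domain into a small integer range. The bin count and the equal-width scheme are assumptions, not the paper's actual method:

```python
# Hedged sketch: map floats onto small integers ("intervals") so that
# integer order and spacing roughly reflect the underlying values.
# Equal-width bins and bins=8 are assumptions, not the paper's scheme.
def intervalize(values, bins=8):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0      # avoid a zero width for constant data
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(intervalize([0.1, 0.15, 3.2, 9.9]))   # [0, 0, 2, 7]
```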

  12. Categorical Attributes (Irrelevant Order)
  • We need more than one attribute for an appropriate representation
  • Data mining solution: 1 attribute per domain value
  • Our solution: 1 attribute per bit slice
    • Values are corners of a hypercube in log(domain size) dimensions
    • Distances are given through the MAX metric (see the sketch below)
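
A minimal sketch of the bit-slice encoding of a categorical domain; the domain values and helper names are illustrative. It shows that the values sit on the corners of a hypercube in ceil(log2(domain size)) dimensions and that, under the MAX metric, any two distinct values are exactly distance 1 apart, which suits attributes whose order is irrelevant:

```python
import math

domain = ["red", "green", "blue", "brown"]        # illustrative categorical domain
dims = math.ceil(math.log2(len(domain)))          # one binary attribute per bit slice

def encode(value):
    # Corner of the hypercube: the bits of the value's position in the domain.
    i = domain.index(value)
    return tuple((i >> b) & 1 for b in reversed(range(dims)))

def max_metric(p, q):
    # MAX (Chebyshev) metric: the largest coordinate difference.
    return max(abs(a - b) for a, b in zip(p, q))

print(encode("red"), encode("brown"))               # (0, 0) (1, 1)
print(max_metric(encode("red"), encode("brown")))   # 1 -> distinct values are equidistant
```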

  13. Fundamental Partition (Space-Based Representation)
  • Number of dimensions = number of attributes
  • Number of represented points = product of all domain sizes
  • Exponential in the number of dimensions! (see the worked example below)
  • We badly need compression!
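
A tiny worked illustration of that blow-up, with made-up domain sizes:

```python
import math

# 10 attributes with 256 values each: the space-based representation covers
# every point of the attribute space, one bit per point.
domain_sizes = [256] * 10
print(math.prod(domain_sizes))   # 256**10 = 2**80, roughly 1.2e24 bits before compression
```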

  14. How Do We Handle Size?
  • The problem is exponential in the number of attributes
  • How can we reduce the number of attributes? Review normalization:
    • We can decompose a relation into a set of relations, each of which contains the entire key and one other attribute
    • This decomposition is:
      • lossless
      • dependency preserving (BCNF relations only)

  15. Compression for Non-Key Attributes
  • The fundamental partition contains only one non-zero data point in any non-key dimension
  • Represent the number by bit slices
  • Note: this works for numerical and categorical attributes
  • The original values can be regained by “and”ing
    • Example: 5 (binary 101) is bit 0 & bit 1’ & bit 2 (see the sketch below)
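
A minimal sketch of the bit-slice decomposition and of regaining a value by “and”ing, following the slide's numbering in which bit 0 is the most significant of three bits; the column values are illustrative:

```python
column = [5, 3, 5, 0, 7]    # illustrative non-key attribute values, 3 bits each
BITS = 3

# One bit vector ("slice") per bit position; slice 0 holds the most significant bits.
slices = [[(v >> (BITS - 1 - b)) & 1 for v in column] for b in range(BITS)]

# Regain the rows holding the value 5 (binary 101) by ANDing:
# bit 0 AND (NOT bit 1) AND bit 2, as on the slide.
mask = [b0 & (1 - b1) & b2 for b0, b1, b2 in zip(*slices)]
print(mask)   # [1, 0, 1, 0, 0] -> rows 0 and 2 contain the value 5
```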

  16. Concept Hierarchies
  • The bit-sliced representation has significant benefits beyond compression
  • Bit slices can be combined into concept hierarchies (see the sketch below):
    • Highest level: bit 0
    • Next level: bit 0 & bit 1
    • Next level: bit 0 & bit 1 & bit 2
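
A short sketch of the concept hierarchy implicit in the bit slices: level k groups the values by their k highest-order bits, i.e. by bit 0, then bit 0 & bit 1, and so on (same illustrative column as above):

```python
from collections import Counter

column = [5, 3, 5, 0, 7]
BITS = 3

for level in range(1, BITS + 1):
    # Keep only the `level` most significant bits of each value.
    prefixes = [v >> (BITS - level) for v in column]
    print(f"level {level}:", Counter(prefixes))
```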

  17. Compression for Key Attributes
  • Database-state-independent compression could lead to information loss (counts > 1)
  • Database-state-dependent compression:
    • A tree structure that eliminates pure subtrees => P-trees (see the sketch below)
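
A much simplified, one-dimensional sketch of that compression step: a bit vector is split recursively and any subtree whose bits are all 1 or all 0 (“pure”) is replaced by a single leaf. The P-trees in the paper work on multi-dimensional quadrants of the key space; this only illustrates the pure-subtree elimination:

```python
def ptree(bits):
    # Collapse pure subtrees into single leaves; recurse on mixed halves.
    if all(b == 1 for b in bits):
        return "pure1"
    if all(b == 0 for b in bits):
        return "pure0"
    mid = len(bits) // 2
    return [ptree(bits[:mid]), ptree(bits[mid:])]

print(ptree([1, 1, 1, 1, 1, 1, 0, 1]))
# ['pure1', ['pure1', ['pure0', 'pure1']]] -> only the impure region is expanded
```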

  18. Other Ideas
  • Compression is better if attribute values are dense within their domain
  • We could use the extent domain:
    • Good compression
    • Problems with insertion
    • Reorganization of storage
    • Index locking has to be reintroduced
    • …

  19. How Good Is the Compression So Far?
  • If all domains are “dense”, i.e. all values occur:
    • The size can easily be smaller than the original relation
  • If the non-key attributes are “sparse”:
    • Not usually a problem: good compression
    • Problems only in extreme cases, e.g., movies as attribute values!
  • If the key attributes are “sparse”:
    • Larger potential for problems, but also a large potential for benefit (see data cubes)

  20. Are Key Attributes Usually Sparse?
  • Many key attributes are dense (“structure” attributes as keys):
    • Automatically generated IDs are usually sequential
    • x and y in spatial data mining
    • Time in data streams
  • Keys in tables that represent relationships tend to be sparse (feature attributes as keys):
    • Student / course offering / grade
    • Data cubes!

  21. What Have We Gained? (Database Aspects)
  • The data simultaneously acts as an index
  • No separate index locking (unless the extent domain is used)
  • All information is saved as bit patterns
  • Easy “select”
  • Other database operations discussed in class

  22. What Have We Gained? (Feature Attribute Keys)
  • Direct mining is possible on relations with feature-attribute keys
    • E.g., student / course offering / grade
    • Rollup can be defined, etc.
  • Clustering, classification, and ARM can make use of the proximity inherent in the representation
  • The bit-wise representation provides a concept hierarchy for the non-key attributes
  • The tree structure provides a concept hierarchy for the key attributes

  23. What Have We Gained? (Structure Attribute Keys)
  • For relations with structure-attribute keys, mining requires “and”ing
    • This produces counts for the feature attributes
  • The bit-wise representation provides a concept hierarchy for the non-key attributes
  • Duality: concept hierarchies in this representation map exactly to the tree structure when the attribute is a key

  24. Mapping Concept Hierarchies: Bit Slices <-> Tree
  • P-tree: take the key attributes, e.g., x and y, and bit-interleave them:
    • x = 1 0 0 1
    • y = 1 1 0 1
    • interleaved: 1 1 0 1 0 0 1 1
  • Any two of these digits form a level in the P-tree – or a level in a concept hierarchy (see the sketch below)
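
A minimal sketch of the bit interleaving on this slide: the bits of x and y are interleaved, and each pair of interleaved bits (one x bit, one y bit) addresses one level of the P-tree or concept hierarchy:

```python
def interleave(x_bits, y_bits):
    # Alternate x and y bits, x bit first, as in the slide's example.
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out.extend([xb, yb])
    return out

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
code = interleave(x, y)
print(code)                                            # [1, 1, 0, 1, 0, 0, 1, 1]
print([tuple(code[i:i + 2]) for i in range(0, 8, 2)])  # one pair per tree level
```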

  25. How Could We Use That Duality?
  • Join with other relations and project off the key attributes (Meta P-trees)
  • Can we do that? We lose uniqueness
    • We can use 1 to represent 1 or more tuples (equivalent to relational algebra)
    • Or we can introduce counts
  • Counts can be useful for data mining
  • A need for non-duplicate-eliminating counts also exists in other applications

  26. How Do Hierarchies Benefit Us in Databases?
  • Multi-granularity locking
  • Subtrees form suitable units for storage in a block
  • Fast access! Proportional to:
    • the number of levels in the tree
    • the number of bits in the bit slices

  27. Summary
  • The space-based representation has many benefits:
    • Value-based access and storage
    • No separate index needed
    • Easy rollups
  • P-trees:
    • Follow from systematic compression
    • Benefit from concept hierarchies
