This paper discusses the implementation of a model-free classifier, LOCUS, for large and dynamic datasets. It covers motivation, parallel execution, experimental evaluation, and future directions for improving scalability and accuracy. LOCUS is compared with Nearest Neighbors and Decision Trees: it makes fast decisions on large, complex datasets, whereas Nearest Neighbors suffers from the curse of dimensionality. LOCUS follows a lazy approach based on simple SQL queries and converges to the optimal Bayes Classifier as the dataset grows. Experimental results and a parallel execution scheme are presented, together with a discussion of optimizing the method for dynamic datasets.
University of Athens ADBIS 2007 Database Implementation of a Model-Free Classifier Konstantinos Morfonios
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Introduction: Classification. Given a feature vector x = <x1, x2, …, xD>, predict its class ω = f(x), where ω ∈ {ω1, ω2}.
Introduction: Training set of tuples <x1,1, x1,2, …, x1,D, ω1>, <x2,1, x2,2, …, x2,D, ω2>, <x3,1, x3,2, …, x3,D, ω1>, <x4,1, x4,2, …, x4,D, ω1>, …; new points x = <x1, x2, …, xD> must be classified.
Two families of classifiers:
• “Lazy” (e.g., Nearest Neighbors)
• “Eager” (e.g., Decision Trees): (+) faster decisions, (-) large/complex datasets, (-) dynamic datasets, (-) dynamic models
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Motivation
• Large/complex datasets
• Dynamic datasets
• Dynamic models
⇒ Lazy (model-free), disk-based classification, in the spirit of Nearest Neighbors
Motivation: Nearest Neighbors suffers from the “curse of dimensionality”
• Not reliable [Beyer et al., ICDT 1999]
• Not indexable [Shaft et al., ICDT 2005]
⇒ LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
Motivation: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
• Category? Lazy
• Scaling? Based on simple SQL queries
• Accuracy? Converges to the optimal Bayes Classifier
• Other features? Parallelizable
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
LOCUS Example: points x = <f1, f2>, with f1 ∈ [0, 20] and f2 ∈ [0, 10], belonging to one of two classes ω1, ω2.
LOCUS: Ideally the feature space is dense, and ω(<7, 4>) is decided directly by the training tuples located at the query point itself.
In reality, many features and large domains make the space sparse: no training tuple coincides with <7, 4>, so a point lookup cannot answer ω(<7, 4>).
3-NN: among the 3 nearest neighbors of <7, 4>, the vote is ω1: 2, ω2: 1, so ω(<7, 4>) = ω1.
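The 3-NN decision above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the training tuples are hypothetical stand-ins chosen so that the vote near <7, 4> comes out ω1: 2, ω2: 1 as on the slide (classes written "w1"/"w2").

```python
from collections import Counter
import math

# Hypothetical 2-D training tuples (f1, f2, class) mirroring the slide's example.
train = [(6, 4, "w1"), (8, 5, "w1"), (7, 6, "w2"), (2, 9, "w2"), (15, 1, "w1")]

def knn_classify(x, data, k=3):
    """Classify point x by majority vote among its k nearest neighbors."""
    nearest = sorted(data, key=lambda t: math.dist(x, t[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((7, 4), train))  # → "w1" (vote w1: 2, w2: 1)
```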
LOCUS: count the training tuples of each class inside a rectangle centered at <7, 4>; the counts are ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.
LOCUS admits a simple disk-based implementation.
LOCUS as a single SQL query over R(f1, f2, ω), counting classes in the 2δ1 × 2δ2 box around the query point <x1, x2>:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
AND f2 ≥ x2-δ2 AND f2 ≤ x2+δ2
GROUP BY ω

Result: ω1: 7, ω2: 3 ⇒ ω(<7, 4>) = ω1
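The box-counting query runs on any relational engine. Below is a minimal sketch using an in-memory SQLite database; the table R(f1, f2, w) and its rows are hypothetical (SQLite column names cannot be ω, so the class column is named w), and the majority step assumes the box is non-empty.

```python
import sqlite3

# Hypothetical training relation R(f1, f2, w); rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, w TEXT)")
rows = [(6, 4, "w1"), (8, 5, "w1"), (7, 3, "w1"), (7, 6, "w2"), (9, 4, "w2")]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

def locus_classify(x1, x2, d1, d2):
    """Count each class inside the 2*d1 x 2*d2 box around <x1, x2>."""
    counts = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
        "GROUP BY w",
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    return max(counts, key=lambda c: c[1])[0]  # majority class wins

print(locus_classify(7, 4, 2, 2))  # → "w1"
```

The whole classification step is one aggregate query, which is what lets the DBMS optimizer (indexes, sorting) do the heavy lifting.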
What if R is large? Classical optimization techniques for this well-known type of aggregate query apply:
• Indexing
• Materialized views
• Presorting
Is the method reliable? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper).
What if a feature, say f2, is categorical (e.g., sex)? Its range predicate becomes an equality:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
AND f2 = x2
GROUP BY ω

This is not a problem, since in practice:
• datasets contain combinations of categorical and numeric features
• categorical features have small domains
Hence, categorical features do not contribute to sparsity.
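The mixed numeric/categorical case only changes one predicate. A minimal SQLite sketch under the same hypothetical-table assumptions as before (class column named w, f2 now a categorical code):

```python
import sqlite3

# Hypothetical relation with a numeric feature f1 and a categorical feature f2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)",
                 [(6, "F", "w1"), (8, "F", "w1"), (7, "M", "w2"), (9, "F", "w2")])

def locus_classify_mixed(x1, d1, x2):
    """Range predicate on the numeric feature, equality on the categorical one."""
    counts = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? GROUP BY w",
        (x1 - d1, x1 + d1, x2),
    ).fetchall()
    return max(counts, key=lambda c: c[1])[0]

print(locus_classify_mixed(7, 2, "F"))  # → "w1"
```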
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Parallel Execution: partition R = R1 ∪ R2 ∪ R3 ∪ R4 and run the same SELECT on each fragment.
Parallel Execution: count is a distributive aggregate function, so partial counts merge by addition. The four fragments return ω1: 5/ω2: 2, ω1: 7/ω2: 1, ω1: 6/ω2: 0, and ω1: 5/ω2: 1; the global result is ω1: 23, ω2: 4.
Benefits:
• Small network traffic (only (class, count) pairs are shipped)
• Load balancing
• Lightweight operations on the main server
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work