Interactive Data Exploration using Constraints

Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

CP + DBMS for Data-Intensive Exploration

Interactive Data Exploration (IDE) Where’s Horrible Gelatinous Blob? Where’s Waldo? Searching for the “interesting” within big data • Exploratory analysis: ad hoc & repetitive • Questions are not well defined • “Interesting” can be complex • Human-in-the-loop operation • Fast, online results • Query refinement

Exploratory Queries: Some examples • First-order • “Celestial 3-5° by 5-7° regions with brightness > 0.8” • Higher-order • “Pairs of 2° by 2° celestial regions with similarity > 0.5” • Optimized • “Celestial 3° by 7° region with maximum brightness” Sloan Digital Sky Survey (SDSS)

“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL • Divide the data into cells • Enumerate all regions • Final filtering (> 0.8)
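The three steps above can be sketched in a few lines of Python. This is only an illustration of the brute-force idea, not the SQL from the talk: it assumes the data has already been divided into a small in-memory grid of per-cell brightness values, and the function name `enumerate_regions` and its parameters are hypothetical.

```python
from itertools import product


def enumerate_regions(grid, width_range, height_range, threshold):
    """Yield (row, col, height, width, avg) for every window above the threshold.

    grid         -- list of rows, each a list of per-cell brightness values
    width_range  -- (min, max) window width in cells, inclusive
    height_range -- (min, max) window height in cells, inclusive
    """
    rows, cols = len(grid), len(grid[0])
    heights = range(height_range[0], height_range[1] + 1)
    widths = range(width_range[0], width_range[1] + 1)
    # Step 2: enumerate every shape at every position.
    for h, w in product(heights, widths):
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                total = sum(grid[i][j]
                            for i in range(r, r + h)
                            for j in range(c, c + w))
                avg = total / (h * w)
                # Step 3: final filtering on the aggregate.
                if avg > threshold:
                    yield (r, c, h, w, avg)


if __name__ == "__main__":
    cells = [[1.0, 0.0],
             [1.0, 0.0]]
    for region in enumerate_regions(cells, (1, 1), (2, 2), 0.8):
        print(region)
```

Every candidate window is materialized and aggregated before the filter is applied, which is why this approach scales poorly as the grid and the range of shapes grow.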

DBMSs for IDE? • No native support for exploratory constructs • No power set • No user-defined objective functions • No support for interactivity • No online results • No notion of a “query session”

Data Exploration as a CP problem “Celestial 3-5° by 5-7° regions with average brightness > 0.8” • Decision variables: left-most corner, lengths • Constraints: shape (3-5° by 5-7°) and content (average brightness > 0.8)
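One way to write this query as a CP model is sketched below. The slide's own formulas did not survive extraction, so the symbols are assumptions: $x, y$ for the left-most corner, $l_x, l_y$ for the lengths, and $\operatorname{avg\_br}$ for the average-brightness function. Which dimension takes the 3-5° range is also assumed.

```latex
\begin{align*}
&\text{Decision variables:} && x,\; y \;\; \text{(left-most corner)}, \qquad l_x,\; l_y \;\; \text{(lengths)} \\
&\text{Shape constraints:}  && 3 \le l_x \le 5, \qquad 5 \le l_y \le 7 \\
&\text{Content constraint:} && \operatorname{avg\_br}(x, y, l_x, l_y) > 0.8
\end{align*}
```

Each assignment of the four variables that satisfies all constraints corresponds to one result region.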

CP Solvers • Large variety of methods for exploring a search space • Branch-and-Cut • Large Neighborhood Search (LNS) • Randomized search with Restarts • Highly extensible – important for ad-hoc exploration! • New constraints/functions • New search heuristics • But, compared with DBMSs: • In-memory data (CP) vs. efficient disk data handling (DBMS) • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

SearchLight • A fusion of CP solvers and DBMSs • The DBMS stores and maintains data • The CP solver explores the constrained search space • SearchLight is a mediator • Extends CP solvers • Provides buffering, prefetching • Distributes the search • Makes CP solvers cost-aware [Architecture diagram: SearchLight (exploration query, metadata, buffering) sits between the CP solver (OR-tools, Gecode) and the DBMS (PostgreSQL, SciDB). Labels: constraints/functions; search heuristics; data, schema info; requests, solutions; data, estimates, decisions; data requests, constraints]

Research Issues • A cost model for data-intensive CP • Each search decision has an I/O cost • Mediation of data access • Meta-data for guiding and optimizing search (annotated trees, samples, etc.) • Prefetching • Distributed search • Multi-node parallel branch processing • CP/DBMS integrated query planning • Propagating CP/Schema constraints

Semantic Windows (SW) • First step towards constraint-based exploration • Supports first-order queries • Exploration via multi-dimensional “windows of interest” • Shape-based constraints (“a 3-5° by 5-7° region”) • Content-based constraints (“avg_br() > 0.8”) • Custom distributed cost-aware solver

SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1 dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7

Cost-aware Solver • Best-first search based on the utility • Utility = f(benefit, cost) • Benefit – how close a window is to satisfying the constraints • A distance between the constraint’s value and the estimated value • Cost – how expensive it is to read a window from disk • Measured in cells we have to read • Adjustments are made for skewed data
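A minimal sketch of utility-ordered exploration follows. The slide does not define f, so the linear combination, the `cost_weight` parameter, the benefit formula, and all names here are assumptions chosen for illustration. The sketch also omits the skew adjustment and any re-estimation during the search.

```python
import heapq


def benefit(estimate, threshold):
    """1.0 if the estimate already meets the threshold, shrinking with the shortfall."""
    shortfall = max(0.0, threshold - estimate)
    return 1.0 / (1.0 + shortfall)


def utility(estimate, threshold, cells_to_read, total_cells, cost_weight=0.5):
    """Assumed form of f(benefit, cost): weighted benefit minus weighted cost."""
    cost = cells_to_read / total_cells  # normalized to [0, 1]
    return (1.0 - cost_weight) * benefit(estimate, threshold) - cost_weight * cost


def best_first_order(windows, threshold, total_cells):
    """Return window names in the order a best-first search would read them.

    windows -- iterable of (name, estimated_value, cells_to_read)
    """
    heap = []
    for index, (name, estimate, cells) in enumerate(windows):
        u = utility(estimate, threshold, cells, total_cells)
        # heapq is a min-heap, so push the negated utility; index breaks ties.
        heapq.heappush(heap, (-u, index, name))
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


if __name__ == "__main__":
    candidates = [("A", 0.9, 10), ("B", 0.5, 10), ("C", 0.9, 100)]
    print(best_first_order(candidates, threshold=0.8, total_cells=100))
```

With these assumed weights, a cheap window that is estimated to satisfy the constraint is read before an equally promising but expensive one.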

Optimizations • Cost and benefit are estimated by sampling • Objective function values are cached in a cell cache • Dynamic utility updates • Avoiding re-reads of the same cells • Constraint-based pruning during the search • Distributed search • Multiple nodes work in parallel
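The cell-cache idea can be illustrated with a small sketch. It assumes a hypothetical `read_cell` callback that stands in for a disk read and returns one cell's objective value. The class and function names are illustrative, not taken from the system.

```python
class CellCache:
    """Caches per-cell values so overlapping windows do not re-read cells."""

    def __init__(self, read_cell):
        self._read_cell = read_cell  # stand-in for an expensive disk read
        self._values = {}
        self.disk_reads = 0

    def get(self, cell):
        if cell not in self._values:
            self._values[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._values[cell]


def window_average(cache, row, col, height, width):
    """Average of the cached cell values inside one window."""
    total = sum(cache.get((r, c))
                for r in range(row, row + height)
                for c in range(col, col + width))
    return total / (height * width)


if __name__ == "__main__":
    cache = CellCache(lambda cell: float(cell[0] + cell[1]))
    window_average(cache, 0, 0, 2, 2)
    window_average(cache, 0, 1, 2, 2)  # overlaps the first window in two cells
    print(cache.disk_reads)
```

Two overlapping 2-by-2 windows touch eight cells in total, but only the six distinct cells are read.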

Adaptive Prefetching • Dispersed reads hit total performance • Prefetching: read the neighborhood with every window • Progress-driven prefetching: how much? • Finding new results? Prefetch a small amount • No new results? Increase the prefetch exponentially [Figure: no prefetching vs. with prefetching]
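The progress-driven policy can be sketched as a single update rule. The slide gives no concrete sizes, so resetting to a minimum on each new result, the doubling factor, and the upper cap are all assumptions, and `next_prefetch_size` is a hypothetical name.

```python
def next_prefetch_size(current, found_new_result, minimum=1, maximum=64, factor=2):
    """Return the prefetch size (in cells around a window) for the next read.

    New results keep the prefetch small; a dry spell grows it exponentially,
    capped at `maximum`.
    """
    if found_new_result:
        return minimum
    return min(current * factor, maximum)


if __name__ == "__main__":
    size = 1
    for found in [False, False, False, True, False]:
        size = next_prefetch_size(size, found)
        print(size)
```

The cap keeps a long run without results from turning the prefetch into a scan of the whole data set.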

Online vs. Total Performance Results • 35GB data set (part of the SDSS) • 4GB total memory (1GB shared buffer) • First results in 10-20 seconds

Conclusions • Integrate CP and DBMS technologies • SearchLight: Data-Intensive CP Engine • Initial implementation: Semantic Windows • Cost-aware solver • Mediating disk access (sampling, prefetching) • Distributed search • Current work: • OR-Tools as the CP solver • SciDB as the DBMS

Questions?
