
Interactive Data Exploration using Constraints


Presentation Transcript


  1. Interactive Data Exploration using Constraints. Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

  2. CP + DBMS for Data Intensive Exploration

  3. Interactive Data Exploration (IDE): Where’s Horrible Gelatinous Blob? Where’s Waldo? Searching for the “interesting” within big data • Exploratory analysis: ad-hoc & repetitive • Questions are not well defined • “Interesting” can be complex • Human-in-the-loop operation • Fast, online results • Query refinement

  4. Exploratory Queries: Some examples • First-order: “Celestial 3-5° by 5-7° regions with brightness > 0.8” • Higher-order: “Pairs of 2° by 2° celestial regions with similarity > 0.5” • Optimized: “Celestial 3° by 7° region with maximum brightness” • Data: Sloan Digital Sky Survey (SDSS)

  5. “Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL • Divide the data into cells • Enumerate all regions • Final filtering (> 0.8)
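The three steps on this slide (cells, enumeration, final filter) can be illustrated with a minimal Python sketch. It assumes the data has already been aggregated into a small in-memory grid with one brightness value per cell; the function name `naive_regions` and its parameters are hypothetical and are not taken from the presentation.

```python
from itertools import product


def naive_regions(grid, w_range, h_range, threshold):
    """Enumerate every rectangular window and keep those whose mean exceeds threshold.

    grid: list of rows, one aggregated brightness value per cell (assumed input).
    w_range, h_range: inclusive (min, max) window sizes, measured in cells.
    """
    rows, cols = len(grid), len(grid[0])
    results = []
    # Step 2: enumerate all shapes and all positions.
    for h, w in product(range(h_range[0], h_range[1] + 1),
                        range(w_range[0], w_range[1] + 1)):
        for top in range(rows - h + 1):
            for left in range(cols - w + 1):
                cells = [grid[r][c]
                         for r in range(top, top + h)
                         for c in range(left, left + w)]
                avg = sum(cells) / len(cells)
                # Step 3: final filtering on the aggregate.
                if avg > threshold:
                    results.append((top, left, h, w, avg))
    return results
```

Every window is fully materialized before the filter runs, which is the exhaustive behavior the slide attributes to the plain SQL formulation.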

  6. DBMSs for IDE? • No native support for exploratory constructs • No power set • No user-defined objective functions • No support for interactivity • No online results • No notion of a “query session”

  7. Data Exploration as a CP problem: “Celestial 3-5° by 5-7° regions with average brightness > 0.8” • Decision variables: left-most corner, lengths • Constraints: [formulas not captured in the transcript]
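One way to read this formulation is as four integer decision variables (corner `x`, `y` and lengths `lx`, `ly`) with shape and content constraints over them. The sketch below is an assumption-laden illustration, not the presenters' model: it uses a tiny hand-written backtracking search instead of a real CP solver, and the names `backtrack`, `window_model`, and `avg_br` (borrowed from the query text) are hypothetical.

```python
def backtrack(domains, constraints, assignment=None):
    """Yield every complete assignment that satisfies all constraints."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        yield dict(assignment)
        return
    # Pick the next unassigned variable in declaration order.
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        # Check only constraints whose variables are all assigned.
        if all(check(assignment)
               for scope, check in constraints
               if all(v in assignment for v in scope)):
            yield from backtrack(domains, constraints, assignment)
        del assignment[var]


def window_model(grid, len_x=(3, 5), len_y=(5, 7), threshold=0.8):
    """Build (domains, constraints) for the region query over an in-memory grid."""
    nx, ny = len(grid), len(grid[0])

    def avg_br(a):
        cells = [grid[i][j]
                 for i in range(a["x"], a["x"] + a["lx"])
                 for j in range(a["y"], a["y"] + a["ly"])]
        return sum(cells) / len(cells)

    # Decision variables: left-most corner (x, y) and lengths (lx, ly).
    domains = {
        "x": range(nx),
        "lx": range(len_x[0], len_x[1] + 1),
        "y": range(ny),
        "ly": range(len_y[0], len_y[1] + 1),
    }
    constraints = [
        (("x", "lx"), lambda a: a["x"] + a["lx"] <= nx),            # stay inside the grid
        (("y", "ly"), lambda a: a["y"] + a["ly"] <= ny),
        (("x", "lx", "y", "ly"), lambda a: avg_br(a) > threshold),  # content constraint
    ]
    return domains, constraints
```

Calling `list(backtrack(*window_model(grid)))` returns the satisfying windows. The bounds constraints prune partial assignments before the more expensive content constraint is ever evaluated.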

  8. CP Solvers • Large variety of methods for exploring a search space • Branch-and-Cut • Large Neighborhood Search (LNS) • Randomized search with Restarts • Highly extensible – important for ad-hoc exploration! • New constraints/functions • New search heuristics • But… comparing with DBMSs • In-memory data (CP) vs. efficient disk data handling (DBMS) • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

  9. SearchLight • A fusion of CP solvers and DBMSs • The DBMS stores and maintains data • The CP solver explores the constrained search space • SearchLight is a mediator • Extends CP solvers • Provides buffering, prefetching • Distributes the search • Makes CP solvers cost-aware [Architecture diagram labels: Exploration Query; Metadata; Buffering; Constraints/Functions; Search Heuristics; Data, schema info; Requests, Solutions; Data, estimates, decisions; Data requests, constraints; DBMS (PostgreSQL, SciDB); CP Solver (OR-tools, Gecode)]

  10. Research Issues • A cost model for data-intensive CP • Each search decision has an I/O cost • Mediation of data access • Meta-data for guiding and optimizing search (annotated trees, samples, etc.) • Prefetching • Distributed search • Multi-node parallel branch processing • CP/DBMS integrated query planning • Propagating CP/Schema constraints

  11. Semantic Windows (SW) • First step towards constraint-based exploration • Supports first-order queries • Exploration via multi-dimensional “windows of interest” • Shape-based constraints (“a 3-5° by 5-7° region”) • Content-based constraints (“avg_br() > 0.8”) • Custom distributed cost-aware solver

  12. SQL/CP Extensions for Data Exploration: SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness) FROM sdss GRID BY ra BETWEEN 100 AND 300 STEP 1 dec BETWEEN 5 AND 40 STEP 1 HAVING avg(brightness) > 0.8 AND size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7

  13. Cost-aware Solver • Best-first search based on the utility • Utility = f(benefit, cost) • Benefit – how close a window is to satisfying the constraints • A distance between the constraint’s value and the estimated value • Cost – how expensive it is to read a window from disk • Measured in cells we have to read • Adjustments are made for skewed data
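A best-first ordering by utility could look roughly like the Python sketch below. The slide does not define f, so the formula used here (benefit divided by one plus the number of cells to read) is purely an assumption, as are the function names and the flat list of candidate windows; a real solver would also update utilities as the search proceeds.

```python
import heapq


def benefit(estimate, threshold):
    """1.0 when the estimate already meets the threshold, shrinking with distance."""
    distance = max(0.0, threshold - estimate)
    return 1.0 / (1.0 + distance)


def utility(estimate, threshold, cells_to_read):
    """Assumed form of f(benefit, cost): benefit per unit of read cost."""
    return benefit(estimate, threshold) / (1 + cells_to_read)


def best_first(windows, threshold):
    """Yield (window_id, utility) in decreasing order of utility.

    windows: iterable of (window_id, estimated_value, cells_to_read).
    """
    # heapq is a min-heap, so utilities are negated to pop the largest first.
    heap = [(-utility(est, threshold, cost), wid) for wid, est, cost in windows]
    heapq.heapify(heap)
    while heap:
        neg_u, wid = heapq.heappop(heap)
        yield wid, -neg_u
```

With candidates `[("a", 0.9, 10), ("b", 0.5, 10), ("c", 0.9, 2)]` and a threshold of 0.8, the cheap, promising window `"c"` is explored first and the unpromising `"b"` last.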

  14. Optimizations • Cost and benefit are estimated by sampling • Objective function values are cached in a cell cache • Dynamic utility updates • Avoiding re-reads of the same cells • Constraint-based pruning during the search • Distributed search • Multiple nodes work in parallel
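The cell-cache idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `read_cell` is a hypothetical callable standing in for a disk read, one value is kept per cell, and the class name `CellCache` is invented for the example.

```python
class CellCache:
    """Caches per-cell values so overlapping windows never re-read a cell."""

    def __init__(self, read_cell):
        self._read_cell = read_cell  # hypothetical stand-in for a disk read
        self._cache = {}
        self.disk_reads = 0

    def get(self, cell):
        if cell not in self._cache:
            self._cache[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._cache[cell]

    def unread_cells(self, cells):
        """Remaining read cost of a window: cells not yet in the cache."""
        return sum(1 for c in cells if c not in self._cache)

    def window_avg(self, cells):
        values = [self.get(c) for c in cells]
        return sum(values) / len(values)
```

Because `unread_cells` shrinks as neighboring windows are read, a solver could recompute a window's cost from it, which is one plausible reading of "dynamic utility updates".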

  15. Adaptive Prefetching • Dispersed reads hurt total performance • Prefetching: read the neighborhood with every window • Progress-driven prefetching: how much? • Finding new results? Prefetch a small amount • No new results? Increase the prefetch exponentially [Figure panels: “No prefetching” (labels 1, 3, 4, 2); “With prefetching” (labels 3, 1, 2, 4)]
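The progress-driven rule on this slide can be written as a small Python function. The parameter names and defaults (`base`, `factor`, `limit`) are assumptions made for illustration; the presentation only states that the prefetch stays small while results keep arriving and grows exponentially otherwise.

```python
def next_prefetch_size(current, found_new_results, base=1, factor=2, limit=64):
    """Return the prefetch size (in cells) to use for the next read.

    base, factor, and limit are hypothetical tuning knobs.
    """
    if found_new_results:
        # Results are arriving: keep reads small to stay responsive.
        return base
    # No progress: widen the neighborhood exponentially, up to a cap.
    return min(limit, max(base, current) * factor)
```

Starting from 1 and seeing no results three times in a row, the size grows 2, 4, 8; the first window that yields a result resets it to `base`.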

  16. Online vs. Total Performance Results • 35GB data set (part of the SDSS) • 4GB total memory (1GB shared buffer) • First results in 10-20 seconds

  17. Conclusions • Integrate CP and DBMS technologies • SearchLight: Data-Intensive CP Engine • Initial implementation: Semantic Windows • Cost-aware solver • Mediating disk access (sampling, prefetching) • Distributed search • Current work: • OR-Tools as the CP solver • SciDB as the DBMS

  18. Questions? Supported by:
