310 likes | 405 Vues
Interactive Query Formulation over Web Service-Accessed Sources. Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou. ACM SIGMOD, June 2006. Large-Scale Data Integration Systems. . Web Domain. Web Forms & Reports. CNET’s Top Combinations CNET’s Search Desktops
E N D
Interactive Query Formulationover Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006
Large-Scale Data Integration Systems Web Domain Web Forms & Reports • CNET’s Top Combinations • CNET’s Search Desktops • PCWorld’s Product Finder End User Application Domain • CNET Computer • PCWorld Portals Developer Application Application Integration Domain Mediator Integrated Schema Compatible Combinations of Computers, Routers and Printers Integration Engineer Source Domain • Dell Computers by CPU • Cisco Routers by Rate • HP Printers by Speed Web Service Web Service Web Service … Source Owner Data Source Data Source • Dell Computers • Cisco Routers • HP Printers Source Schema Source Schema …
Large-Scale Data Integration Systems Web Domain Web Forms & Reports End User Application Domain What queries can the mediator answer for me? CLIDE Developer Application Application Integration Domain Mediator Integrated Schema Integration Engineer Source Domain Web Service Web Service Web Service … Source Owner Data Source Data Source Source Schema Source Schema …
Running Example Parameterized Views Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Views V1ComByCpu(cpu) (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu V2 ComNetByCpuRate(cpu, rate) (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate Dell Cisco Schema Routers(rate, standard, price, type) Views V3 RouWired() (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' V4 RouWireless() (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Wired Routers Computers for a givencpu Wireless Routers Computers & NetCards for a givencpu & rate Conjunctive Queries CQ • Equality & Comparison Conditions • Parameters
Running Example Integrated Schema Developer Application Mediator Integrated Schema • Integrated schema puts togetherthe Dell and Cisco schemas Attribute Associations • (Computers.cid, NetCards.cid) • (NetCards.rate, Routers.rate) • (NetCards.standard, Routers.standard) V1 V2 V3 V4 Dell Cisco
Sophisticated Mediators MakeFeasible Queries Hard to Predict Feasible Queries FQ • Equivalent CQ query rewritings using the views • Might involve more than one views • Order might matter Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers Query: Get all Computers Feasible Infeasible E Mediator B D Mediator A C RouWireless() ComNetByCpuRate(‘P4’, ‘10’) V1 ComNetByCpuRate(‘P4’, ‘54’) V4 V2
Problem • Large number of sources • Large number of views (web-services) • Mediator capabilities Developer formulates an application query • Is an application query feasible? • If not, how do I know which ones are feasible? Previous options: • The developer had to browse the view definitions and somehow formulate a feasible query • Or formulate queries until a feasible one is found(trial-and-error) No system-provided guidance
The CLIDE Solution CLIDE • A query formulation interface, which interactivelyguides the developer toward feasible queries by employing a coloring scheme Developer Application Mediator Integrated Schema V1 V2 V3 V4 Dell Cisco
QBE-Like Interfaces Microsoft SQL-Server
CLIDE Interface Feasibility Flag Selection Boxes • Table, selection, projection and join actions • Feasibility Flag • Color-based suggestions
Example Interaction Snapshot 1 Yellow required action • All feasible queries require this action White optional action • Feasible queries can be formulatedw/ or w/o these actions
Example Interaction Snapshot 2 Blue required choice of action • At least one of them is required to be taken in order to reach a feasible query Join Lines: • Only yellow and blue are displayed • Must appear in Attribute Associations
CLIDE Properties • Completeness of Suggestions • Every feasible query can be formulated by performing yellow and blue actions at every step • Minimality of Suggestions • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions • The shortest sequence of actions from a query to any feasible query consists of suggested actions
Example Interaction Snapshot 3 • * any other constant • Red prohibited action • Does not appear in any feasible query • Lead to “Dead End” state
Example Interaction Snapshot 4 F Mediator A D RouWireless() ComNetByCpuRate(‘P4’, rate) B E V4 V2
CLIDE Properties • Completeness of Suggestions • Every feasible query can be formulated by performing yellow and blue actions at every step • Minimality of Suggestions • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions • The shortest sequence of actions from a query to any feasible query consists of suggested actions
Interaction Graph Selection Action Table Action Join Action Com1 Com1.ram Com1.price Com1.cpu=‘P4’ Net1 Com1.cid=Net1.cid Rou1 … … … … … … … … … … • Nodes are queries • One for each qCQ • Edges are actions • Table, selection, projection and join actions • Green nodes are feasible queries • Infinitely big structure • All CQ queries • All possible combinations of actions formulating them
Interaction Graph: Colors • Yellow action • Every path from current node n to a feasible node contains • Blue action • At least one path from current node n to a feasible node nF contains • There is not path from n to nF that contains a feasible node • Red action • No path to a feasible node contains Current Node Com1.cid … Com1.cpu … Current Node Com1.cid=* … … Com1.cpu=* Com1.cpu=* Net1 Com1.cid=Net1.cid Com1.cid=Net1.cid Net1.rate=’54Mbps’ … … … … Com1.ram=* Com1.cid=Net1.cid … … Net1.rate=’54Mbps’ Net1.rate=’54Mbps’ Com1.price=* … … … Com1.cpu=* Rou1 Rou1 Com1.cid=Net1.cid Net1.rate=Rou1.rate Net1 … … … … … … Rou1 Com2 Com2 … Com2 Com2.cid=Net1.cid Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ … … … … …
CLIDE Properties • Rapid Convergence By Following Suggestions • The shortest path from the current node to any feasible node consists of suggested actions
CLIDE Architecture Actions • Back-End invoked every time the user performs an action • i.e., the user arrives at a new node in the interactions graph Front-End User Current Query Colored Actions + Feasibility Flag Back-End Color Algorithm Seed Queries SQ Parameters Algorithm Closest Feasible Queries FQC Closest Feasible Queries Algorithm Aliases Collapse Rule Minimal Feasible Extension Queries Maximally-Contained Rewriter Schemas Views Attribute Associations
Color DeterminedBy a Finite Set of Feasible Queries Challenge: Infinitely Many Feasible Queries ? Radius … … … n … … Closest Feasible Queries FQC … Solution: • Finite Set of Feasible Queries • Closest Feasible Queries FQC • FQC is sufficient to color actions Challenge: How far can the Closest Feasible Queries FQC be?
Closest Feasible Queries FQC Algorithm Solution: Based on Maximally Contained Queries FQMC Maximally Contained Queries FQMC pL Radius … … … n … … Closest Feasible Queries FQC … • Compute maximally contained queries FQMC • Radius pL is the longest path to a maximally contained query • Theorem: All FQC queries are reachable via a path of length p pL Challenge: The set of nodes within pL can be too many to explore
Closest Feasible Queries FQC Algorithm Solution: Find the Closest Feasible Queries Directly Maximally Contained Queries FQMC … … … n … … Closest Feasible Queries FQC … More feasible nodes • Theorem: All queries in FQMC are in FQC • But not all queries in FQC are in FQMC • Collapse Aliases to compute FQC \ FQMC
Remaining Issues of the Algorithm • Color remaining actions white or red • Handling projections • Dealing with parameterized queries
CLIDE Implementation Issues • Maximally contained queries need to be minimized(affect CLIDE’s rapid convergence and minimality) Implemented the minimization module from scratch MiniCon computes mappings between the query and the views Exposed these mappings to guide minimization • Optimizations to bring the response time below 3 sec Amortize the computation across multiple interactions steps by observing that query is changed minimally at every step Instead of calling MiniCon from scratch at every step, we incrementally compute the closest feasible queries
Thank you Play with our demo: http://www.clide.info
Maximally Contained Queries FQMC Not Maximally Contained Query: Q3 Get all Computers with a given cpu & ram Maximally Contained Query Query: Q4 Get all Computers with a given ram Maximally Contained Query Query: Q2 Get all Computerswith a given cpu Query: Q1 Get all Computers • Assuming fixed SELECT clause (projection list) • Covered extensively in literature • MiniCon, Bucket, InverseRules Algorithms • FQMC is finite
Closest Feasible Queries FQC Algorithm Solution: Collapse Aliases Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … n … … … … • Collapse Aliases to compute FQC \ FQMC • Check satisfiability
Color Algorithm Yellow and Blue • An action is colored based on which closest feasible queries it appear in • Yellow, if appears in all queries in FQC • Blue, if appears in at least one (but not all) query in FQC White and Red • Attach Maximum Projection Lists to Closest Feasible Queries • Projections that can be added to a feasible query, without compromising feasibility • Projection is white if in the maximum projection list • Color selections based on projections
CLIDE Implementation & Optimizations Front-End Maximally-Contained Rewriter Current Query Colored Actions + Feasibility Flag Minimal Feasible Extension Queries FQME + Maximum Projection Lists Back-End Redundant Actions Removal Color Algorithm Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Seed Queries SQ Redundant Queries Removal Parameters Algorithm Feasible Extension Queries + Maximum Projection Lists Closest Feasible Queries FQC Closest Feasible Queries Algorithm Views Expansion Aliases Collapse Rule Maximally-Contained Feasible Queries over Views + Containment Mappings Minimal Feasible Extension Queries Maximally-Contained Rewriter MiniCon Containment Mappings Logging Schemas Views Attribute Associations • Views expansion introduce redundancy • Affects CLIDE’s rapid convergence and minimality • Efficient containment test crucial to redundancy removal
CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A