Learning to Create Data-Integration Queries

Learning to Create Data-Integration QueriesParthaPratimTalukdar, Marie Jacob, Muhammad SalmanMehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, SudiptoGuhaVLDB2008 Seminar Presented byNoel GunasekarCSE Department – SUNY Buffalo

Learning to Create Data-Integration Queries Introduction Motivation Example Existing solutions Q-System Solution Q-System Architecture Query and Query Answers Executing Query Learning From feedback Conclusion Experimental results Future Work Learning to Create Data Integration Queries

Introduction Learning to Create Data Integration Queries

Motivation • Need for non-expert user to pose queries across multiple data resources. • Non-expert user - Not familiar with querying languages • Multiple resource - Databases, Data warehouses, Virtual integrated schemas Learning to Create Data Integration Queries

Bio-Science Field • Many "standardized" databases with overlapping and cross-referenced information • Each site is being independently extended, corrected, and analyzed • Differing levels of data quality/confidence Protein Databases Protein DataBase: - PDB information and service listings at Brookhaven National Laboratory [ BNL ] PIR: - Protein Identification Resource database at [ JHU ] PRF: - Protein Research Foundation database at GenomeNet SwissProt - Protein database at ExPASy [ Switzerland ] Learning to Create Data Integration Queries

Example • “What are the proteins and genes associated with the disease Narcolepsy?” • Life Sciences researcher querying on data-sources like genomics, disease studies and pharmacology. genomics Life Sciences Researcher Disease Studies http://www.expasy.org/uniprot/P04049 Pharmacology Learning to Create Data Integration Queries

ExistingSolution • Using keyword based queries on Web-Forms • Match the keywords with terms in the tuples and form the query by joining different databases using foreign-keys • Cost for the query is fixed and doesn’t accommodate the context of the query http://www.expasy.org/uniprot/P0C852 Learning to Create Data Integration Queries

Proposed Solution - Q System Automatically generate Web-Forms for given set of keywords Pose queries across multiple data resources using the generated web-form Learning to Create Data Integration Queries

Proposed Solution - Q System Create re-usable web-form User (Author) Q System Keywords Protein, gene, disease Reusable Web-Form For querying Use web-form for Querying Users (Author + others) Query Results Parameters Reusable Web-Form For querying Learning to Create Data Integration Queries

Q System Architecture Learning to Create Data Integration Queries

Architecture of Q System • Four Components • Initial Schema Loader • Query Template Creation • Query Execution • Learning Through Feedback Learning to Create Data Integration Queries

Architecture of Q System Learning to Create Data Integration Queries

Initial Setup • Schema Loader • Input • Given a set of data sources with its own schema • Foreign Keys and Links • Schema Mappings • Record Link • Output • Schema Graph Learning to Create Data Integration Queries

b c 0.07 0.04 0.1 d Initial Setup Example Schema Graph Node: Databases and their attributes (UniProt database, Entrez GeneInfo db, term) Edge: Relation based on foreign keys/cross-references (UniProt to PIR) Cost: Reliability, completeness Learning to Create Data Integration Queries

Query Template Creation Learning to Create Data Integration Queries

Query Template Creation - Example Input: “protein”, “plasma membrane”, “gene” and “disease” Output: Learning to Create Data Integration Queries

a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Template Creation - Example Query Keywords a, e, f Schema Graph Q1 Q2

Query Formulation • Trees can be easily written as executable queries: Steiner Tree a b 0.1 0.1 d f 0.1 0.1 e Conjunctive query: a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)

View Refinement

Web-Form

Query Execution Learning to Create Data Integration Queries

Input: Web-Form

Output: Result Answers Q1 Q1 Q1,2 Q2 Q2 Q2 System determines “producer” queries using provenance

Query Execution • Query Processing Engine with • Support for querying remote data sources • Record data provenance • Solution: • ORCHESTRA • http://www.cis.upenn.edu/~zives/orchestra/ Learning to Create Data Integration Queries

Orchestra Project • The ORCHESTRA project focuses on the challenges of data sharing scenarios in the sciences • Bioinformatics Scenario - many "standardized" databases with overlapping information, similar but not identical data and differing levels of data quality/confidence • Each site is being independently extended, corrected, and analyzed • ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, with disagreeing users Learning to Create Data Integration Queries

Orchestra Project – Data Provenance http://www.cis.upenn.edu/~zives/research/exchange.pdf Learning to Create Data Integration Queries

Learning through Feedback Learning to Create Data Integration Queries

Learning through Feedback • Input: Ranked Results + provenance Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries

Learning through Feedback • User provides feedback Q1 Q1 Q1,2 Q2 Q2 Q2 Learning to Create Data Integration Queries

Query Formulation - Recap a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Find trees connecting red nodes a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Query Keywordsa, e, f Schema Graph Q1 Q2

Learning through Feedback a b a b c 0.1 0.1 0.07 0.05 0.04 Q1 0.1 Q2 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.39 Rank = 2 Cost = 0. 41 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.4 e e • Change weights so Q2 is “cheaper” than Q1 a b c 0.05 0.07 0.1 0.04 0.1 d f 0.1 0.1 e

Iteration! Learning to Create Data Integration Queries

Q-System: Challenges • Computation of ranked queries which in turn produce ranked tuples: K-Best Steiner Tree Generation • Predicting new query rankings based on user feedback over tuples, and also generalizing feedback: Learning • Maintaining associations between tuples and queries: Query answers with provenance • Everything at interactive speed!

f= 1 if the edge connects Term and Synonym tables, else 0 Term Synonym f 1 w8 Cost of a Query • Query Cost = Sum of edge costs in the tree. • Edge Cost = Sum of weights of features defined over it. • Features are properties of the edges, e.g., nodes connected • Each feature has a corresponding weight. • Feature example:

a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e Steiner Trees: Finding Lowest-Cost Queries • A tree of minimal cost in a graph (G) which includes all the required nodes (S). • Cost of a Steiner Tree is the sum of costs of edges present in the tree. • Steiner Tree is generalization of Minimum Spanning Tree (MST) [equivalent when S = all vertices in G].

K-Best Steiner Tree Algorithms • Exact (practical for ~100 nodes and edges). • Integer Linear Program (ILP) based formulation for finding K-best Steiner Trees in a graph. • The ILP uses ideas from multi-commodity network flows • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Significantly faster; in practice, often gives optimal solution.

Multi-Commodity Flow Problem

MIP for min-cost Steiner Tree

MIP for K min-cost Steiner Tree

Constraints C1 :Flow of commodity k starts at root r C2 : Flow of commodity k terminates at node k C3 : Conservation of flow at Steiner nodes C4 : Flow of an edge allowed only if that edge is included ( Yij = 1 ) C5 : Non-negativity constraint C6 : Defines value for Y C7 : Ensures no incoming active edge into the root C8 : Ensures that all nodes have at most one incoming active edge C9 : Flow of at least one commodity on all edges in I C10 : Ensures no flow on edges in X

a b c 0.1 0.07 0.04 0.1 d f 0.1 0.1 e a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 Rank = 1 Cost = 0.4 Rank = 2 Cost = 0.41 e e Finding K-Best Steiner Trees 2-best Steiner trees connecting terminal nodes.

K-Best Steiner Tree Algorithms • Approximate (for 100s+ nodes and edges). • Novel Shortest Paths Complete Subgraph Heuristic. • Send “m” shortest path graph as input. • Shortest path between each pair of nodes in S • Significantly faster; in practice, often gives optimal solution.

Tuples Query a b Q d f T e . . . . . . a b c Q* d f T* e Q Challenge : Getting User Feedback Top . . . Bottom T and T* differ in 3 edges. This difference is termed loss: L(T, T*)

Learning: Update Weights Edge Cost: 0.07 Term(T1) Term2Term w8 = 0.06 w25 = 0.01 Edge Cost Update Edge Cost: 0.05 Term(T1) Term2Term w8 = 0.04 w25 = 0.01

Re-ranked Steiner Trees a b a b c 0.1 0.1 0.07 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Weight Update a b c a b 0.1 0.05 0.1 0.04 0.1 d f d f 0.1 0.1 0.1 0.1 e e Rank 1 Rank 2

Experimental Results • The Key Questions • Can the algorithm start with uninitialized weights and learn expert (“gold standard”) ranking of queries? • Can the results be generated at interactive speeds? • Does the approach scale to larger graphs?

# Gold queries absent in top-3 predictions Results: Learning Expert Weights • Graph: Start with the BioGuide bio sources, with 28 vertices and 96 edges. • Goal: Learn the queries corresponding to the expert-set weights in BioGuide • Methodology: • All weights are set to default. • Sequence of 25 searches • For each, user feedback identifies & promotes a tuple from the gold standard answer. • After 40-60% searches with feedback, system finds the top query immediately. • For each individual search, a single feedback is enough to learn the top query.

Results: Time to generate K-best Queries • Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007). It is possible to generate the top query in < 1 sec and top 5 queries in about 2 sec, all within interactive range. Query execution is pipelined.

Results: Scalability to Larger Graphs • Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL. Queries (K) It is possible to do K-best inference in larger graphs quickly and with little or no loss (none in this case).

Experimental Results • “Gold Standard” • Using BioGuide – a biomedical information integration system • BioGuide generates the schema graph based on keywords • The edge cost in the schema graph are manually assigned by experts • This expert given schema graph is called the “gold standard” • Experiment involves in comparing the result produced by the q system with the results produced by the gold standard. Learning to Create Data Integration Queries

Learning to Create Data-Integration Queries

Learning to Create Data-Integration Queries

Presentation Transcript

Data Integration

Data Integration to Data Governance

Data integration

Data Management: Queries

Data Integration

Data Integration

Learning to Create Presentations

Data Integration

Data Integration

Data Integration

Data integration

Custom Data Queries

Queries for data modification: Action queries

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

Data integration

Data Integration

Data Integration

Overview: Ask your data — create queries

Data Queries

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

Data Management: Queries

From Data Access to Data Integration