Enhancing Reporting Consistency in Distributed Data Sources

TRAC: Toward Recency And Consistency Reporting in a Database with Distributed Data Sources Jiansheng Huang Jeffrey F. Naughton Miron Livny

Motivating Scenario In a distributed monitoring system, autonomous nodes report in at unpredictable intervals, state captured at central site is always out of date and inconsistent

Specific Scenario • A cluster of machines • A job can be submitted to any node. • Submit node may schedule a job to be run on another machine. • The state is captured in a centralized database.

Submit job j Schedule job j Job j received here Job j running here m2 m1 State of m1 State of m2 Job j received here & scheduled to run on m2 Job j received here & scheduled to run on m2 Job j is running here Job j is running here No info about job j No info about job j Central site No info about job j No info about job j An Example

Enforcing Consistency • Option 1: Do everything in distributed transactions • Won’t scale to large systems. • At odds with the autonomous nature of nodes. • Option 2: Only present latest consistent snapshot • Can give rise to very out of date information to user.

Problem Addressed • Question: how can we help users cope with inconsistencies in collected data while retaining the scalability and autonomy of the system? • Our answer: instead of enforcing consistency, allow inconsistency and help user interpret what they see. • Issue: How to do so efficiently and without swamping user with too much irrelevant information?

A user asks: “Has the machine that m1 scheduled the job j to run started running it?” State of m1 State of m2 … 9998 more Job j received here & scheduled to run on m2 No info about job j …… Information system at central site Answer without recency reporting: NOT YET Reporting Recency Our idea is to only report: m1 last reported in at 09/12/2006 15:20, m2 last reported in at 09/11/2006 09:30 and nothing else. Naïve way to report recency: m1 last reported in at…, m2 last reported in at…, m3 last reported in at …… m10000 last reported in at …

Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work

RDBMS Terminology Data source: an abstraction for a node being monitored. S streaming of facts Recency timestamp: the most recent time a data source reported in.

Goals • Completeness: all “relevant” data sources are in report. • Precision: reduce the number of “irrelevant” data sources included in report using efficient techniques.

lastReport table (source id, recency ts) • Other relevant relations (c1,c2,…, source id) Schema Model • Assumption: updates from a data source can only make changes to tuples with its own source id in the data source field.

Definitions Definition 1. If Q references a single relation R, we say a data source s is relevant if exists a potential tuple t for R from s, s.t. t satisfies Q’s predicates. Definition 2. For a query Q referencing relations R1, R2, …, Rn, we say that a data source s is relevant for Q if exists j and a potential tuple tj for Rj, and for any k ≠j, exists a tuple tk for Rk, such that these tuples together satisfies Q’s predicates. In this case we say that s is relevant for Q via Rj. Theorem 1. No single update from an irrelevant data source can change the result of a query

Table 1: An example instance for Activity Mach_id Value Event_time M1 Idle 03/11/2006 20:37:46 M2 Busy 02/10/2006 18:22:01 M3 Idle 03/12/2006 10:23:05 SELECTmach_idFROMActivityWHEREmach_idIN (‘m1’, m2’) ANDvalue = ‘idle’; The query result is {‘m1’}. The set of relevant data sources for the query is {‘m1’, ‘m2’}. Example 1 Suppose we keep track of machine activities in a table called Activity. The attributes of Activity are mach_id, activity value and the time when an activity value becomes valid. We treat the machine ID as the data source column. Activity(mach_id, value, event_time)

Table 3: An example instance for Activity Table 2: A sample instance for Routing Mach_id Value Event_time Mach_id Neighbor Event_time M1 Idle 03/11/2006 20:37 M1 M3 03/12/2006 23:20 M2 Busy 02/10/2006 18:22 M2 M3 02/10/2006 03:34 M3 Idle 03/12/2006 10:23 SELECTA.mach_idFROMRouting R, Activity AWHERER.mach_id = ‘m1’ANDR.neighbor = A.mach_idANDA.value = ‘idle’; The query result is {‘m3’} . The set of data sources relevant via R is {‘m1’}, the set of data sources relevant via A is {‘m3’} Example 2 Consider a P2P system where we use Routing to capture neighboring relationships. Mach_id is treated as the data source column. Activity is same as in example 1. Routing(mach_id, neighbor, event_time)

The Focused Method Query parts User Query Analyze Generate lastReport Recency query Evaluate Recency Report Evaluate Query Result System

PL/pgSQL table function recencyReport: accepts a user query, evaluates it and reports recency information • Usage spec: SELECT * FROMrecencyReport($$SQL TEXT$$); Recency Reporting Prototype • Recency information includes: • A temporary table name for exceptional relevant data sources and their recency timestamps • Another temporary table name for the other relevant data sources and their recency timestamps • The least recent data source and its recency timestamp • The most recent data source and its recency timestamp • Bound of inconsistency

Goals of Experiments • Our approach raises many questions: • Is it expensive to analyze user queries and generate focused recency queries? • Is it inefficient to evaluate the recency query in addition to each user query? • Does the focused recency query really succeed in reducing the number of irrelevant data sources in recency report? • Our experiments are an attempt to begin to answer these questions empirically.

Methods Evaluated • Naïve method: the recency query simply returns all data sources as being relevant • Focused method: with automatic generation of recency query • A variant of Focused method: with hardcoding of recency query

Evaluation Metrics • False positive rate: the percentage of the number of irrelevant data sources reported vs. the number of relevant data sources. • Response time overhead: time overhead for the additional recency reporting

Schema and Data • lastReport, Activity, Routing as in earlier examples • Synthetic data: total row count of Activity fixed at 10,000,000. Vary the number of data sources and number of rows per data source (data ratio) inversely.

Test Queries • Selection query and two-way join queries are measured • Q3: joins Routing and Activity with a very selective predicate on Routing • Q4: similar to Q3, but with a non selective predicate on Routing

Overhead Comparisons: Q3 Figure1: Q3’s performance overhead for recency and consistency reporting w.r.t data ratio and # of data sources ((data ratio ) × (# of data sources) =10,000,000).

High Overhead Region, Q3 Figure2: Response times for Q3 with and without recency report w.r.t data ratio and # of data sources ((data ratio)×(#of data sources)= 10,000,000). The Focused method with auto generation of recency query is used here.

Overhead Comparisons, Q4 Figure3: Q4’s performance overhead for recency and consistency reporting w.r.t data ratio and # of data sources ((data ratio ) × (# of data sources) =10,000,000).

Performance Evaluation Summary • The overhead for analyzing a user query and generating a recency query is insignificant. • The overhead for evaluating a recency query is insignificant and the focused method has less or equal cost than the naïve method unless • Data ratio is very low and • A query has a join and • The query is not selective on data sources.

False Positive Rates • Naïve method: depending on exact number of data sources, assuming 100,000 for illustration • Q3: fpr = (100000-6)/6 = 16665 • Q4: fpr = 6/(100000-6) = 0.00006 • Focused methods: all 0 because the precise sets of relevant data sources are found

Conclusion • Large scale asynchronous system: reporting recency, rather than enforcing it, is a viable solution • Defining “relevance” is non-trivial. Our solution: a data source is relevant if a single update from it will change the result of a query • Evaluation on our prototype showed that our methods incur insignificant overhead in most cases and are more precise than the naïve method

Future Work • Key constraints • Maintenance cost • Other definitions of relevance • A data source is “relevant” if N updates from it may change query result • A data source is “relevant” if a sequence of updates from it may change query result

Backup Slides

Another Example [1] 1. http://www.cs.wisc.edu/condor/map

Example 3 Table 1: Another sample instance for Activity, same Routing table instance as in Example 2 SELECTA.mach_idFROMRouting R, Activity AWHERER.mach_id = ‘m1’ANDA.value = ‘idle’ANDR.neighbor = A.mach_id; The query result is empty.The set of data sources relevant via R is Ø, the set of data sources relevant via A is {‘m3’}. Therefore m1 is not relevant, but two updates from m1 will change the query result: 1) m1 is updated to ‘idle’ in Activity, and 2) m1 is added as a neighbor of m1 itself in Routing.

Reporting recency • Recency information of relevant data sources are stored in a session duration temporary table • Possible that even the number of relevant sources will be large, so we provide additional summary information: the minimum recency timestamp, the maximum recency timestamp, and the range of recency • Also since machine failures are common in Condor, report exceptional data sources using z-score outlier detection

Experiment Setup • System: Tao Linux 1.0, 2.4 GHz Intel, 512MB memory • Database: PostgreSQL 8.0.0 • Shared buffer pool: 8MB • Working memory size: 1MB • Each query run 11 times and avg time of the last 10 runs is used to minimize fluctuation.

Test Queries • Q1: SELECTCOUNT(*)FROMActivity AWHEREA.mach_id IN (‘Tao1’,’Tao10’,’Tao100’, ‘Tao1000’,’Tao10000’,’Tao100000’) AND A.value = ‘idle’; • Q2: SELECTCOUNT(*)FROMActivity AWHEREA.mach_id NOT IN (‘Tao1’,’Tao10’,’Tao100’, ‘Tao1000’,’Tao10000’,’Tao100000’) AND A.value = ‘idle’; • Q3: SELECTCOUNT(*)FROMRouting R, Activity AWHERER.mach_id IN (‘Tao1’,’Tao10’,’Tao100’, ‘Tao1000’,’Tao10000’,’Tao100000’) AND R.neighbor = A.mach_id AND A.value = ‘idle’; • Q4: SELECTCOUNT(*)FROMRouting R, Activity AWHERER.mach_id NOT IN (‘Tao1’,’Tao10’,’Tao100’, ‘Tao1000’,’Tao10000’,’Tao100000’) AND R.neighbor = A.mach_id AND A.value = ‘idle’;

Roadmap • Background • Definitions and Techniques • Prototype and Evaluation • Conclusion and Future Work • Related Work

Enforcing currency and consistency • R. Alonso et al., Quasi-copies: Efficient data sharing for information retrieval systems. In EDBT, pages 443-468, 1988. • H. Garcia-Molina et al., Read-only transactions in a distributed database. ACM Trans. Database Syst., 7(2):209-234, 1982. • R. Lenz. Adaptive distributed data management with weak consistent replicated data. In SAC, pages 178-185, 1996. • A. Segev and W. Fang. Currency-based updates to distributed materialized views. In ICDE, pages 512-520, 1990. • A. Labrinidis et al., Balancing performance and data freshness in web database servers. In VLDB, pages 393-404, 2003. • L. Bright et al., Using latency-recency profiles for data delivery on the web. In VLDB, pages 550-561, 2002. • H. Guo et al., Relaxed currency and consistency: How to say “good enough” in SQL. In SIGMOD Conference, pages 815-826, 2004. A common theme here is to enforce recency constraints through a combination of choosing the correct version of an object to query (I.e., the cached or the primary copy) or refreshing “stale” objects by synchronously “pulling” new data in response to a query.

Data Lineage • Y. Cui et al., Lineage tracing for general data warehouse transformations. In VLDB, pages 471-480, 2001. Identify the set of source data items that produced a view item, our work is different in that even if a data source doesn’t contribute any lineage data items (possibly due to latency in reporting in or some error), it may still be “relevant”.

Distributed Query Processing • R. Munz et al., Application of sub-predicate tests in database systems. In A. L. Furtado and H. L. Morgan, editors, VLDB, pages 426-435, 1979. The problem statement relies on data items being placed in a distributed environment in such a way that they satisfy various predicates, and then they use the interaction of the data placement predicates and query predicates to identify where data satisfying a simple query might be located

Partitioning Pruning • D. J. DeWitt et al., Gamma - a high performance dataflow database machine. In VLDB, pages 228-237, 1986. • IBM. DB2 UDB for z/OS Version 8 Performance Topics, 2005. • Oracle Corporation. Oracle Database Concepts, 10g Release 1, 2003. Choose partitions by matching certain types of selection predicates with the “partitioning predicates”

Enhancing Reporting Consistency in Distributed Data Sources

Enhancing Reporting Consistency in Distributed Data Sources

Presentation Transcript

Outline

Distributed Database and Replication

Outline

Distributed Database and Replication

Outline

Outline

Outline

Outline

Outline

DISTRIBUTED DATABASE SECURITY

The VERIS framework Consistency in Reporting Data Breaches

Distributed Data Mining in Discovery Net

DISTRIBUTED DATABASE DESIGN

DDBMS

Consistency and Replication

MOCHA : A Self-Extensible Database Middleware System for Distributed Data Sources

Outline

Distributed Database Security

Distributed Database Systems

ICS362 – Distributed Systems