On Load Shedding in Complex Event Processing. Authors: Yeye He (Microsoft Research), Siddharth Barman (California Institute of Technology), Jeffrey F. Naughton (University of Wisconsin-Madison). Presenter (non-author): Arvind Arasu (Microsoft Research).
Overview • Background: Complex Event Processing (CEP) • A different stream processing model • Problem: Load shedding in CEP • Maximize utility under resource constraints • Focus of this work • A problem taxonomy, hardness, and approximations
Background: CEP Data Model • CEP event data • Event stream S = (e_1, e_2, …) • Each event e_i is associated with an event type E_j • Each event e_i has a time-stamp t(e_i) • Stream S is temporally ordered: t(e_i) < t(e_{i+1}) for all i • Example stream: a1 b2 c3 d4 a5 b6 c7 d8 • The superscript of an event denotes its time-stamp, e.g. t(a1) = 1 • Each event is associated with a type, e.g. event a1 is of type A • The set of four event types is {A, B, C, D}
Background: CEP Query Model • CEP sequence query • Q = SEQ(E_1, E_2, …, E_m), where the E_k are event types • A time-based query window T(Q) • Only conjunctive queries are considered in this work • An event sequence (e_{i_1}, …, e_{i_m}) is a query match of Q if • Types match: e_{i_k} is of type E_k for all k ∈ [m] • In query window: t(e_{i_m}) − t(e_{i_1}) ≤ T(Q) • Example stream: a1 b2 c3 d4 a5 b6 c7 d8 • Q1 = SEQ4(A, B) in 4 min • Q2 = SEQ4(B, C) in 4 min • Q3 = SEQ4(C, D) in 4 min • [Figure: matches of Q1, Q2, and Q3 marked on the stream; one candidate Q1 pair is labeled as outside the time window]
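The match definition on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(type, time-stamp)` tuple encoding, the function name `seq_matches`, and the inclusive window test (`<=`) are choices made here, not something the slides specify.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (event type, time-stamp); an assumed encoding


def seq_matches(stream: List[Event], pattern: List[str], window: int) -> List[Tuple[Event, ...]]:
    """Enumerate matches of SEQ(pattern) whose time span is at most `window`.

    Assumes `stream` is sorted by time-stamp, as in the data model.
    """
    matches: List[Tuple[Event, ...]] = []

    def extend(start_idx: int, partial: List[Event]) -> None:
        if len(partial) == len(pattern):
            matches.append(tuple(partial))
            return
        wanted = pattern[len(partial)]
        for i in range(start_idx, len(stream)):
            etype, ts = stream[i]
            # Later events only move further from the first matched event,
            # so the scan can stop once the window is exceeded.
            if partial and ts - partial[0][1] > window:
                break
            if etype == wanted:
                extend(i + 1, partial + [stream[i]])

    extend(0, [])
    return matches


# The example stream from the slides: a1 b2 c3 d4 a5 b6 c7 d8
stream = [("A", 1), ("B", 2), ("C", 3), ("D", 4),
          ("A", 5), ("B", 6), ("C", 7), ("D", 8)]
q1 = seq_matches(stream, ["A", "B"], window=4)
```

On the example stream, `q1` contains the pairs (a1, b2) and (a5, b6); the pair (a1, b6) spans 5 time units and is rejected, which is consistent with the "outside time-window" label in the figure.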
The Load Shedding Problem • Event streams are often bursty • Not all events can be processed in time • Given resource constraints (CPU/memory) • Problem: Selectively “shed” data/processing • To preserve the most useful query results
Query Utility in CEP • Use query utility to quantify usefulness • Utility weight W(Qi) of query Qi models its importance • Example stream: a1 b2 c3 d4 a5 b6 c7 d8 • Q1 = SEQ4(A, B), W(Q1)=3 • Q2 = SEQ4(B, C), W(Q2)=2 • Q3 = SEQ4(C, D), W(Q3)=4 • [Figure: matches on the stream labeled with their weights: Q1 (W=3), Q2 (W=2), Q3 (W=4), each appearing twice]
Utility Maximizing Load Shedding • Given a set of queries {Qi} • Given the expected query matches per unit time interval • Estimated using event arrival statistics • Find a type-level, global shedding strategy that • Maximizes the expected utility • Respects resource constraints (memory/CPU/dual) • Integral: discard all events/queries of certain types • Fractional: discard randomly sampled events/queries of certain types
Why Expected Utility? • Online algorithms with competitive ratio? • Hopeless! • No algorithm can have competitive ratio better than , where is the length of the event sequence • Prove by using an adversarial scenario
An Adversarial Scenario • event types: • unit-weight queries: SEQ(), • Event sequence: () • is of type , • drawn from with equal probability • Memory budget = 2 events • Offline optimal: utility = 1 • pick one from based on X • Online optimal: expected utility = • Competitive ratio: • Instead, we optimize utility in the expected sense
Resource Constraint: Limited CPU • Not all queries can be processed by the CPU • E.g., the CPU needs to process 3 unit-cost queries (per 4 time units) • Unit cost for simplicity; queries can have arbitrary costs • Suppose the CPU can only process 2 queries • Best strategy: discard Q2, keep Q1 and Q3 (the highest-gain queries) • Example stream: a1 b2 c3 d4 a5 b6 c7 d8 • Q1 = SEQ4(A, B), W(Q1)=3 • Q2 = SEQ4(B, C), W(Q2)=2 • Q3 = SEQ4(C, D), W(Q3)=4 • [Figure: the stream with the Q1, Q2, and Q3 matches and their weights]
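The CPU-bound example above is small enough to check by exhaustive search. The sketch below is illustrative only: the function name `best_query_subset` and the dictionary encoding are assumptions made here, and brute force is used purely because there are three queries, not because the talk proposes it.

```python
from itertools import combinations


def best_query_subset(weights, costs, cpu_budget):
    """Return (utility, kept queries) maximizing total weight within the CPU budget.

    Exhaustive search over query subsets; only sensible for toy instances.
    """
    names = list(weights)
    best = (0, ())
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            if sum(costs[q] for q in subset) <= cpu_budget:
                gain = sum(weights[q] for q in subset)
                if gain > best[0]:
                    best = (gain, subset)
    return best


# Weights from the slide; unit costs and a budget of 2 queries per window.
weights = {"Q1": 3, "Q2": 2, "Q3": 4}
costs = {"Q1": 1, "Q2": 1, "Q3": 1}
utility, kept = best_query_subset(weights, costs, cpu_budget=2)
```

With these inputs the search keeps Q1 and Q3 for a utility of 7 and sheds Q2, which agrees with the strategy stated on the slide.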
Resource Constraint: Limited Memory • Not all events can be kept in memory • E.g., 4 events need to be kept in memory (per 4 time units) • Because the query window is 4 • Suppose memory = 3 (per 4 time units) • Best strategy: keep B, C, D and discard A: U = W(Q2) + W(Q3) = 6 • Discard D? U = W(Q1) + W(Q2) = 5 • Discard B? U = W(Q3) = 4; Discard C? U = W(Q1) = 3 • Example stream: a1 b2 c3 d4 a5 b6 c7 d8 • Q1 = SEQ4(A, B), W(Q1)=3 • Q2 = SEQ4(B, C), W(Q2)=2 • Q3 = SEQ4(C, D), W(Q3)=4 • [Figure: the stream with the Q1, Q2, and Q3 matches and their weights]
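The memory-bound example can be checked the same way. The following sketch assumes, for illustration, that each event type occupies one memory slot per window, that each query has one expected match per window, and that a query contributes its weight only when every event type it needs is kept. The name `best_types_to_keep` and the data layout are hypothetical.

```python
from itertools import combinations


def best_types_to_keep(queries, type_sizes, memory_budget):
    """Return (utility, kept event types) for integral, type-level shedding.

    queries: name -> (tuple of required event types, weight)
    type_sizes: event type -> memory needed per window (assumed known)
    """
    types = sorted(type_sizes)
    best_utility, best_kept = 0, ()
    for r in range(len(types) + 1):
        for kept in combinations(types, r):
            if sum(type_sizes[t] for t in kept) > memory_budget:
                continue
            kept_set = set(kept)
            # A query survives only if none of its event types is shed.
            utility = sum(w for needed, w in queries.values()
                          if set(needed) <= kept_set)
            if utility > best_utility:
                best_utility, best_kept = utility, kept
    return best_utility, best_kept


queries = {"Q1": (("A", "B"), 3), "Q2": (("B", "C"), 2), "Q3": (("C", "D"), 4)}
type_sizes = {"A": 1, "B": 1, "C": 1, "D": 1}
utility, kept = best_types_to_keep(queries, type_sizes, memory_budget=3)
```

For a budget of 3 the search keeps B, C, D with utility 6, matching the slide. Unlike the CPU case, dropping one event type can disable several queries at once, which hints at why the memory-bound variants are the harder ones in the taxonomy.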
Integral Memory-bound LS (IMLS) • Negative results • NP-hard • Unlikely to be approximated within • Unless 3SAT • Reduction from Densest k-Sub-Hypergraph [1] • [1] Hajiaghayi, et al. The minimum k-colored subgraph problem in haplotyping and DNA primer selection. Bioinformatics Research and Applications, 2006
Integral Memory-bound LS (IMLS) • Positive results • A general bi-criteria approximation for utility loss minimization • optimal loss with budget • () bi-criteria approximation: utility loss is at most using memory • LP-rounding based algorithm
Integral Memory-bound LS (IMLS) • Positive results (cont’d) • Another approximate special case: • If the memory can hold at least 1/f number of queries • memory capacity is reasonably large • An event can be in at most number of queries • A -approximation algorithm • For utility gain maximization • Use Knapsack-like approach
Integral Memory-bound LS (IMLS) • Positive results (cont’d) • Pseudo-polynomial-time solvable special case • Multi-tenant CEP applications, co-locating on same server • Disjoint events for each application • Each application has no more than events • IMLS can be solved in time O() • : total # of events • : total # of queries • M: memory budget
Fractional Memory-bound LS (FMLS) • Negative result: • NP-hard even if each query has exactly two events • Positive result: • relative-approximation for utility gain maximization • If memory requirement of each event type exceeds total budget • controls precision (, ) • max number of event in a query • Use a grid-based approach on Simplex [2] [2] de Klerk, et al. A PTAS for the minimization of polynomials of fixed degree over the simplex. Theoretical Computer Science, 2006
Integral CPU-bound LS (ICLS) • Negative result • NP-complete • Positive result: • Admits an FPTAS: rounding off least significant bits • Use knapsack results • ICLS is an easy load shedding variant
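Since the slide points to knapsack results, the textbook value-scaling FPTAS for 0/1 knapsack gives a feel for the "rounding off least significant bits" idea. This is a sketch under the assumption that ICLS is treated as a knapsack over queries, with expected utility as value and CPU cost as weight; it is the standard knapsack scheme, not necessarily the exact algorithm in the paper, and `knapsack_fptas` is a name chosen here.

```python
def knapsack_fptas(values, costs, budget, eps):
    """Pick item indices with total cost <= budget and value >= (1 - eps) * OPT.

    Standard value-scaling scheme: round values down to multiples of
    eps * vmax / n, then run an exact DP over the rounded values.
    """
    items = [i for i in range(len(values)) if costs[i] <= budget and values[i] > 0]
    if not items:
        return []
    vmax = max(values[i] for i in items)
    scale = eps * vmax / len(items)
    scaled = {i: int(values[i] // scale) for i in items}  # drop low-order precision

    # best[v] = (minimum cost, chosen items) achieving scaled value v
    best = {0: (0, [])}
    for i in items:
        # Iterate over a snapshot so each item is used at most once.
        for v, (c, chosen) in list(best.items()):
            nv, nc = v + scaled[i], c + costs[i]
            if nc <= budget and (nv not in best or nc < best[nv][0]):
                best[nv] = (nc, chosen + [i])
    return best[max(best)][1]


# Queries Q1, Q2, Q3 from the running example, unit costs, CPU budget 2.
kept = knapsack_fptas(values=[3, 2, 4], costs=[1, 1, 1], budget=2, eps=0.1)
```

On the running example the routine keeps items 0 and 2 (Q1 and Q3). The DP table has at most n * (n / eps) entries, which is where the running time polynomial in both n and 1/eps comes from.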
Fractional CPU-bound LS (FCLS) • Positive result • Can be written as a simple Linear Program • Polynomial time solvable • FCLS is the easiest load shedding variant
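The slide does not spell out the linear program. As a hedged illustration, suppose it has the simplest possible shape: one keep-fraction x_i in [0, 1] per query, objective sum of W(Qi) * x_i, and a single CPU constraint sum of c_i * x_i <= C. That form is an assumption made here. An LP of this shape is a fractional knapsack, which a greedy pass over utility-per-cost ratios solves exactly; the function name `fractional_shedding` is hypothetical.

```python
def fractional_shedding(weights, costs, cpu_budget):
    """Return keep-fractions x_i maximizing sum(w_i * x_i) s.t. sum(c_i * x_i) <= budget.

    Greedy by weight/cost ratio, optimal for this single-constraint LP.
    Assumes all costs are strictly positive.
    """
    order = sorted(range(len(weights)),
                   key=lambda i: weights[i] / costs[i], reverse=True)
    x = [0.0] * len(weights)
    remaining = cpu_budget
    for i in order:
        if remaining <= 0:
            break
        x[i] = min(1.0, remaining / costs[i])  # keep as much of query i as fits
        remaining -= x[i] * costs[i]
    return x


# Running example with a CPU budget of 2.5 unit-cost queries per window.
x = fractional_shedding(weights=[3, 2, 4], costs=[1, 1, 1], cpu_budget=2.5)
expected_utility = sum(w * f for w, f in zip([3, 2, 4], x))
```

Here Q3 and Q1 are kept in full and Q2 is sampled at rate 0.5, for an expected utility of 8. With several constraints the greedy shortcut no longer applies and a general LP solver would be needed.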
Integral Dual-bound LS (IDLS) • Negative result: • NP-hard & inapproximable • Same as IMLS • Positive result: • A tri-criteria approximation • optimal loss with memory budget & CPU budget • At most utility loss using memory & CPU • LP-rounding based algorithm
Fractional Dual-bound LS (FDLS) • Negative result: • NP-hard even if each query has exactly two events • Same as FMLS since FDLS is a special case • Approximation: open problem • Non-convex optimization subject to non-convex constraints • We didn’t find good techniques for this
Conclusion and Future Work • Study the old problem of load shedding in the new context of CEP • Investigate six problem variants • Hardness & approximation (more results in the paper) • A rich problem with more to study • Delayed variants: instance-level optimization • Query language beyond positive event occurrence
Thank you! Questions?