280 likes | 396 Vues
Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks. Yaxiong Zhao and Jie Wu The speaker will be graduating next summer. Outline. Background Content-based pub/sub networks Counting algorithm The design Problem formulation
 
                
                E N D
Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks Yaxiong Zhao and Jie Wu The speaker will be graduating next summer
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Content-based pub/sub networks Pub/sub provides hustle-free messaging between users Title = {xyz} ᴧconference = ICDCS ᴧ keywords = {content} Content-based pub/sub (CBPS) provides yet more expressiveness Message • Content representation model: how to describe contents? • Boolean expression model • Attribute constraints: a primitive description of the constraints on a attribute’s value • Asubscription/filter is defined as a conjunction of multiple attribute constraints • Each message has multiple attribute assignments Pub/sub routing Subscription conference = ICDCS ᴧ keywords ∈{content, pub, sub}
Preliminary: counting algorithm • Since each subscription is a conjunctive form of multiple attribute constraints • A message matches a subscription i.f.f. it matches all attribute constraints of the subscription • Counting algorithm works by counting the number of matched attribute constraints • Matches if the number is equal to the number of all the attribute constraints of the subscription Title = {xyz} ᴧconference = ICDCS ᴧ keywords = {content} Matches 2 attribute constraints so it matches the subscription conference = ICDCS ᴧ keywords ∈{content, pub, sub}
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Dissecting the counting algorithm • The algorithm goes through two steps • Retrieving matched attribute constraints • The most time consuming stage • The focus of this paper • Comparing the number of matched constraints with the number of constraints for each subscription and return matched subscription • It’s possible to shortcut this process
Optimizing the retrieval stage • The naïve approach, i.e. examining all attribute constraints, only works for very small amount • A few thousand • More intelligent techniques • Binary search to eliminate unmatched constraints • SIENA (Sigcomm’03) • Clustering possibly-matched constraints • Faster matching (Sigmod’01)
Range-based attribute constraints • Range-based attribute constraints represent an cell that an attribute’s value should be in • Can be seen as a conjunction of two primitive attribute constraints • General enough to replace primitive attribute constraints • A primitive constraint can be translated to a range-based constraint • Range-based constraints are highly desirable Height > 100 & Height < 200  Height ∈ (100, 200) Height < 200 => Height > 0 & Height < 200
Basic idea • Reverse indexing of subscriptions • Instead of check each subscription to see what assignments match it • Directly retrieve the matched attribute constraints for each attribute assignment We need a data structure to mapping between attribute value and attribute constraints
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Reverse indexing • Indexing subscriptions through their represented ranges for each attribute • [100 < height < 200] • [100, 200] <-> subscription1 • Given an assignment “height = 150” we can find all its matched subscriptions by find the ranges containing this value
Discretization • Represent an arbitrary range by evenly separated cells • Mapping between value and its corresponding cell is fast • Results in false positive/negative (approximate matching) • We only accept false positives • Guarantee user satisfaction [0, 1, 2] is used to represent this attribute constraint
Index tree and subscription discretization • The naïve discretization has a scalability issue • The worst-case false positive is • 2 * cell-length / range-length • To achieve a false positive of p • The number of cell is 1/p • An analog is counting numbers • Need exactly “n” tokens to represent the number “n”
A very simple remedy • Just like counting in positional notation • Binary separating the attribute value space • Each cell is evenly divided into two in the next level • Log(1/p) cells • Much better scalability {level 2 : 1; level 3: 1, 4; level 4 : 1, 10}
Working with counting algorithm • Each range attribute constraint is discretized on multiple levels • Each cell ID associates with a subscription ID • Retrieval stage • Table lookup • Counting stage • Incremental the counters for subscription IDs • Comparing counters’ values to the number of attribute constraints of subscription cell IDs Attributes Subscription IDs levels
Implementation • Data structure organization • Matching table: for discretized cell ID and subscription ID mapping • Subscription ID and attribute constraint counts mapping • Linear table • More details are in the paper
Dynamic matching for shortcutting matching process • In many situations we do not need to find all matched subscriptions • Interface matching • Stop matching once any one of the associated subscription matches • Just examine a fraction of the discretization levels
Optimal binary separation • The above analysis assumes uniform distribution of the attribute values of events • The analysis holds for non-uniform distribution only when • Event values are evenly distributed on each cell • I.e. the number of events fall into each cell is the same • Optimal binary separation does this • Bisecting a range at its median • Ensure that each cell contains the same amount of events If 90% event’s attribute values fall into here
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Eliminating false positives • Two types: interface false and subscription • Matches a wrong interface • Matches a wrong subscription • Situations that no false positive occurs • Interface/subscriptions matched before the last discretization level • Can be used to short-cutting interface matching • To eliminate false positives • Double check the subscriptions that are matched at the last discretization level
Outline • Background • Content-based pub/sub networks • Counting algorithm • The design • Problem formulation • Index tree based discretization • Potential improvements • Performance evaluation
Experiment settings • A working prototype written in C++ • As a forwarding component in Siena • Total number of attributes is 1,000 • Number of attributes per event 100 – 1,000 • Relative width of attribute constraints 0.01 – 0.1
Subscription matching time (degenerate case) • With of attribute constraints: 0.01 (relative to the entire value space) • 10 range constraints per subscription • Each event has 100 attribute assignments • 3.23ms to return all matched subscription for an event with 20 million attribute constraints • Orders of magnitude faster than Siena
Interface matching speeding w/ vs w/o shortcutting • Fix the total number of subscriptions to 20,000 • Vary the number of subscriptions (filters) per interface • Present the changes of interface matching with two different widths of attribute constraints Relative value of the width of attribute constraints
Subscription matching FPR • Stores 20,000 subscriptions • Vary the width of attribute constraints and the number attributes per subscription • 108 events • Feed 10,000 events to the matching table
Interface matching FPR • 20,000 subscriptions • 2,000 subscriptions per interface • Change the width of range constraints and the number of attributes per subscription
Q&A Thank you for listening! yaxiong.zhao@temple.edu Drop me an Email if your questions are not answered here