
Protocol-level Reconfigurations for Autonomic Management of Distributed Network Services






Presentation Transcript


  1. Protocol-level Reconfigurations for Autonomic Management of Distributed Network Services. K. Ravindran and M. Rabby, Department of Computer Science, City University of New York (City College), ravi@cs.ccny.cuny.edu. Presented at the DANMS-NOMS 2012 conference (Maui, Hawaii), 16th April 2012.

  2. Organization of presentation • Service model to accommodate application adaptations when network and environment conditions change • Protocol-level control of QoS provisioning for applications • Dynamic protocol switching for adaptive network services • Meta-level management model for protocol switching • Case study of distributed network applications (replica voting for adaptive QoS of information assurance) • Open research issues

  3. OUR BASIC MODEL OF SERVICE-ORIENTED NETWORKS

  4. Adaptive distributed applications (e.g., airborne police networks, edge-managed Internet paths) have the ability to: • Determine the QoS received from the system infrastructure • Adjust their operational behavior by changing QoS expectations. [Figure: the application runs atop a service-oriented protocol over the system infrastructure; the infrastructure notifies resource changes and QoS offerings, the external environment injects incidences of hostile conditions, and the application adjusts its QoS expectation in response.]

  5. Service-oriented distributed protocols: run-time structure. A protocol P(S) exports only an interface behavior to client applications, hiding its internal operations on the infrastructure resources from clients. [Figure: applications access service S through agents implementing the service interface for S; these agents map the internal protocol state onto the service interface state; asynchronous processes p-1, p-2, p-3 implementing P(S) exchange signaling messages and exercise the resources {rA, rB, rC, ..} in a distributed realization of the infrastructure.] {rA, rB, rC, ..}: resource control capabilities --- e.g., placement of mirror sites in a CDN. {q-a, q-b, ..}: QoS parameter space --- e.g., content access latency in a CDN.

  6. PROTOCOL !! What is our granularity of network service composition? • A protocol exports only an interface behavior to client applications, hiding its internal operations on the infrastructure resources from clients • Examples: 1. 'reliable data transfer' service → TCP is the underlying protocol; 2. 'data fusion' service → multi-sensor voting is the underlying protocol; 3. 'wide-area content distribution' → content push/pull across mirror sites is the underlying protocol • Given a network application, different types/variants of protocols are possible (they exercise network resources in different ways, while providing a given service) • A protocol good in one operating region of the network may not be good in another region: "one size does not fit all" → choose an appropriate protocol based on the currently prevailing resource and environment conditions (dynamic protocol switching; see the sketch below)
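
  The following is a minimal sketch (not from the paper) of what such a dynamic protocol switch could look like: a selector re-evaluates the candidate protocols against the observed operating condition and rebinds the service to the cheapest one. The cost curves and the ProtocolSwitch class are hypothetical placeholders.

```python
# Hypothetical sketch of dynamic protocol switching: each candidate
# protocol advertises a projected cost for the current operating region,
# and the manager binds the service to the cheapest candidate.

from typing import Callable, Dict

class ProtocolSwitch:
    def __init__(self, candidates: Dict[str, Callable[[float], float]]):
        # candidates: protocol name -> cost(e) under environment condition e
        self.candidates = candidates
        self.active = None

    def reevaluate(self, e: float) -> str:
        """Bind to the protocol with the lowest projected cost under e."""
        best = min(self.candidates, key=lambda name: self.candidates[name](e))
        if best != self.active:
            print(f"switching protocol: {self.active} -> {best}")
            self.active = best
        return self.active

# Example from slide 35: 'go-back-N' is better at low packet loss,
# 'selective repeat' at high packet loss (cost curves are illustrative).
switch = ProtocolSwitch({
    "go-back-N":        lambda loss: 1.0 + 40.0 * loss,
    "selective-repeat": lambda loss: 2.0 + 5.0 * loss,
})
for loss_rate in (0.01, 0.05, 0.10):
    switch.reevaluate(loss_rate)
```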

  7. Management view of distributed protocol services. [Figure: the client application invokes service S(a) with desired QoS parameters a, and matches the QoS achieved (a') against the desired QoS (a) at the service interface (realized by agents). A service-level management module (SMM) holds reconfiguration policies and adaptation rules, and decides on protocol selection, QoS-to-resource mapping, etc. Through service binding, the NETWORK SERVICE PROVIDER invokes one of the protocols P1(S), P2(S), ... capable of providing service S; the distributed processes pi1, pi2, pi3, ... of protocol Pi(S) (i=1,2) exercise the INFRASTRUCTURE RESOURCES with allocation r = F(a,e), under a hostile external environment (e).]

  8. Modeling of environment. QoS specs a, protocol parameters par, and network resource allocation R are usually controllable inputs. In contrast, environment parameters e ∈ E* are often uncontrollable and/or unobservable, but they do impact the service-level performance (e.g., component failures, network traffic fluctuations, etc). Environment parameter space: E* = E(yk) ∪ E(nk) ∪ E(ck), where E(yk) contains the parameters that the designer knows about, E(nk) the parameters that the designer does not currently know about, and E(ck) the parameters that the designer can never know about. Protocol-switching decisions face this uncertainty.

  9. What is the right protocol to offer a sustainable service assurance? Service goals: • Robustness against hostile environment conditions • Max. performance with currently available resources. These two goals often conflict with each other !! A highly robust protocol is heavy-weight, because it makes pessimistic assumptions about the environment conditions → the protocol is geared to operate as if system failures are going to occur at any time, and is hence inefficient under normal cases of operation. A protocol that makes optimistic assumptions about environment conditions achieves good performance under normal cases, but is less robust to failures → the protocol operates as if failures will never occur, and is only geared to recover from a failure after-the-fact (so, recovery time may be unbounded). We need both types of protocols, to meet the performance and robustness requirements.

  10. EXAMPLE APPLICATION 1: CONTENT DISTRIBUTION NETWORK

  11. CONTENT DISTRIBUTION NETWORK. [Figure, layered view of a CDN. Application layer: latency specs, content publish-subscribe, adaptation logic --- clients c1, c2, c3 subscribe to content pages p-a, p-b (sub(Pa), sub(Pb)); the content server R applies content updates and sends update messages U({x}) for pages {x}. L: latency specs given to the CDN system; L': latency monitored as system output, reported by a latency monitor through the content access service interface to the control logic. Service layer: adaptive algorithm for content push/pull to/from proxy nodes (pull pa,pb / push pa,pb). Network infrastructure: an overlay tree as the distribution topology, with node/network resources --- content-forwarding proxy nodes, content push/pull-capable proxy nodes, proxy-capable nodes with interconnections, and local access links to clients. Environment (E*): client traffic & mobility, content dynamics, etc.]

  12. Control dimensions. Management-oriented control of a CDN is exercisable at three levels: • application-level reporting & matching of QoS attributes (e.g., client-level latency adaptation, server-level content scaling) • adjusting the parameters of content access protocols (e.g., proxy placement, choosing a push or pull protocol) --- the focus of our study • infrastructure resource adjustment (e.g., allocating more link bandwidth, increasing proxy storage capacity, increasing physical connectivity)

  13. Client-driven update scheme (PULL protocol: time-stamps without server query). [Message sequence between client, proxy X(S), and server S: the client issues request(p), and the proxy answers content(p) from its local copy as long as the copy's local timestamp LTS matches the page's global timestamp GTS at the server. When the page changes at the server (GTS advances, e.g., 1 → 2 → 3), the server sends update_TS(p,GTS) to the proxy; on the next request(p) the proxy detects LTS < GTS, fetches the page with get_page(p)/update_page(p), updates its local copy (LTS := GTS), and then answers content(p).] lc: client access rate; ls: server update rate; the scenario is drawn for lc >> ls. A sketch of the proxy's timestamp check follows.
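
  Below is a small sketch of the PULL proxy's timestamp check, assuming the LTS/GTS bookkeeping named on the slide; the class structure and the server stub are hypothetical glue, not the paper's implementation.

```python
# PULL-proxy sketch (illustrative): the server announces page changes via
# update_TS; the proxy re-fetches a page only when its cached copy's local
# timestamp (LTS) lags the announced global timestamp (GTS).

class Server:
    def __init__(self):
        self.pages = {}                      # page -> (content, GTS)
    def publish(self, page, content):
        _, ts = self.pages.get(page, (None, 0))
        self.pages[page] = (content, ts + 1)
    def get_page(self, page):
        return self.pages[page]             # (content, GTS)

class PullProxy:
    def __init__(self, server):
        self.server = server
        self.cache = {}                      # page -> (content, LTS)
        self.gts = {}                        # page -> last announced GTS

    def update_ts(self, page, gts):          # server's update_TS(p, GTS)
        self.gts[page] = gts

    def request(self, page):
        cached = self.cache.get(page)
        if cached and cached[1] == self.gts.get(page, cached[1]):
            return cached[0]                 # local copy is up-to-date
        content, gts = self.server.get_page(page)   # stale/missing: pull
        self.cache[page] = (content, gts)
        return content

server, proxy = Server(), PullProxy(server)
server.publish("p", "v1"); proxy.update_ts("p", 1)
print(proxy.request("p"))                    # fetch, then serve 'v1'
server.publish("p", "v2"); proxy.update_ts("p", 2)
print(proxy.request("p"))                    # detects LTS < GTS -> 'v2'
```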

  14. Server-driven update scheme (PUSH protocol). [Message sequence between server S, proxy X(S), and client, drawn for lc << ls: each time page p changes at the server, the server sends update_page(p) to the proxy, so the proxy's local copy is always current; a client's request(p) is answered immediately from that copy with content(p). With lc << ls, many pushed updates are overwritten before any client reads them.]

  15. CDN service provider goals: • minimal service in the presence of resource depletions (say, fewer proxy nodes due to link congestion) • max. revenue margin under normal operating conditions. The server-driven protocol (PUSH) and the client-driven protocol (PULL) differ in their underlying premise about how current a page content p is when a client accesses p. PUSH is heavy-weight (due to its pessimistic assumptions) → it operates as if client-level accesses on p are going to occur at any time, and hence is inefficient when lc << ls. PULL is light-weight (due to its optimistic assumptions) → it operates as if p is always up-to-date, and hence incurs low overhead under normal cases, i.e., lc >> ls. A rate-based choice between the two modes is sketched below.
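
  One plausible switching policy (a design choice, not the paper's): prefer PUSH when reads dominate, so each pushed update is amortized over many local reads, and PULL when updates dominate, so unread updates are never shipped; this is consistent with the slide's observation that PUSH is inefficient when lc << ls. The hysteresis margin is an assumed detail added to avoid oscillating decisions.

```python
# Illustrative per-page choice between PUSH and PULL, driven by the
# observed client access rate (lc) and server update rate (ls).

def choose_update_scheme(lc: float, ls: float, current: str,
                         hysteresis: float = 1.5) -> str:
    """Prefer PUSH when reads dominate updates, PULL when updates dominate."""
    if current == "PULL" and lc > hysteresis * ls:
        return "PUSH"      # reads dominate: keep the proxy copy current
    if current == "PUSH" and ls > hysteresis * lc:
        return "PULL"      # updates dominate: fetch on demand instead
    return current         # inside the hysteresis band: do not switch

mode = "PULL"
for lc, ls in [(0.9, 1.0), (5.0, 1.0), (1.0, 4.0)]:
    mode = choose_update_scheme(lc, ls, mode)
    print(f"lc={lc}, ls={ls} -> {mode}")
```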

  16. Content distribution topology (simulated). [Figure: content updates flow from server R through content-distributing and content-forwarding nodes to clients, which issue read requests. Simulation setup: content size 2 Mbytes; link bandwidths set between 2 Mbps and 10 Mbps. Two plots compare PUSH and PULL as functions of the read request rate (0.0 to 0.1): normalized message overhead per read (0.0 to 2.5) and latency incurred per read (0 to 700 msec).]

  17. Situational-context based proxy protocol control. (a) Base topology G(V,E) (from a network map of US carriers) and distribution tree T(V',E') ⊆ G(V,E): |V'| = 280 nodes; 226 client clusters; average # of hops traversed by a client request: 4. (b) [Plot: normalized cost-measure (overhead), about 0.5 to 2.5, versus the percentage of nodes used as content-distributing proxies (5% to 30%), for two choices of A, the optimization algorithm employed for computing proxy placement: greedy and genetic.] [Figure: a context & situational assessment module consumes client demographics, cloud leases, QoE, node/link status, etc., and emits a parametric description of client workloads & QoS specs: request arrivals li, lj, lk from clients i, j, k (different content size/type), traffic bursts, node/link outages. Task events (based on the combined client request arrival specs) drive task planning & scheduling over the set of nodes & interconnects [G(V,E), costs, policy/rules, ..], with model-based estimation of overhead/latency via a CDN simulator (plug-in of a CDN model). The controller, given QoS specs g = [L,O], the error e = g - g' (g': observed QoS), and state feedback (node/link usage), places proxies V'' ⊆ V' to reduce e [on tree T(V',E') ⊆ G(V,E), using algorithm A] and schedules tasks to resources at the proxy nodes, with optimal methods for "facility placement" (greedy, evolutionary, ..).]
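
  The slide names greedy and evolutionary "facility placement" methods for computing proxy placement. The following is a toy greedy-placement sketch, assuming a cost model of total shortest-path hop distance from client clusters to their nearest proxy; the graph, cost model, and function names are all illustrative.

```python
# Greedy facility placement (illustrative): repeatedly add the proxy node
# that most reduces the total client-to-nearest-proxy hop distance.

def shortest_hops(adj, src):
    """BFS hop counts from src to every reachable node."""
    dist, frontier = {src: 0}, [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def greedy_placement(adj, clients, k):
    dists = {v: shortest_hops(adj, v) for v in adj}   # all-pairs hop counts
    chosen = set()
    for _ in range(k):
        def cost(candidate):
            placed = chosen | {candidate}
            return sum(min(dists[p][c] for p in placed) for c in clients)
        chosen.add(min(sorted(set(adj) - chosen), key=cost))  # best marginal gain
    return chosen

# Toy 5-node topology with client clusters attached at nodes 0 and 4.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(greedy_placement(adj, clients=[0, 4], k=2))     # -> {0, 4}
```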

  18. EXAMPLE APPLICATION 2: MULTI-SENSOR DATA FUSION

  19. Fault-tolerance in sensor data collection: the modified 2-phase commit protocol (M2PC). [Figure, layered view: the end-user gives the data fusion application a QoS-oriented spec --- data miss rate z: how often [TTC > D]? Below the data delivery service interface (data integrity & availability) sits the replica voting apparatus: voters 1..N (sensor devices maintained as replicas, with device heterogeneity and message security; e.g., radar units collecting raw data from the external world; one of them faulty in the figure) propose data d-1..d-N and send consent/dissent votes to a vote collator over a message-transport network. The replica voting protocol provides fault detection and asynchrony control, and the voting service delivers data (say, d-2, later) to the end-user. Environment (E*): device attacks/faults, network message loss, device asynchrony, etc.] Notation: YES/NO: consent/dissent vote; N: degree of replication; fm: max. # of devices that are assumed as vulnerable to failure (1 ≤ fm < N/2); fa: # of devices that actually fail (0 ≤ fa ≤ fm); D: timeliness constraint on data; TTC: observed time-to-deliver data.

  20. Control dimensions for replica voting. Protocol-oriented: 1. How many devices to involve 2. How long the messages are 3. How long to wait before asking for votes ... QoS-oriented: 1. How much information quality to attain 2. How much energy to spend in the wireless voting devices ... System-oriented: 1. How good the devices are (e.g., fault-severity) 2. How accurate and resource-intensive the algorithms are ...

  21. A voting scenario under faulty behavior of data collection devices. Devices = {v1,v2,v3,v4,v5,v6}; faulty devices = {v3,v5}; fa=2; fm=2. [Timeline of one voting round (K: # of voting iterations; 4 in this scenario), with three data proposals written into the buffer --- bad data by v3, good data by v1, good data by v6 --- before good data is finally delivered from the buffer. Observed vote patterns across the iterations: v1,v2,v4,v6 dissent while v5 consents (collusion-type failure by v3 and v5 to deliver bad data); v3,v5 dissent while v1,v2,v4 consent (collusion-type failure by v3 and v5 to prevent delivery of good data); v2,v4 dissent while v6,v5 consent, with an omission failure at v3 (random behavior of v3 and v5). Had v3 also consented, good data delivery would have occurred at time-point A; had v3 proposed good data, correct data delivery would have occurred at time-point B. Proposal attempts: attempt 1 (data ready at v6 but not at v2,v4), attempt 2 (data ready at v2,v4 as well), attempt 3. Message overhead (MSG): [3 data, 14 control] messages; TTC: time-to-complete the voting round.] • Malicious collusions among faulty devices: lead to an increase in TTC (and hence reduce data availability [1-z]) and incur a higher MSG (and hence expend more network bandwidth B).

  22. Observations on the M2PC scenario. • Large # of control message exchanges: worst-case overhead = (2fm+1)·N messages [too high when N is large, as in sensor networks] → not desirable in wireless network settings, since excessive message transmissions incur a heavy drain on the battery power of the voter terminals. In the earlier scenario of N=6 and fa=2: # of YES messages = 7, # of NO messages = 12. • Integrity of data delivery is guaranteed even under severe failures (i.e., a bad data is never delivered). → We need solutions that reduce the number of control messages generated. A sketch of one M2PC-style iteration follows.
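
  For concreteness, here is a minimal sketch of one M2PC-style voting iteration in which every remaining voter is polled. The delivery threshold (fm+1 matching consents including the proposer, per slide 23) and the discard threshold (N-fm dissents, per slide 30) follow the slides; everything else is assumed simplification, not the paper's code.

```python
# One M2PC-style voting iteration (illustrative). Assumed rules: deliver
# once fm+1 voters (incl. the proposer) consent; discard once N-fm dissent.

def m2pc_iteration(proposal, voters, fm):
    """voters: vote functions of the N-1 non-proposers (True=YES, False=NO)."""
    N = len(voters) + 1
    num_yes, num_no = 1, 0            # the proposer implicitly consents
    for vote in voters:               # poll every remaining voter (ALLV style)
        if vote(proposal):
            num_yes += 1
        else:
            num_no += 1
        if num_yes >= fm + 1:
            return "deliver"          # enough matching consents: commit
        if num_no >= N - fm:
            return "discard"          # data cannot be good: next iteration
    return "discard"

# Example: N=6, fm=2; two good voters corroborate the proposal.
others = [lambda d: True, lambda d: True, lambda d: False,
          lambda d: False, lambda d: False]
print(m2pc_iteration("d1", others, fm=2))   # -> deliver
```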

  23. Solution 1: selective solicitation of votes. Poll only fm voters at a time (the buffer manager B specifies the voter list as a bit-map). Sample scenario for M2PC: N=5, fm=1 (→ need YES from 2 voters, including the proposer); actual # of faulty voters: fa=1. [Two message diagrams over B and voters v1..v5, with data d proposed by v2 and v4 faulty: the ALLV protocol (pessimistic scheme) solicits all remaining voters at once with vote(d,{1,3,4,5}) --- drawing wasteful messages beyond the needed consents !! --- and expends 5 messages total in K=1 iteration before delivering d to the end-user; the SELV protocol (optimistic scheme) first sends vote(d,{4}) and gets N from the faulty v4, then sends vote(d,{3}) and gets Y, expending 4 messages total in K=2 iterations before delivering d to the end-user.] A sketch of the SELV polling loop follows.
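
  The following sketch illustrates the SELV idea: solicit fm voters per iteration and stop as soon as enough consents accumulate. The message accounting and data structures are assumed for illustration and need not match the slide's exact counts.

```python
# SELV-style selective vote solicitation (illustrative). Instead of
# soliciting all N-1 remaining voters at once (ALLV), poll fm voters per
# iteration, stopping once fm+1 consents (incl. the proposer) accumulate.

def selv_round(proposal, voters, fm):
    """voters: dict name -> vote function. Returns (decision, msgs, K)."""
    num_yes, msgs, K = 1, 1, 0        # proposer consents; 1 data message
    pending = list(voters)
    while pending and num_yes < fm + 1:
        batch, pending = pending[:fm], pending[fm:]   # bit-map of fm voters
        K += 1
        for name in batch:
            msgs += 2                 # one vote request + one reply (assumed)
            if voters[name](proposal):
                num_yes += 1
    return ("deliver" if num_yes >= fm + 1 else "discard"), msgs, K

# N=5, fm=1: the faulty v4 is polled first and dissents; v3 then consents.
voters = {"v4": lambda d: False, "v3": lambda d: True,
          "v1": lambda d: True, "v5": lambda d: True}
print(selv_round("d", voters, fm=1))   # -> ('deliver', 5, 2)
```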

  24. Analytical results (Λ: mean number of voting iterations per round):
  N=5: Λ(ALLV) = 4.0; Λ(SELV) = 1.50 (fm=1), 4.50 (fm=2)
  N=6: Λ(ALLV) = 5.0; Λ(SELV) = 1.40 (fm=1), 3.90 (fm=2)
  N=7: Λ(ALLV) = 6.0; Λ(SELV) = 1.33 (fm=1), 3.53 (fm=2), 6.95 (fm=3)
  N=8: Λ(ALLV) = 7.0; Λ(SELV) = 1.29 (fm=1), 3.29 (fm=2), 6.20 (fm=3)
  N=9: Λ(ALLV) = 8.0; Λ(SELV) = 1.25 (fm=1), 3.11 (fm=2), 5.70 (fm=3), 9.16 (fm=4)
  Note that Λ(ALLV) = N-1 throughout, while Λ(SELV) stays far lower for small fm but overtakes ALLV as fm approaches N/2 (e.g., 6.95 > 6.0 at N=7, fm=3).

  25. Solution 2: the Implicit Consent, Explicit Dissent (IC-M2PC) mode of voting. NO NEWS IS GOOD NEWS !! A voter consents by keeping quiet, and dissents by sending a NO message (in the earlier scenario, a saving of 7 YES messages). IC-M2PC employs implicit forms of vote inference, whose accuracy depends on the choice of the vote solicitation time. The IC-M2PC mode lowers the control message overhead significantly when: • s(Tp) is small → many voters generate data at around the same time Tp • fm << N/2 → only a very few voters are bad (but we don't know which ones !!). Worst-case control message overhead: O(fm·N^c) for 0 < c < 1.0.
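
  A sketch of the implicit-consent decision, following slide 30's rule that the collator waits out a 2·Tnet window and counts only explicit dissents; the function boundaries are assumed glue.

```python
# IC-M2PC vote inference (illustrative): within the vote window, silence
# counts as consent and only dissents are transmitted.

def ic_m2pc_decide(dissents_received: int, N: int, fm: int) -> str:
    """Decision at the vote collator once the 2*Tnet timeout expires."""
    if dissents_received >= N - fm:
        return "discard"             # too many explicit NOs: data is bad
    # fewer than N-fm dissents: the implicitly perceived consents suffice,
    # so tentatively commit (pending the history-vector check of slide 28)
    return "tentative-commit"

print(ic_m2pc_decide(dissents_received=1, N=10, fm=2))   # tentative-commit
print(ic_m2pc_decide(dissents_received=8, N=10, fm=2))   # discard
```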

  26. Protocol-level performance and correctness issues. Under strenuous failure conditions, the basic form of IC-M2PC entails safety risks (i.e., the possibility of delivering incorrect data). Correctness problems may occasionally occur !! Normal-case performance is meaningless unless the protocols are augmented to handle them.

  27. IC-M2PC mode vs the M2PC mode (reference protocol). Tnet: maximum message transfer delay; the IC-M2PC vote window is 2·Tnet. [Message diagrams over a buffer manager and voters 1 (good), 2 (good), 3 (bad), with proposals d1, d2, d' pending: under M2PC, the buffer manager broadcasts VOT_RQ(d') and receives the good voters' explicit NO votes, so it decides to not deliver d' to the user. Under IC-M2PC, the good voters' NO messages are lost or late, their silence is misread as consent, and the manager decides to deliver d' to the user --- a 'safety' violation !!] The optimistic protocol (i.e., 'NO NEWS IS GOOD NEWS'): • is very efficient when message loss is small, delays have low variance, and fm << N/2 --- as in normal cases • needs voting history checks after every M rounds (M > 1) before actual data delivery. Message overhead O(N·fm/M) with a somewhat high TTC for IC-M2PC, versus message overhead O(N^2) with a low TTC for M2PC.

  28. Dealing with message loss in IC-M2PC mode. How to handle sustained message loss that prevents voter dissents from reaching the vote collator?? • Make tentative decisions on commit, based on the implicitly perceived consenting votes • Use the aggregated 'voting history' of the voters over the last M rounds to sanitize results before the final commit (M > 1): 1. If the voting history (obtained as a bit-map) does not match the implicitly perceived voting profile of the voters, B suspects a persistent message loss and hence switches to the M2PC mode. 2. When YES/NO messages start getting received without persistent loss, B switches back to the IC-M2PC mode. → Batched delivery of M "good" results to the user; a "bad" result never gets delivered (integrity goal). A sketch of this check follows.
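
  A hedged sketch of the history-vector sanitization: compare each voter's reported per-round votes against the profile the collator inferred implicitly, discard the rounds where they disagree, and fall back to M2PC when the disagreement looks persistent. The encodings ('y'/'n'/'*') follow slide 29; the persistence heuristic is assumed.

```python
# History-vector sanitization (illustrative). Each voter reports its votes
# for the last M rounds; '*' means the voter was unaware of the round
# (its message was lost). Rounds whose inferred voting profile disagrees
# with any reported history are discarded before batched delivery.

def sanitize(inferred, histories, M):
    """inferred[r]: dict voter -> 'y'/'n' as perceived by the collator.
    histories: dict voter -> list of M entries in {'y','n','*'}.
    Returns (rounds to deliver, rounds to discard, switch_to_m2pc)."""
    deliver, discard = [], []
    for r in range(M):
        ok = all(hist[r] == "*" or hist[r] == inferred[r][voter]
                 for voter, hist in histories.items())
        (deliver if ok else discard).append(r)
    # assumed heuristic: mostly-mismatched batch => suspect sustained loss
    return deliver, discard, len(discard) > M // 2

# Round 2's dissent from V-x was lost, so the collator inferred 'y' there.
inferred  = [{"V-x": "y", "V-z": "y"}, {"V-x": "y", "V-z": "y"},
             {"V-x": "y", "V-z": "y"}, {"V-x": "y", "V-z": "y"}]
histories = {"V-x": ["y", "n", "y", "y"], "V-z": ["*", "y", "y", "y"]}
print(sanitize(inferred, histories, M=4))  # ([0, 2, 3], [1], False)
```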

  29. "History vector" based sanitization of results. d-i: result tentatively decided in round i under the IC-M2PC mode; in the history vectors, y: YES, n: NO, *: voter was unaware of the voting (due to message loss). [Timeline over voters V-x, V-y (faulty), V-z and the buffer manager: Rounds 1-4 (IC-M2PC, M=4, sporadic message loss; say, the dissent from V-x in round 2 was lost): checking the history vectors [y,n,y,y] and [*,y,n,n] exposes the incorrect tentative decision, and B delivers d1, d3, d4 and discards d2. Rounds 5-6 (IC-M2PC, M=2; dissents from V-x and V-z lost in both rounds, plus an omission failure): the history vectors [n,n] lead B to discard d5 and d6, suspect a persistent message loss, and switch modes. Rounds 7-9 (M2PC): consent and dissent messages are not lost (so, the message loss rate has reduced).] Non-delivery of data in a round, such as d2, is compensated by data deliveries in subsequent rounds ('liveness' of the voting algorithm in real-time contexts).

  30. Control actions during voting. num_YES/num_NO: # of voters from which YES/NO responses are received for the data in the proposal buffer tbuf.
  M2PC mode: if (num_YES ≥ fm) deliver data from tbuf to user.
  IC-M2PC mode: upon timeout 2·Tnet since the start of the current voting iteration: if (num_NO < N-fm) optimistically treat data in tbuf as (tentatively) deliverable; if (# of rounds completed so far = M) invoke the "history vector"-based check for the last M rounds.
  Both M2PC and IC-M2PC: if (num_NO ≥ N-fm) discard data in tbuf; then, if (# of iterations completed so far < 2·fm) proceed to the next iteration, else declare a 'data miss' in the current round.
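
  The slide's rules, consolidated into one runnable decision function (illustrative; the parameter plumbing and state handling are assumed glue):

```python
# Consolidated voting-control actions (illustrative), evaluated on each
# vote arrival (M2PC) or on expiry of the 2*Tnet window (IC-M2PC).

def control_action(mode, num_yes, num_no, N, fm, iterations_done,
                   rounds_done=0, M=4, timed_out=False):
    if num_no >= N - fm:                       # both modes: data is bad
        if iterations_done < 2 * fm:
            return "discard tbuf; proceed to next iteration"
        return "discard tbuf; declare data miss"
    if mode == "M2PC" and num_yes >= fm:
        return "deliver data from tbuf"
    if mode == "IC-M2PC" and timed_out:        # 2*Tnet window expired
        action = "tentatively deliverable"
        if rounds_done == M:
            action += "; run history-vector check"
        return action
    return "wait"

print(control_action("M2PC", num_yes=2, num_no=1, N=6, fm=2,
                     iterations_done=1))
print(control_action("IC-M2PC", num_yes=0, num_no=1, N=6, fm=2,
                     iterations_done=1, rounds_done=4, timed_out=True))
```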

  31. Experimental study to compare M2PC and IC-M2PC. Setup: N=10; # of YES votes needed = 6; s(Tp)=50 msec; m(Tp)=50 msec; data size = 30 kbytes; control message size = 50 bytes; network loss l = 0%, 2%, and 4%. [Plots, one column per loss rate, versus fm (the assumed # of faulty devices, 1..4), comparing M2PC and IC-M2PC: TTC (in msec, roughly 150 to 450), DAT overhead (# of data proposals, roughly 1 to 3), and CNTRL overhead (votes, data/vote requests, etc., roughly 5 to 25 messages).]

  32. Analytical results of IC-M2PC from probabilistic estimates. N=10, Q=5, s(Tp)=50 msec, Tw=125 msec (Q: # of YES votes awaited in IC-M2PC mode). [Plot: data miss rate at the end-user level (z) × 10^2 %, from 0.0 to 0.25, versus message loss rate (l) × 10^2 %, from 0.0 to 0.08, with one curve per fm = 1..4.] To keep z < 2%, fm=1-3 requires l < 4%, while fm=4 requires l < 1.75%. This establishes the mapping of the agent-observed parameter z onto the infrastructure-internal parameters l and fa (fa: actual # of failed devices; we assume that fa=fm). [Second plot: sample switching between the M2PC (EXPLICIT) and IC-M2PC (IMPLICIT) modes over time (0 to about 1200 seconds): the number of messages (0 to 30) rises in EXPLICIT mode and falls in IMPLICIT mode, as sustained attacks and changes in the network state move the message loss rate between 0%, 2%, 6%, and 10%.]

  33. Situational-context based replica voting control. [Block diagram: the protocol designer supplies scripts & rules to a situation assessment module, which feeds the voting QoS manager with system inputs SI: external environment parameters (fault-severity, fm), data-oriented parameters (size, s, m), and the IC-M2PC/M2PC mode choice. The replica voting protocol --- buffer manager B and voters v1, v2, .., vN, exposed to E* --- serves the IA application and the user at a data delivery rate g = [1-z]. A controller observes the data miss rate z' (i.e., the system output g' = 1-z'), forms the error g-g', and adjusts B & N; a global application manager arbitrates against the QoS of other applications.]

  34. OUR MANAGEMENT MODEL FOR AUTONOMIC PROTOCOL SWITCHING

  35. MACROSCOPIC VIEW: a 'resource cost' based view of protocol behavior. F_p(a,e): policy function embodied in protocol p to support QoS a for service S, yielding the resource allocation r = F_p(a,e); a': actual QoS achieved with resource allocation r (a' ≤ a); a higher value of the external event parameter e → a more hostile environment. [Plot: normalized cost q_e(a') incurred by protocols p1(S(a)) and p2(S(a)) versus e, with the curves r = F_p1(a,e) and r = F_p2(a,e) crossing: protocol p1 is good on one side of the crossover, protocol p2 on the other.] Observations: • The resource allocation r = F(a,e) increases monotonically (convex) w.r.t. e • The cost function q_e(a') is based on the resource allocation r under environment condition e [assume q_e(a') = k·r for k > 0]. Example: for a 'reliable data transfer' service with e = packet loss rate in the network, the 'go-back-N' protocol is better at a lower packet loss rate and the 'selective repeat' protocol is better at a higher packet loss rate.
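
  A small numeric sketch of this cost view, under the slide's assumption q_e(a') = k·r with r = F_p(a,e); the two policy functions below are invented convex curves used only to exhibit a crossover.

```python
# Cost-based protocol comparison (illustrative). Per the slide, the cost
# of protocol p under environment condition e is q_e(a') = k * F_p(a, e).
# The concrete F_p curves are made up to show a crossover region.

k = 1.0
def cost(F_p, a, e):
    return k * F_p(a, e)

F_p1 = lambda a, e: a * (1.0 + 2.0 * e + 10.0 * e * e)   # cheap at low e
F_p2 = lambda a, e: a * (3.0 + 1.0 * e + 1.0 * e * e)    # flatter, robust

a = 1.0
for e in (0.0, 0.2, 0.4, 0.6, 0.8):
    better = "p1" if cost(F_p1, a, e) < cost(F_p2, a, e) else "p2"
    print(f"e={e:.1f}: q_p1={cost(F_p1, a, e):.2f} "
          f"q_p2={cost(F_p2, a, e):.2f} -> prefer {better}")
```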

  36. Degree of service (un)availability is also modeled as a cost: a penalty measure for "service degradation". [Plot: utility value of the network service u(a') (0.0 to 1.0) versus the service-level QoS enforced (a'), rising between Amin and Amax around the required level Areq; a higher value of a' → better QoS.] Net penalty assigned to the service = k1·[1-u(a,a')] + k2·q_e(a'), for k1, k2 > 0: the first term is the user displeasure due to the actual QoS a' being lower than the desired QoS a (penalty measured as user-level dissatisfaction), and the second term is the infrastructure resource cost for providing the service-level QoS a' [r = F(a',e)].
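
  A sketch of the net-penalty computation, with a piecewise-linear ramp standing in for the slide's utility curve between Amin and Amax; the ramp shape, k1, k2, and the resource-cost stand-in are all assumptions.

```python
# Net service penalty (illustrative): k1*[1 - u(a')] + k2*q_e(a').
# The linear utility ramp between A_min and A_max is an assumed stand-in
# for the slide's utility curve.

def utility(a_prime, A_min, A_max):
    if a_prime <= A_min:
        return 0.0
    if a_prime >= A_max:
        return 1.0
    return (a_prime - A_min) / (A_max - A_min)   # linear ramp

def net_penalty(a_prime, e, A_min=0.2, A_max=0.8, k1=1.0, k2=0.5):
    r = a_prime * (1.0 + e)      # assumed r = F(a', e): dearer when hostile
    q = r                        # q_e(a') = k*r with k = 1 (slide's form)
    return k1 * (1.0 - utility(a_prime, A_min, A_max)) + k2 * q

for a_prime in (0.3, 0.5, 0.7, 0.9):
    print(f"a'={a_prime}: penalty={net_penalty(a_prime, e=0.1):.3f}")
```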

  37. Optimal QoS control problem. Consider N applications (some of them mission-critical), sharing an infrastructure-level resource R with split allocations r1, r2, ..., rN. Minimize the sum of: • the total resource costs (split across the N applications) • the displeasure of the i-th application due to QoS degradation, i.e., the gap between ai' (the QoS achieved for the i-th application with a resource allocation ri) and ai (the desired QoS for the i-th application).
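
  A toy grid-search sketch of the split-allocation problem, using the penalty structure of slide 36 summed over applications subject to Σ ri ≤ R; the per-application QoS curves, weights, and grid are invented for illustration.

```python
# Toy split-allocation search (illustrative): minimize
#   sum_i ( k1_i * [1 - u_i(a_i'(r_i))] + k2 * r_i )  s.t.  sum_i r_i <= R
# by coarse grid search. QoS curves a_i'(r_i) and weights are made up.

from itertools import product

R, k2 = 10.0, 0.1
# (k1 weight, achieved QoS per unit resource); app 0 is "mission-critical"
apps = [(5.0, 0.15), (1.0, 0.25), (1.0, 0.10)]

def displeasure(k1, gain, r):
    a_prime = min(1.0, gain * r)     # assumed a_i'(r_i), saturating at 1.0
    return k1 * (1.0 - a_prime)      # utility u_i taken as a_i' itself

grid = [i * 0.5 for i in range(21)]  # candidate allocations 0.0 .. 10.0
best = min(
    (alloc for alloc in product(grid, repeat=len(apps)) if sum(alloc) <= R),
    key=lambda alloc: sum(displeasure(k1, g, r) + k2 * r
                          for (k1, g), r in zip(apps, alloc)),
)
print("optimal split r1, r2, r3:", best)
```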

  38. Policy-based realizations in our management model • 'Outsourced' implementation: the network application requests a policy-level decision from the management module (say, a business marketing executive or a military commander may be a part of the management module) • User-interactive implementation: the application-level user interacts with the management module to load and/or modify policy functions

  39. Design issues in supporting our management model • Prescription of 'cost relations' to estimate the projected resource costs of various candidate protocols • Development of 'protocol stubs' that map the internal states of a protocol onto the service-level QoS parameters • Strategies to decide on protocol selection to provide a network service • Engineering analysis of protocol adaptation/reconfiguration overheads and 'control-theoretic' stability during service provisioning (i.e., QoS jitter at end-users) • QoS and security considerations, wireless vs wired networks, etc.
