Increasing Intrusion Tolerance Via Scalable Redundancy

Mike Reiter (reiter@cmu.edu), Natassa Ailamaki, Greg Ganger, Priya Narasimhan, Chuck Cranor

Technical Objective
• To design, prototype, and evaluate new protocols for implementing intrusion-tolerant services that scale better
• Here, “scale” refers to efficiency as the number of servers and the number of failures tolerated grow
• Targeting three types of services:
  • Read-write data objects
  • Custom “flat” object types for particular applications, notably directories for implementing an intrusion-tolerant file system
  • Arbitrary objects that support object nesting

Expected Impact
• Significant efficiency and scalability benefits over today’s protocols for intrusion tolerance
• For example, for data services, we anticipate:
  • At least a twofold latency improvement over the current best, even at small configurations (e.g., tolerating 3-5 Byzantine server failures)
  • Improvements that grow as the system scales up
  • A twofold improvement in throughput, again growing with system size
• Without such improvements, intrusion tolerance will remain relegated to small deployments in narrow application areas

The Problem Space
• Distributed services manage redundant state across servers to tolerate faults
• We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client
  • A faulty server or client may behave arbitrarily
• We also make no timing assumptions in this work
  • An “asynchronous” system
• Primary existing practice: replicated state machines
  • Offers no load dispersion, requires data replication, and degrades as the system scales in terms of the number of messages

Evaluation
• Baseline for current work: the BFT library
  • Popular, publicly available implementation of Byzantine fault-tolerant state machine replication (by Castro & Liskov)
  • Reported to be an efficient implementation of that approach
• Two measures:
  • Average latency of operations, from the client’s perspective
  • Peak sustainable throughput of operations
• Our consistency definition: linearizability of invocations

Background - Read/Write protocol
• Servers provide a read/write block interface
• Servers version blocks on every write
• Decentralized, optimistic, scalable, Byzantine fault-tolerant
[Figure: a client writes a data block to eight versioning servers (D)]
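
A minimal, single-server Python sketch of what "servers version blocks on every write" means. The class and method names are hypothetical, timestamps are plain integers chosen by the caller, and the quorum logic and Byzantine checks of the real protocol are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class VersioningBlockServer:
    """Hypothetical server that keeps every version of every block."""

    # block_id -> list of (timestamp, data) pairs, in arrival order
    versions: dict[str, list[tuple[int, bytes]]] = field(default_factory=dict)

    def write(self, block_id: str, timestamp: int, data: bytes) -> None:
        # A write never overwrites earlier data; it adds a new version.
        self.versions.setdefault(block_id, []).append((timestamp, data))

    def read_latest(self, block_id: str) -> tuple[int, bytes]:
        # The current value is the version with the highest timestamp.
        return max(self.versions[block_id], key=lambda version: version[0])

    def read_version(self, block_id: str, timestamp: int) -> bytes:
        # Older versions remain readable by timestamp.
        for ts, data in self.versions[block_id]:
            if ts == timestamp:
                return data
        raise KeyError((block_id, timestamp))


server = VersioningBlockServer()
server.write("blk7", 1, b"old contents")
server.write("blk7", 2, b"new contents")
print(server.read_latest("blk7"))      # (2, b'new contents')
print(server.read_version("blk7", 1))  # b'old contents'
```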

R/W semantics
• R/W protocol appropriate for block storage
• But R/W protocol inappropriate for building general services
  • Doesn’t provide replicated state machine semantics
• A metadata service for an R/W-based block store motivated us to develop a protocol with stronger semantics

R/W semantics insufficient for metadata
• Consider 2 clients inserting a file into the same directory
• Last write wins: good for blocks, bad for directories
[Figure: clients A and B each write an updated copy of the same directory to the servers]
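
The lost update can be reproduced in a few lines of Python. The store below is a hypothetical single-server stand-in with plain read/write semantics, and the directory is modeled as a set of file names.

```python
class LastWriterWinsStore:
    """Hypothetical store with plain read/write semantics: the last write wins."""

    def __init__(self) -> None:
        self.entries: frozenset[str] = frozenset()  # directory as a set of names

    def read(self) -> frozenset[str]:
        return self.entries

    def write(self, new_entries: frozenset[str]) -> None:
        self.entries = new_entries  # unconditional overwrite


def concurrent_inserts(store: LastWriterWinsStore) -> frozenset[str]:
    # Clients A and B both read the directory before either writes.
    seen_by_a = store.read()
    seen_by_b = store.read()
    # Each adds its own file to its private copy and writes the copy back.
    store.write(seen_by_a | {"a.txt"})
    store.write(seen_by_b | {"b.txt"})  # silently discards A's insert
    return store.read()


print(sorted(concurrent_inserts(LastWriterWinsStore())))  # ['b.txt']
```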

Query/Update (Q/U) protocol
• A protocol with replicated state machine semantics
  • Provides linearizable query and update operations
• Protocol properties
  • Decentralized
  • Handles Byzantine client and server failures; asynchronous
  • Efficient common-case operation
    • Optimistic protocol leverages versioning servers
    • Single-phase queries and updates, if concurrency- and failure-free
    • Avoids expensive cryptography (digital signatures)
  • Scalable
    • Avoids server-to-server broadcast
  • Atomic multi-object updates

Outline
• Motivation
• Query/Update protocol
  • Overview
  • Query, update operations
  • Validation, object syncing, multi-object operations
• Evaluation

Read/conditional-write primitive
• Servers accept an update operation only if the object hasn’t been modified since the client read it
[Figure: clients A and B send conditional writes for the same directory to the servers]
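
The primitive can be sketched in Python for a single server, assuming a hypothetical per-object version counter; in Q/U the condition is expressed through the OHS across many servers rather than a single integer. Rejecting the stale write forces client B to re-read and retry, so neither insert is lost.

```python
class ConditionalWriteStore:
    """Hypothetical single-server sketch of the read/conditional-write primitive."""

    def __init__(self) -> None:
        self.version = 0
        self.entries: frozenset[str] = frozenset()  # directory as a set of names

    def read(self) -> tuple[int, frozenset[str]]:
        return self.version, self.entries

    def conditional_write(self, read_version: int, new_entries: frozenset[str]) -> bool:
        # Accept the write only if the object is unchanged since the caller read it.
        if read_version != self.version:
            return False
        self.version += 1
        self.entries = new_entries
        return True


store = ConditionalWriteStore()
version_a, dir_a = store.read()  # client A reads
version_b, dir_b = store.read()  # client B reads the same version
print(store.conditional_write(version_a, dir_a | {"a.txt"}))  # True
print(store.conditional_write(version_b, dir_b | {"b.txt"}))  # False: stale read
version_b, dir_b = store.read()  # B re-reads and retries
print(store.conditional_write(version_b, dir_b | {"b.txt"}))  # True
print(sorted(store.entries))  # ['a.txt', 'b.txt']
```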

Handling Byzantine clients
• For Byzantine fault tolerance, clients must pass the operation to the servers
  • Constrains clients to a narrow object interface
  • Servers apply the operation to the old object to validate the new object
[Figure: a client sends “directory + op” to the servers; each server applies Op to its copy of the directory]

Clients and objects
• Client just sends operations
  • Client does not read/write the object
• Server applies the operation to its local object history
[Figure: the client sends op; each server applies Op to its local copy]
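
To show why passing operations, rather than new object states, constrains Byzantine clients, here is a small Python sketch. The directory operations, their names, and the `ObjectServer` class are hypothetical. The point is that each server computes the new state from its own history with a deterministic function, so a client can only request operations that are in the interface.

```python
from typing import Callable

# Hypothetical narrow interface for a directory object. Every operation is a
# deterministic function of (current entries, argument).
DIRECTORY_OPS: dict[str, Callable[[frozenset[str], str], frozenset[str]]] = {
    "insert": lambda entries, name: entries | {name},
    "remove": lambda entries, name: entries - {name},
}


class ObjectServer:
    def __init__(self) -> None:
        self.history: list[frozenset[str]] = [frozenset()]  # local object history

    def apply(self, op: str, arg: str) -> frozenset[str]:
        # The client supplies only an operation name and argument; the server
        # derives the new version from its own latest version.
        if op not in DIRECTORY_OPS:
            raise ValueError(f"{op!r} is not part of the object interface")
        new_version = DIRECTORY_OPS[op](self.history[-1], arg)
        self.history.append(new_version)
        return new_version


servers = [ObjectServer() for _ in range(4)]
for server in servers:
    server.apply("insert", "a.txt")
    server.apply("insert", "b.txt")
    server.apply("remove", "a.txt")
# Same deterministic operations on the same history give identical states.
print({server.history[-1] for server in servers})  # {frozenset({'b.txt'})}
```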

Query/Update protocol
• Servers host objects
  • Optimistic protocol → versioning
• Export an operation interface (more than read/write)
  • Can export any deterministic operation
• Servers export three types of operations:
  • Read History(Object): returns a timestamp vector
  • Query(Object, Version): read-only; returns object state; e.g., getattr
  • Update(Object, OHS, Value): mutating; updates the object, conditioned on the object not having been modified; e.g., setattr
[Figure: a server hosting objects A, B, and C, each with its own sequence of versions]
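
A hypothetical Python sketch of the three operation types one server exports. It simplifies heavily: timestamps are integers, the OHS is reduced to the single timestamp the client conditions on, and, following the earlier slides on Byzantine clients, the update carries an operation that the server applies rather than a raw value.

```python
from typing import Any, Callable


class QUServer:
    """Hypothetical sketch of the three operation types a Q/U server exports.

    Each object is a list of (timestamp, state) versions, oldest first. For
    brevity, the OHS is reduced to the one timestamp the client conditions on.
    """

    def __init__(self) -> None:
        self.objects: dict[str, list[tuple[int, Any]]] = {}

    def create(self, obj: str, initial: Any) -> None:
        self.objects[obj] = [(0, initial)]

    def read_history(self, obj: str) -> list[int]:
        # Read History: the timestamps of every version held for the object.
        return [ts for ts, _ in self.objects[obj]]

    def query(self, obj: str, version: int) -> Any:
        # Query: read-only; returns the object state at the given version.
        for ts, state in self.objects[obj]:
            if ts == version:
                return state
        raise KeyError((obj, version))

    def update(self, obj: str, conditioned_on: int, op: Callable[[Any], Any]) -> bool:
        # Update: mutating, but only if no version newer than conditioned_on exists.
        latest_ts, latest_state = self.objects[obj][-1]
        if latest_ts != conditioned_on:
            return False
        self.objects[obj].append((latest_ts + 1, op(latest_state)))
        return True


server = QUServer()
server.create("file1", {"mode": 0o644})
print(server.update("file1", 0, lambda attrs: {**attrs, "mode": 0o600}))  # True (setattr-like)
print(server.read_history("file1"))  # [0, 1]
print(server.query("file1", 1))      # {'mode': 384} (getattr-like)
print(server.update("file1", 0, lambda attrs: attrs))  # False: version 1 exists
```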


Read history operation
• Client requests the version history of an object
• Each server replies with a list of timestamps
• The collected replies form the Object History Set (OHS)
[Figure: timeline of read-history and history-reply messages; the replies list versions 1 and 2]
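
A minimal Python sketch of how a client might assemble the OHS from history replies. The server names and reply contents are hypothetical, timestamps are simplified to integers, and the per-reply authenticators that protect the OHS are omitted.

```python
from collections import Counter


def build_ohs(replies: dict[str, list[int]]) -> dict[str, frozenset[int]]:
    # The OHS records, for each server that replied, the timestamps it reported.
    return {server: frozenset(history) for server, history in replies.items()}


def servers_per_timestamp(ohs: dict[str, frozenset[int]]) -> Counter:
    # How many replica histories in the OHS contain each timestamp.
    return Counter(ts for history in ohs.values() for ts in history)


replies = {"s1": [1, 2], "s2": [1, 2], "s3": [1, 2], "s4": [1, 2], "s5": [1]}
ohs = build_ohs(replies)
print(servers_per_timestamp(ohs))  # Counter({1: 5, 2: 4})
```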

Query operation
• Client performs the read-history operation
• Constructs the OHS and identifies Latest, the latest complete version
• Client queries the Latest version at the servers
[Figure: timeline of read-history/history-reply followed by query/query-reply messages, with the OHS marking version 2 as Latest]
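
The slide does not spell out when a version counts as "complete". The Python sketch below assumes a version is complete when at least q of the histories in the OHS contain it, and takes q as a parameter. With the N = 3t + 2b + 1 servers stated later in this talk, q = N - t = 2t + 2b + 1 is the largest number of replies a client can count on when t servers are faulty, so the example uses that value. Real Q/U classification also distinguishes repairable and incomplete versions, which this sketch ignores.

```python
from collections import Counter
from typing import Optional


def latest_complete(ohs: dict[str, frozenset[int]], q: int) -> Optional[int]:
    # Assumption: a version is complete if at least q replica histories hold it.
    counts = Counter(ts for history in ohs.values() for ts in history)
    complete = [ts for ts, n in counts.items() if n >= q]
    return max(complete) if complete else None


t = b = 1
q = 2 * t + 2 * b + 1  # = N - t with N = 3t + 2b + 1 = 6; here q = 5
ohs = {s: frozenset({1, 2}) for s in ("s1", "s2", "s3", "s4", "s5")}
print(latest_complete(ohs, q))  # 2: all five replies hold version 2
ohs["s5"] = frozenset({1})
print(latest_complete(ohs, q))  # 1: version 2 is now in only four histories
```

Having found Latest, the client would then issue the query for that version.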

Update operation
• Client performs the read-history operation
• Constructs the OHS and identifies Latest, the latest complete version
• Client sends the operation and the OHS to the servers
  • The operation is conditioned on the OHS
[Figure: timeline of read-history/history-reply followed by update/update-reply messages; the client sends the OHS with each update, and the servers add version 3]
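
One update round can be simulated in Python under stated assumptions: six hypothetical replicas (N = 6 for t = b = 1), an assumed quorum size q = 5, integer timestamps, no authenticators, and some version always complete. Note that each replica classifies Latest from the OHS itself rather than trusting the client's choice.

```python
from collections import Counter
from typing import Callable

Op = Callable[[int], int]


def classify_latest(ohs: dict[int, frozenset[int]], q: int) -> int:
    # Assumption: Latest is the highest timestamp held by at least q histories.
    counts = Counter(ts for history in ohs.values() for ts in history)
    return max(ts for ts, n in counts.items() if n >= q)


class Replica:
    """Hypothetical versioning replica storing (timestamp, state) pairs."""

    def __init__(self, initial: int) -> None:
        self.history: list[tuple[int, int]] = [(0, initial)]

    def read_history(self) -> frozenset[int]:
        return frozenset(ts for ts, _ in self.history)

    def update(self, op: Op, ohs: dict[int, frozenset[int]], q: int) -> bool:
        # The server classifies Latest from the OHS itself instead of trusting
        # the client, and accepts only if its own newest version is Latest.
        latest = classify_latest(ohs, q)
        ts, state = self.history[-1]
        if ts != latest:
            return False
        self.history.append((ts + 1, op(state)))
        return True


def client_update(replicas: list[Replica], op: Op, q: int) -> bool:
    # read-history: collect every replica's history into the OHS.
    ohs = {i: r.read_history() for i, r in enumerate(replicas)}
    # update: send the operation together with the OHS it is conditioned on.
    accepted = sum(r.update(op, ohs, q) for r in replicas)
    return accepted >= q


replicas = [Replica(0) for _ in range(6)]  # N = 6 for t = b = 1
print(client_update(replicas, lambda x: x + 1, q=5))  # True
print(replicas[0].history)                            # [(0, 0), (1, 1)]
```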

Server validation for update operations
• A server needs to verify that the client conditioned the operation on Latest
• Validation steps:
  • Ensure read/conditional-write semantics
    • Check that the local history matches that in the OHS
  • Classify the Latest write version
    • Ensures the operation is based on the appropriate timestamp
  • Protection against Byzantine failures
    • Check authenticators
    • Ensures integrity of the OHS
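
The three validation steps can be sketched in Python under several simplifications: timestamps are integers, and each server's history in the OHS carries one HMAC-SHA256 tag under a key that the validating server shares with that server. This is a stand-in for Q/U's authenticators; the key setup, names, and "complete" rule (at least q histories) are assumptions. The usage replays the two-client directory example: client B's OHS was collected before client A's update landed, so it is stale and rejected.

```python
import hashlib
import hmac
from collections import Counter


def history_tag(key: bytes, server: str, history: frozenset[int]) -> bytes:
    # Simplified authenticator: an HMAC over one server's reported history.
    message = f"{server}:{sorted(history)}".encode()
    return hmac.new(key, message, hashlib.sha256).digest()


def validate_update(me: str, local: frozenset[int], ohs: dict[str, frozenset[int]],
                    tags: dict[str, bytes], keys: dict[str, bytes], q: int) -> bool:
    # 1. Integrity of the OHS: every entry must carry a valid tag under the key
    #    this server shares with the server that reported the history.
    for server, history in ohs.items():
        if server not in keys or server not in tags:
            return False
        expected = history_tag(keys[server], server, history)
        if not hmac.compare_digest(expected, tags[server]):
            return False
    # 2. Read/conditional-write: this server's entry in the OHS must match its
    #    current local history, i.e. nothing was written since the client read.
    if ohs.get(me) != local:
        return False
    # 3. Classify Latest (assumed: highest timestamp held by >= q histories) and
    #    require that the update is based on this server's newest version.
    counts = Counter(ts for history in ohs.values() for ts in history)
    complete = [ts for ts, n in counts.items() if n >= q]
    return bool(complete) and max(complete) == max(local)


servers = ["s1", "s2", "s3", "s4", "s5"]
keys = {s: f"key-s1-{s}".encode() for s in servers}  # keys s1 shares with each server
ohs = {s: frozenset({1, 2}) for s in servers}         # histories both clients read
tags = {s: history_tag(keys[s], s, ohs[s]) for s in servers}

print(validate_update("s1", frozenset({1, 2}), ohs, tags, keys, q=5))     # True: client A
print(validate_update("s1", frozenset({1, 2, 3}), ohs, tags, keys, q=5))  # False: B's OHS is stale
forged = dict(ohs, s2=frozenset({1, 2, 9}))
print(validate_update("s1", frozenset({1, 2}), forged, tags, keys, q=5))  # False: bad tag
```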

Server validation example
• Earlier example of 2 clients concurrently updating the same directory
• Servers reject client B’s operation due to its “stale” OHS
[Figure: timeline of read-history and update messages from clients A and B]

Q/U protocol details
• Handling Byzantine clients and server faults
  • Through validating timestamps and the OHS
  • During classification of Latest, may require repair
  • Incomplete operations: use barriers to fix failures
• Flexible protocol: can handle different types and numbers of faults
  • For an asynchronous system with Byzantine clients: N = 3t + 2b + 1 servers tolerate t server faults, b of which are Byzantine
• Object syncing
• Multi-object operations
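
The server count on this slide is easy to tabulate. The Python helper below is a hypothetical convenience that simply evaluates N = 3t + 2b + 1. When every tolerated fault may be Byzantine (t = b), this reduces to N = 5b + 1, so tolerating the 3-5 Byzantine failures mentioned under Expected Impact takes 16-26 servers.

```python
def servers_required(t: int, b: int) -> int:
    """Servers needed in an asynchronous system with Byzantine clients:
    N = 3t + 2b + 1 tolerates t server faults, b of which are Byzantine."""
    if not 0 <= b <= t:
        raise ValueError("require 0 <= b <= t (Byzantine faults are counted in t)")
    return 3 * t + 2 * b + 1


for f in range(1, 6):
    print(f"t = b = {f}: N = {servers_required(f, f)}")  # 6, 11, 16, 21, 26
```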

Object syncing
• A server may not have the latest version of an object
• If a server lacks the latest version, the OHS contains information about which other servers have that version
• The server must sync the object with another server
  • Hashes in the OHS allow the server to validate the synced object
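
A Python sketch of the syncing step, assuming (hypothetically) that the OHS yields a SHA-256 digest of the needed version plus a list of servers believed to hold it, and that `fetch` retrieves a copy from a named server. The server keeps the first copy whose digest matches, so a faulty peer cannot make it adopt a corrupted object.

```python
import hashlib
from typing import Callable


def digest(state: bytes) -> str:
    return hashlib.sha256(state).hexdigest()


def sync_object(expected_digest: str, holders: list[str],
                fetch: Callable[[str], bytes]) -> bytes:
    # Ask each server the OHS names as a holder; keep the first copy whose
    # digest matches the one recorded in the OHS.
    for server in holders:
        try:
            candidate = fetch(server)
        except KeyError:  # this server has nothing to offer
            continue
        if digest(candidate) == expected_digest:
            return candidate
    raise LookupError("no holder returned a copy matching the OHS digest")


# Hypothetical peers: s2 is faulty and returns the wrong bytes; s4 is unreachable.
peers = {"s2": b"corrupted directory", "s3": b"directory version 3"}
wanted = digest(b"directory version 3")
print(sync_object(wanted, ["s2", "s4", "s3"], peers.__getitem__))  # b'directory version 3'
```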

Multi-object operation
• An update can span multiple objects
• A client must construct an OHS for each object
• Servers perform validation for each object
• Operations are performed atomically across multiple objects
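
At a single server, the all-or-nothing behavior can be sketched in Python as "validate every object, then apply to every object". In this hypothetical sketch, a per-object version check stands in for validating each object's OHS, and the `move_entry` rename is an illustrative operation, not one from the talk.

```python
from typing import Any, Callable


class MultiObjectServer:
    """Hypothetical server holding several objects as (version, state) pairs."""

    def __init__(self, initial: dict[str, Any]) -> None:
        self.objects = {name: (0, state) for name, state in initial.items()}

    def update(self, conditions: dict[str, int],
               op: Callable[[dict[str, Any]], dict[str, Any]]) -> bool:
        # Validate every object first. A per-object version check stands in
        # for validating each object's OHS.
        for name, version in conditions.items():
            if name not in self.objects or self.objects[name][0] != version:
                return False  # one stale object rejects the whole update
        # All checks passed: apply the operation to every object together.
        new_states = op({name: self.objects[name][1] for name in conditions})
        for name in conditions:
            version = self.objects[name][0]
            self.objects[name] = (version + 1, new_states[name])
        return True


def move_entry(dirs: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical rename: move "f" from directory src to directory dst.
    return {"src": dirs["src"] - {"f"}, "dst": dirs["dst"] | {"f"}}


server = MultiObjectServer({"src": frozenset({"f"}), "dst": frozenset()})
print(server.update({"src": 0, "dst": 0}, move_entry))  # True
print(server.update({"src": 1, "dst": 0}, move_entry))  # False: dst is stale
print(server.objects)  # both at version 1; the rejected update changed nothing
```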


Prototype evaluation
• Built a counter object using the Q/U and BFT protocols
  • inc method increments the counter and returns the new value
  • fetch method returns the current counter value
  • Lightweight operations to demonstrate the network and computation overhead inherent to the protocols
• Both Q/U and BFT implement efficient, optimistic queries
  • Evaluation focuses on updates
• Q/U common case: no concurrency; preferred quorums
• BFT common case: shared counter to allow batching
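
The benchmark object's semantics fit in a few lines. This Python sketch shows only the object itself; the method names `inc` and `fetch` follow the slide, while the class is hypothetical. In the experiments these operations run through Q/U or BFT, with `inc` as an update and `fetch` as a query, and because they do almost no work, measured costs reflect the replication protocol.

```python
class CounterObject:
    """Hypothetical sketch of the benchmark object: a versioned integer counter."""

    def __init__(self) -> None:
        self.versions = [0]  # history of counter values; index = version

    def inc(self) -> int:
        # Update (mutating): increments the counter and returns the new value.
        self.versions.append(self.versions[-1] + 1)
        return self.versions[-1]

    def fetch(self) -> int:
        # Query (read-only): returns the current counter value.
        return self.versions[-1]


counter = CounterObject()
counter.inc()
counter.inc()
print(counter.fetch())  # 2
```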

Experimental setup
• Cluster of Pentium 4 2.8 GHz machines with 1 GB RAM
• 1 Gb switched Ethernet; 18.3 Gbps / 35.7 Mpps switch
  • No background traffic
• Working set of the experiments fits in server memory
  • To focus on protocol overhead, not on disk accesses
• Experiments are run for 30 seconds
  • Measurements taken from the middle 10 seconds
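
A small Python helper for the measurement window, assuming each run logs operation completion times in seconds from the start of the run (a hypothetical log format). Keeping only the middle 10 seconds of a 30-second run presumably excludes start-up and shut-down effects.

```python
def steady_state_throughput(completion_times: list[float],
                            run_length: float = 30.0,
                            window: float = 10.0) -> float:
    """Operations per second completed during the middle `window` seconds.

    completion_times are seconds since the start of the run.
    """
    start = (run_length - window) / 2  # 10.0 s into a 30 s run
    end = start + window               # 20.0 s
    completed = sum(1 for t in completion_times if start <= t < end)
    return completed / window
```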

Fault scalability (1)
• Investigate throughput as the number of server faults tolerated (b) increases
• Measured saturated throughput
  • Ran with 1, 3, 5, …, 20 clients, each with 2 outstanding requests
  • For each b, selected the highest throughput value
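
The selection rule above can be expressed as a short Python function. The input layout, a hypothetical mapping from b to {client count: measured throughput}, is an assumption about how results might be stored; no measured values are implied.

```python
def saturated_throughput(results: dict[int, dict[int, float]]) -> dict[int, tuple[int, float]]:
    """For each b, the highest throughput over all client counts tried.

    results[b][clients] holds the measured throughput (ops/s) for that run;
    returns b -> (client count that achieved the peak, peak throughput).
    """
    best = {}
    for b, by_clients in results.items():
        clients = max(by_clients, key=by_clients.__getitem__)
        best[b] = (clients, by_clients[clients])
    return best
```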

Fault scalability (2)

Throughput and response time under load (1)
• Investigate throughput and response time under load
  • Demonstrates protocol behavior beyond the saturated-throughput data point
• Increased the number of clients from 1 to 20 for b = 1

Throughput and response time under load (2)

Conclusions
• Developed the Q/U protocol for accessing shared objects in a distributed system
  • Fault-scalable
  • Byzantine fault-tolerant
  • Optimistic, efficient
  • Atomic multi-object operations
• Evaluation
  • Protocol scales with the number of failures tolerated
  • Throughput and response time remain consistent under load
