
Got Predictability? Experiences with FT Middleware

This presentation explores the predictability of fault-tolerant middleware in IT infrastructures, motivated by service-level agreements, problem determination, and self-management, and presents empirical data on the unpredictability of both faulty and fault-free behaviour.



Presentation Transcript


  1. Got Predictability? Experiences with FT Middleware. Tudor Dumitraş, Priya Narasimhan, Carnegie Mellon University

  2. Who Needs Predictability? • Service-level agreements • Problem determination, fingerpointing • Self-management, autonomic computing • FT middleware protects the critical parts of IT infrastructures, so it faces higher predictability requirements

  3. Predictability of Fault-Tolerant Middleware • Faults are inherently unpredictable • What about the fault-free case? • Reportedly, max (response time) >> average (response time)

  4. Empirical Data Collected • MEAD Trace: micro-benchmark (client-server) • Middleware for Embedded Adaptive Dependability • Fault-Tolerant CORBA implementation • 1200 configurations • FTDS Trace: 7 macro-benchmarks (3-tier applications) • Developed during Fault-Tolerant Distributed Systems class • Enterprise applications: online gaming, e-commerce • Use CORBA or EJB • 336 configurations Available at: http://www.ece.cmu.edu/~tdumitra/FT_traces/

  5. Fault-Free vs. Faulty Unpredictability (MEAD Trace). [Chart: average recovery time (s) vs. number of clients (1–22), annotated with the maximum fault-free latency.]

  6. Fault-Free vs. Faulty Unpredictability (FTDS Trace). [Chart: recovery time (s) for projects 1–7, broken down into fault detection, fail-over, and request processing; the maximum fault-free latency is 13.6 s.]

  7. Outline • Can we predict the maximum latency of FT middleware? • When do high latencies occur and how high are they? • How common are the high latencies? • Do most requests have bounded latencies?

  8. MEAD Architecture. [Diagram: replicated client and server objects running over CORBA on separate hosts; on each host the MEAD Replicator sits between the ORB and the group-communication layer, with an interface to the application/CORBA (modified system calls) and an interface to group communication. Tunable mechanisms: replication style, number of replicas, replicated state.] Active replication: all replicas process requests. Passive replication: the primary replica processes requests.
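
To make the two replication styles concrete, here is a minimal Python sketch that assumes a toy server whose replicated state is a single counter. The class and method names are illustrative and are not MEAD's actual API.

```python
# Minimal sketch of active vs. passive replication (illustrative only).

class Replica:
    """A single server replica holding replicated state (a counter)."""
    def __init__(self):
        self.state = 0

    def process(self, request):
        self.state += request          # update replicated state
        return self.state              # reply to the client


class ActiveGroup:
    """Active replication: every replica processes every request;
    duplicate replies are suppressed and a single reply is returned."""
    def __init__(self, n_replicas):
        self.replicas = [Replica() for _ in range(n_replicas)]

    def invoke(self, request):
        replies = [r.process(request) for r in self.replicas]
        return replies[0]              # suppress duplicates, return one reply


class PassiveGroup:
    """Passive replication: only the primary processes requests;
    its state is transferred to the backup replicas."""
    def __init__(self, n_replicas):
        self.primary = Replica()
        self.backups = [Replica() for _ in range(n_replicas - 1)]

    def invoke(self, request):
        reply = self.primary.process(request)
        for b in self.backups:         # state transfer to backups
            b.state = self.primary.state
        return reply


if __name__ == "__main__":
    active, passive = ActiveGroup(3), PassiveGroup(3)
    print(active.invoke(5), passive.invoke(5))   # both groups reply consistently
```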

  9. Applications from the FTDS Trace • Su-Duel-Ku: competitive Sudoku • FTEX: electronic stock exchange • Mafia: online game • Ticket Center: online ticketing • Blackjack: online casino • eJBay: online auctioning • Park’n Park: parking-lot management. [Diagram classifies each application by middleware (CORBA or EJB) and replication style (active or passive).]

  10. Architecture of FTDS Applications

  11. Outline • Can we predict the maximum latency of FT middleware? • When do high latencies occur and how high are they? • How common are the high latencies? • Do most requests have bounded latencies?

  12. Example of Unpredictability. The maximum latency can be orders of magnitude larger than the average. [Charts: latency (μs) over time (s) and the probability density function of latency, showing a long tail.]
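
As a rough illustration of how the maximum can dwarf the average, the short Python sketch below computes both statistics over a synthetic, heavy-tailed latency sample. The numbers are invented for illustration and are not taken from the traces.

```python
import random

# Synthetic, heavy-tailed latency sample in microseconds: mostly fast requests
# plus a few very slow ones. The numbers are made up for illustration only.
random.seed(0)
latencies_us = [random.gauss(800, 50) for _ in range(10_000)]
latencies_us += [random.uniform(10_000, 20_000) for _ in range(20)]

average = sum(latencies_us) / len(latencies_us)
worst = max(latencies_us)
print(f"average = {average:.0f} us, max = {worst:.0f} us, "
      f"max/average = {worst / average:.1f}x")
```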

  13. Unpredictability in the MEAD Trace. [3-D chart: average latency (μs, log scale) vs. request size (16 B–64 KB) and request rate (0–5000 req/s).]

  14. Unpredictability in the MEAD Trace. [3-D charts: average and maximum latency (μs, log scale) vs. request size (16 B–64 KB) and request rate (0–5000 req/s).]

  15. Average and Maximum Latency. [Scatter plots of maximum vs. average latency (s): a linear-scale plot for SuDuelKu, FTEX, Park’n Park, and Ticket Center, and a log-log plot for MEAD.]

  16. Outline • Can we predict the maximum latency of FT middleware? • When do high latencies occur and how high are they? • How common are the high latencies? • Do most requests have bounded latencies?

  17. Statistical Analysis of Unpredictability. [Chart: probability density function of latency (μs), showing a long tail.]

  18. Correlation with Message Size (MEAD). [Charts: maximum z-score (up to ~300) and percentage of outliers (up to ~1.5%) vs. size of reply messages (16 B–64 KB).]
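
The outlier measurements on this and the following slides are reported as z-scores. The Python sketch below shows one plausible way to compute the maximum z-score and the percentage of outliers; the 3-sigma cut-off is an assumption, since the exact threshold used in the study is not stated here.

```python
import statistics

def z_scores(latencies):
    """Standardize each latency: (x - mean) / standard deviation."""
    mean = statistics.mean(latencies)
    stdev = statistics.stdev(latencies)
    return [(x - mean) / stdev for x in latencies]

def outlier_stats(latencies, threshold=3.0):
    """Return the maximum z-score and the fraction of latencies whose
    z-score exceeds the threshold (the 3-sigma cut-off is an assumption)."""
    zs = z_scores(latencies)
    outliers = [z for z in zs if z > threshold]
    return max(zs), len(outliers) / len(zs)

# Synthetic sample (microseconds): 1% of the requests are very slow.
sample = [1_000] * 990 + [50_000] * 10
max_z, fraction = outlier_stats(sample)
print(f"max z-score = {max_z:.1f}, outliers = {fraction:.1%}")
```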

  19. Time in Kernel and User Mode (MEAD) • ~25% of the time spent in kernel mode for 16 KB and 64 KB replies • ~10% in kernel mode for 16 B, 256 B, and 4 KB replies

  20. Number and Size of Outliers (FTDS). [Charts: maximum z-score (up to ~150) and percentage of outliers (up to ~3%) for each FTDS project: SuDuelKu, FTEX, eJBay, Mafia, Ticket Center, Blackjack, Park’n Park.]

  21. Correlation with Number of Clients (FTDS). [Charts for SuDuelKu: maximum z-score (up to ~60) and percentage of outliers (up to ~6%) vs. number of clients (1–10).]

  22. Correlation with Request Rate (FTDS). [Charts for FTEX: maximum z-score (up to ~60) and percentage of outliers (up to ~6%) vs. request rate (5–25 req/s).]

  23. Outline • Can we predict the maximum latency of FT middleware? • When do high latencies occur and how high are they? • How common are the high latencies? • Do most requests have bounded latencies?

  24. Outlier Distribution (MEAD). [Histogram: number of experiments vs. percentage of outliers per experiment (0%–6%).]

  25. Outlier Distribution (Comparison). [Probability densities of outliers per experiment (0%–6%) for Ticket Center, eJBay, Park’n Park, Blackjack, FTEX, Mafia, and SuDuelKu.]

  26. Isolating the Unpredictability (MEAD)

  27. Isolating the Unpredictability (MEAD) The “haircut” effect of removing 1% of the highest latencies

  28. The Magical 1% Unpredictability seems to be confined to 1% of the remote invocations.
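
A minimal sketch of the "haircut" from the previous slide: sort the latencies, drop the top 1%, and compare the maximum before and after. The sample is synthetic and only illustrates the effect.

```python
def haircut(latencies, trim=0.01):
    """Sort the latencies, drop the top `trim` fraction, and return the
    maximum before and after the cut."""
    ordered = sorted(latencies)
    keep = int(len(ordered) * (1.0 - trim))
    return ordered[-1], ordered[keep - 1]

# Synthetic sample: the worst 1% of requests dominates the maximum latency.
sample = [100] * 9_900 + [10_000] * 100
raw_max, trimmed_max = haircut(sample)
print(f"max = {raw_max}, max after removing the top 1% = {trimmed_max}")
```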

  29. Magical 1%. [Charts: average latency, 99th-percentile latency, and maximum latency (s) per experiment for MEAD, SuDuelKu, Park’n Park, Blackjack, Mafia, and FTEX.]

  30. Outline • Can we predict the maximum latency of FT middleware? • When do high latencies occur and how high are they? • How common are the high latencies? • Do most requests have bounded latencies?

  31. Bounds for the 99th Percentile. [Chart: latency ranges, 99th percentiles, and confidence intervals, plotted as z-scores of latency (0–240), for MEAD, Ticket Center, Park’n Park, Mafia, eJBay, FTEX, Blackjack, and SuDuelKu.]
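
One standard way to bound the 99th percentile with high confidence is a bootstrap confidence interval, sketched below in Python. The slides do not state which estimation method the authors used, so treat this as an assumption, not their procedure.

```python
import random

def nearest_rank(sorted_sample, q):
    """Nearest-rank percentile of an already-sorted sample (0 <= q <= 1)."""
    idx = min(len(sorted_sample) - 1, int(round(q * (len(sorted_sample) - 1))))
    return sorted_sample[idx]

def p99_confidence_interval(latencies, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for the 99th-percentile
    latency by resampling the observed latencies with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        nearest_rank(sorted(rng.choices(latencies, k=len(latencies))), 0.99)
        for _ in range(n_boot)
    )
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic latency sample (microseconds), purely for illustration.
rng = random.Random(1)
sample = [rng.gauss(1000, 100) for _ in range(2000)]
low, high = p99_confidence_interval(sample)
print(f"99th percentile is in [{low:.0f}, {high:.0f}] us with ~95% confidence")
```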

  32. Trends for the 99th Percentile (MEAD). [3-D chart: 99th-percentile latency (μs, log scale) vs. request size (16 B–64 KB) and request rate (0–5000 req/s).]

  33. Summary • Can we predict the maximum latency of FT middleware? • Not always; maximum usually not correlated with average • When do high latencies occur and how high are they? • Usually not correlated with configuration parameters, OS metrics • Comparable with recovery time after crash faults • How common are the high latencies? • Confined to 1% of remote invocations • Do most requests have bounded latencies? • 99% of requests have a z-score < 10

  34. Implications of the Magical 1% • Predictable maximum latencies are hard to achieve • Cannot eliminate high latencies by carefully configuring the system • Statistical predictability is easy to achieve • 99th percentile latency bounded with high confidence • Confirmed for different • Applications • Programming languages • Middleware technologies • Replication mechanisms • Operating systems • Not confirmed for WANs, wireless networks • Statistical predictability is relevant for many enterprise applications

  35. Thank You! For more information: http://www.ece.cmu.edu/~tdumitra

  36. MEAD Trace vs. FTDS Trace

  37. Experimental Setup • MEAD test bed: Emulab, 100 Mb/s LAN, Pentium III at 850 MHz; parameters varied: replication style (active, passive), replication degree (1, 2, 3 replicas), number of clients (1–22), think time (0, 0.5, 2, 8, 32 ms), reply size (16 B, 256 B, 4 KB, 16 KB, 64 KB) • FTDS test bed: undergraduate cluster, 100 Mb/s LAN, Pentium IV at 2.4 GHz; parameters varied: clients (1, 4, 7, 10), think time (0, 20, 40 ms), reply size (original, 256 B, 512 B, 1 KB)
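
For reference, the MEAD parameter grid above can be enumerated in a few lines of Python. The full cross-product is larger than the 1200 configurations in the trace, so presumably not every combination was run.

```python
from itertools import product

# Parameter grid of the MEAD test bed listed above. This only sketches how the
# configuration space is spanned; the trace itself contains 1200
# configurations, so presumably not every combination was run.
replication_styles = ["active", "passive"]
replication_degrees = [1, 2, 3]
numbers_of_clients = list(range(1, 23))          # 1 to 22 clients
think_times_ms = [0, 0.5, 2, 8, 32]
reply_sizes = ["16 B", "256 B", "4 KB", "16 KB", "64 KB"]

configurations = list(product(replication_styles, replication_degrees,
                              numbers_of_clients, think_times_ms, reply_sizes))
print(f"full cross-product: {len(configurations)} configurations")
```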

  38. Sources of Unpredictability. [Diagram: the request/reply path between client and server applications, passing through the ORB, the interceptors (interc_hi, interc_lo), the Replicator, and group communication on each side.]

  39. Passive Replication. [Diagram: passively replicated client and server object groups; requests and responses pass through the primary replicas, and the primary's state is transferred to the backup replicas.]

  40. Active Replication. [Diagram: actively replicated client and server groups; duplicate invocations and duplicate responses are suppressed.]
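
The duplicate suppression shown on this slide can be sketched with per-invocation identifiers. The code below is illustrative only and is not the MEAD implementation.

```python
class DuplicateSuppressor:
    """Track invocation (or response) identifiers and deliver only the
    first copy of each, suppressing duplicates from the other replicas.
    Illustrative only; this is not the MEAD implementation."""
    def __init__(self):
        self.seen = set()

    def first_delivery(self, request_id):
        if request_id in self.seen:
            return False            # duplicate: suppress
        self.seen.add(request_id)
        return True                 # first copy: deliver

server_side = DuplicateSuppressor()
# Three client replicas send the same invocation (id 7); only one is delivered.
print([server_side.first_delivery(7) for _ in range(3)])   # [True, False, False]
```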

  41. Correlation with Number of Clients (MEAD). [Charts: maximum z-score (up to ~400) and percentage of outliers (up to ~1%) vs. number of clients (1–22).]

  42. Minor Page Faults (MEAD)

  43. Outlier Distribution (MEAD). [Histogram, repeated from slide 24: number of experiments vs. percentage of outliers per experiment (0%–6%).]
