Maximizing System Availability: Lessons from AT&T

Introduction: High-Availability Systems (An Example: AT&T)

• Pioneered fault tolerance (FT) in telephone switching applications.
• Aggressive availability goal: 2 hours of downtime in 40 years (i.e., 3 min/year), with less than 0.01% of calls handled incorrectly.
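As a quick check of the arithmetic behind that goal, the sketch below converts "2 hours in 40 years" into minutes per year and into an availability fraction. The variable names are illustrative, and a 365-day year is assumed (leap days ignored).

```python
# Availability implied by "2 hours of downtime in 40 years".
# Assumption: a year is 365 days; leap days are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

downtime_min_per_year = (2 * 60) / 40  # 120 minutes spread over 40 years
availability = 1 - downtime_min_per_year / MINUTES_PER_YEAR

# The second part of the goal: fewer than 0.01% of calls mishandled.
max_mishandled_fraction = 0.01 / 100

print(downtime_min_per_year)   # 3.0
print(f"{availability:.4%}")   # 99.9994%
```

In other words, the target corresponds to roughly "five nines" of availability.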

In 1978, Bell Labs collected data on historic trends in the causes of system downtime:
• 20% attributed to hardware (HW). Good diagnostics and trouble-location programs can help minimize HW-induced downtime.
• 15% attributed to software (SW). SW deficiencies included improper translation of algorithms into code and improper specifications.
• 35% attributed to recovery deficiencies, which can be caused by undetected faults or incorrect fault isolation.
• 30% attributed to human procedural error.

Other studies in the same direction ...

There is some natural redundancy in the telephone switching network: “a telephone user will redial if he gets a wrong number or is disconnected”. However, there is a user aggravation level that must be avoided: “users will redial as long as it does not happen too frequently”.

Note, however, that the thresholds differ: moderately high for failure to establish a call, and very low for disconnection of an established call.

[Figure: Levels of recovery in a telephone switching system]

In a typical telephone switching system, the tasks of the Central Control unit relate to:
• Overall system control/administration
• Call processing
• System maintenance
• Automatic isolation of faulty units
• Defensive SW strategies
• Support for rapid repair

[Figure: Typical switching system diagram. Blocks shown: Bus Interface, Program Store (PS), Central Control (CC), Call Store (CS), and Auxiliary Units (AU) attached to the Auxiliary Unit (AU) Bus.]

CC instructions reside in the program store (PS), while transient information (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS). The Auxiliary Unit (AU) Bus interfaces to disk and magnetic tape mass storage.

[Figure: Duplex configuration for the switching computer. Each block is duplicated: Peripheral Unit Bus (PUB1, PUB2), Bus Interface 1 and 2, Program Store Bus (PSB1, PSB2), Program Store 1 and 2 (PS), Central Control 1 and 2 (CC), Call Store 1 and 2 (CS), and AU 1 and AU 2 on the Auxiliary Unit (AU) Bus.]

Legend: PSB = Program Store Bus; PUB = Peripheral Unit Bus.

Assuming that only one of each component is required for a functional system, there are 64 possible system configurations.
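The count of 64 is consistent with choosing one of two copies for each of six duplicated unit types (2^6 = 64). The slide does not say which six unit types are counted, so the labels in the sketch below are placeholders, not the actual ESS subunits.

```python
from itertools import product

# Placeholder labels: the slide does not identify the six unit types
# behind the figure of 64, so these names are purely illustrative.
UNIT_TYPES = ["type_1", "type_2", "type_3", "type_4", "type_5", "type_6"]


def all_configurations(unit_types):
    """Yield every way of picking copy 1 or copy 2 of each unit type."""
    for choice in product((1, 2), repeat=len(unit_types)):
        yield dict(zip(unit_types, choice))


configs = list(all_configurations(UNIT_TYPES))
print(len(configs))  # 64
```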

1. Both CCs operate in synchronism. Two matcher circuits compare 24 bits of internal state during each 5.5 µs machine cycle.
2. There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.
3. A mismatch generates an interrupt, which calls fault recognition programs to determine which half of the system is faulty.
4. Information can be sampled by the matchers and retained for later examination by diagnostic programs.
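The matching scheme in items 1 to 4 can be sketched in software as a per-cycle comparison of two 24-bit state samples. This is only an illustration of the idea: the real matchers are hardware circuits, and the function names and trace format here are assumptions.

```python
STATE_MASK = (1 << 24) - 1  # the matchers compare 24 bits per machine cycle


def match_cycle(state_cc1: int, state_cc2: int) -> bool:
    """Return True when the two sampled 24-bit states agree."""
    return ((state_cc1 ^ state_cc2) & STATE_MASK) == 0


def first_mismatch(trace_cc1, trace_cc2):
    """Compare two per-cycle state traces.

    Returns the index of the first cycle whose samples differ, or None if
    the traces agree. A mismatch only shows that the halves disagree;
    deciding which half is faulty is left to fault recognition programs.
    """
    for cycle, (a, b) in enumerate(zip(trace_cc1, trace_cc2)):
        if not match_cycle(a, b):
            return cycle
    return None


# Hypothetical traces: the second CC diverges in cycle 1.
print(first_mismatch([0x000001, 0x0000A2, 0x000003],
                     [0x000001, 0x0000A6, 0x000003]))  # 1
```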

5. The PS employs a Hamming code on the 37 data bits.
6. There are parity check bits over the address plus data bus: the CS has one parity bit on address and data, and another parity bit on address only.
7. Both PS and CS automatically retry operations upon error detection.
8. After a fault has been detected, the system configuration logic attempts to establish various combinations of subunits.
9. A sanity program is then executed.
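To illustrate how a Hamming code over 37 data bits (item 5) can locate a single-bit error, here is a minimal textbook sketch. It is not the actual ESS bit layout, which the slides do not give: parity bits are placed at power-of-two positions, which is the standard classroom construction, and all function names are illustrative. With 37 data bits this construction needs 6 check bits, for a 43-bit codeword.

```python
def hamming_encode(data_bits):
    """Encode data bits with a textbook single-error-correcting Hamming code.

    Positions are 1-based; parity bits sit at powers of two.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)  # index 0 is unused
    remaining = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(remaining)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity  # makes the parity of group i even
    return code[1:]


def hamming_syndrome(codeword):
    """XOR of the 1-based positions holding a 1; zero means no error seen."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    return syndrome


def hamming_correct(codeword):
    """Return (corrected codeword, syndrome), fixing a single-bit error."""
    fixed = list(codeword)
    syndrome = hamming_syndrome(fixed)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


# Arbitrary 37-bit test pattern (not real program-store content).
data = [(i * i + 1) % 2 for i in range(37)]
word = hamming_encode(data)

corrupted = list(word)
corrupted[19] ^= 1  # flip the bit at 1-based position 20
repaired, syndrome = hamming_correct(corrupted)
print(len(word), syndrome, repaired == word)  # 43 20 True
```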

Summarizing some features of the FT system:
• Duplication of ALU.
• 30% of control logic devoted to self-checking.
• EDAC on disks.
• SW audits.
• Sanity timer. A sanity program is similar to a maze that the HW must traverse before the sanity timer times out. If a time-out occurs, the reconfiguration logic generates a new configuration to be tried.
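The sanity-timer loop described above can be modeled roughly as follows. This is a toy simulation under stated assumptions: only three unit types (CC, PS, CS) are configured, a faulty unit is modeled as a hang that lets the timer expire, and time is counted in abstract ticks rather than real hardware cycles. The helper names are hypothetical.

```python
from itertools import product

UNIT_TYPES = ["CC", "PS", "CS"]  # reduced set, chosen for brevity


def make_sanity_program(faulty_units):
    """Build a toy sanity program for a system with the given faulty units.

    The returned function walks through the configured units (the "maze").
    It returns the ticks used, or None if the timer would expire first.
    """
    def run(config, timeout_ticks):
        ticks = 0
        for unit, copy in config.items():
            ticks += 1
            if (unit, copy) in faulty_units:
                return None  # modeled as a hang: the sanity timer expires
            if ticks > timeout_ticks:
                return None  # too slow: the sanity timer expires
        return ticks
    return run


def find_sane_configuration(configurations, sanity_program, timeout_ticks):
    """Try configurations in order until one completes the sanity program."""
    for config in configurations:
        if sanity_program(config, timeout_ticks) is not None:
            return config
    return None  # every configuration timed out


all_configs = [dict(zip(UNIT_TYPES, choice))
               for choice in product((1, 2), repeat=len(UNIT_TYPES))]
sanity = make_sanity_program(faulty_units={("CC", 1)})
print(find_sane_configuration(all_configs, sanity, timeout_ticks=10))
# {'CC': 2, 'PS': 1, 'CS': 1}
```

The search order here is simply the enumeration order; the slides do not say in what order the real reconfiguration logic tries configurations.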

• Integrity monitor (supervisor).
• Byte parity on datapaths.
• Parity checking where parity is preserved, duplication otherwise.
• Two parity bits on registers.
• Modified Hamming code on main memory.
• Maintenance channel for observability and controllability.
