Workshop on Parallel and Distributed Real-Time Systems 2005

Challenge Problem Session: Detection and Reaction to Unplanned Operational Events in Large Scale Distributed Real-Time Embedded Systems
Workshop on Parallel and Distributed Real-Time Systems 2005, April 4th and 5th, 2005, Denver, Colorado

Challenge Problem Context
• More real-time and embedded systems are becoming Quality of Service (QoS) enabled, which allows resources to be managed in a more dynamic, policy-based manner.
• The mechanisms for defining and operating on this policy are still maturing.
• These systems are also moving towards more peer-to-peer implementations of resource allocation for managing large-scale distributed networks of mixed hard and soft real-time subsystems.
• The computing devices, consisting of multiple blade processors, number in the hundreds and are connected via a combination of LANs, WANs, and wireless communications.

Challenge Problem
• One of the challenges in the management of resources (e.g., processors, memory, networks, communications, power) is the detection of and reaction to operational events that were unplanned or unanticipated but shouldn’t cause failures (unexpected behavior).
• An example is the receipt of more requests for service than the requirements specify or the system designers anticipated, for a capability whose failure would have a significant impact, e.g., the loss of a great deal of money.
• What approaches, methods, architectural features, and mechanisms exist, are under development, or are the subject of research to deal with these sorts of situations?
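The request-overload example above can be made concrete with a small sketch. It assumes a sliding time window and a hypothetical class name, `RequestRateMonitor`, with parameters `specified_max` and `window_sec`; none of these come from the slides, and a real DRE system would likely use a bounded, preallocated structure instead of an unbounded queue.

```python
from collections import deque


class RequestRateMonitor:
    """Hypothetical detector: flags when arrivals in a sliding window
    exceed the load the requirements specified."""

    def __init__(self, specified_max: int, window_sec: float):
        self.specified_max = specified_max  # assumed requirement: max requests per window
        self.window_sec = window_sec        # assumed window length in seconds
        self._arrivals = deque()            # timestamps still inside the window

    def record(self, t: float) -> bool:
        """Record an arrival at time t (seconds, non-decreasing).

        Returns True if the number of arrivals in the current window
        now exceeds the specified maximum.
        """
        self._arrivals.append(t)
        # Drop arrivals that have aged out of the window.
        while self._arrivals and self._arrivals[0] <= t - self.window_sec:
            self._arrivals.popleft()
        return len(self._arrivals) > self.specified_max


monitor = RequestRateMonitor(specified_max=3, window_sec=1.0)
flags = [monitor.record(t) for t in (0.0, 0.2, 0.4, 0.6, 1.5)]
print(flags)  # [False, False, False, True, False]
```

Only detection is shown here; what the system does once the flag is raised is a policy question and is left open.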

Discussion Points (1 of 3)
• In many large-scale real-time systems there are both periodic and aperiodic processes driven by data exchanges (messages) that affect the system performance.
• In QoS-enabled systems, end-to-end deadlines may be specified for a set of applications that make up an operation’s capability. The policy for responding to certain events may also be specified.
• The occurrence of unplanned operational events may or may not cause resource exhaustion.
• The detection of and remediation action for unanticipated operational events may be specified by a function that defines a set of thresholds (e.g., upper and lower bounds) and the action(s) to be taken when these thresholds are exceeded.
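One possible reading of the threshold function in the last bullet is sketched below. The rule structure, metric names, bounds, and action names are all illustrative assumptions, not part of the challenge problem; the sketch only shows the shape of "bounds plus actions", not how actions would be enacted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: bounds on one metric plus the action for each bound."""
    metric: str
    lower: Optional[float]  # None means no lower bound
    upper: Optional[float]  # None means no upper bound
    on_below: str           # action name when value < lower
    on_above: str           # action name when value > upper


def evaluate(rules: List[ThresholdRule], sample: dict) -> List[str]:
    """Return the actions triggered by one sample of metric values."""
    actions = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None:
            continue  # metric not reported in this sample
        if rule.upper is not None and value > rule.upper:
            actions.append(rule.on_above)
        elif rule.lower is not None and value < rule.lower:
            actions.append(rule.on_below)
    return actions


# Illustrative policy; metric names, bounds, and actions are assumptions.
rules = [
    ThresholdRule("requests_per_sec", lower=None, upper=500.0,
                  on_below="none", on_above="shed_low_priority_load"),
    ThresholdRule("cpu_utilization", lower=0.10, upper=0.85,
                  on_below="release_processor", on_above="migrate_soft_rt_task"),
]

print(evaluate(rules, {"requests_per_sec": 720.0, "cpu_utilization": 0.05}))
# ['shed_low_priority_load', 'release_processor']
```

Keeping the rules as data rather than code is one way to let the policy change at run time, which fits the dynamic, policy-based management described in the context slide.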

Discussion Points (2 of 3)
• Is it better to have separate detection/reaction models for fault detection and handling and for unplanned operational events, or does this make for a more complicated solution?
• Given the nature of distributed systems, what might be the issues with implementing peer-to-peer mechanisms for event detection and correlation, policy management, and policy enactment?
• There are some existing standards (e.g., the IETF SNMP and the Distributed Management Task Force (DMTF) Common Information Model (CIM)) that have been used by some of the enterprise-level system management products (e.g., CA Unicenter, IBM Tivoli), but these don’t really address real-time QoS-based resource management. How can these be extended to support the DRE space for this type of problem?

Discussion Points (3 of 3)
• What issues within the systems and software engineering disciplines affect the development of solutions to these challenge problems? For example, what changes in processes and culture within these disciplines are necessary to support the development of robust solutions that can exceed specified requirements but don’t “break the budget” during the project development life-cycle?
