Given the critical role of fault tolerance (FT) in computing, this discussion explores strategies across hardware, OS, and application levels. We analyze the structure of scalable operating systems, defining local and global OS functions, and their services. Key functionalities, such as process management, file systems, and security, are examined in the context of scalability and performance, particularly on heterogeneous architectures including FPGAs and PIMs. We also address the implications of hardware features on OS design and discuss benchmarks for assessing OS success.
Fault tolerance
• Given that FT is critical, what could/should be done at hw/os/runtime/app level?
• ALL
Structure of Scalable OS
• What are the entities?
• How do we define local/global OS functions?
• What is the functionality of the local OS services? Is none an answer?
• What are global functions?
• Can we adapt PVM to the app it supports?
• Protection boundaries and virtualization with OS
• What's the OS/runtime split?
• ALL
APIs
• Runtime/OS
• Application/runtime
• Tool interfaces (including debugging)
• Interfaces to environment info
• 10
Specific functions
• Process management 9
• File system 18
• Scheduling 10
• Security 2
• QoS 2
• Debugging – invariants 9
OS scalability
• What OS services could/should scale?
• How do we define scalability?
  • Performance nearly independent of machine size?
  • Reliability nearly independent of machine size?
• 10
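One way to make "performance nearly independent of machine size" concrete is a weak-scaling efficiency: the time an OS service takes at the smallest machine size divided by its time at N nodes, with per-node work held fixed. The sketch below is only an illustration under that assumption. The function names, the 0.9 threshold, and the timing numbers are hypothetical and do not come from the discussion.

```python
def weak_scaling_efficiency(timings):
    """Map node count -> efficiency relative to the smallest node count.

    timings: dict of node_count -> seconds for a fixed per-node workload.
    An efficiency of 1.0 means the time did not grow with machine size.
    """
    if not timings or any(t <= 0 for t in timings.values()):
        raise ValueError("timings must be non-empty with positive times")
    base_time = timings[min(timings)]
    return {n: base_time / t for n, t in sorted(timings.items())}


def is_nearly_size_independent(timings, threshold=0.9):
    """True if efficiency stays at or above the (assumed) threshold at every size."""
    return all(e >= threshold for e in weak_scaling_efficiency(timings).values())


# Hypothetical measurements of one OS service (seconds per call).
service_seconds = {1: 0.010, 64: 0.0105, 4096: 0.011}
print(weak_scaling_efficiency(service_seconds))
print(is_nearly_size_independent(service_seconds))
```

The same shape of metric could be applied to reliability, for example by replacing time per call with failure rate per node-hour, though what threshold counts as "nearly independent" is itself one of the open questions on this slide.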
OS for heterogeneous hw
• How do we build runtime/OS support for "crazy" architectures?
  • FPGAs, PIMs, …
• Do we adapt one parallel OS to very different hw architectures? Do we need different OS/runtime solutions?
• What is the spectrum of hw architectures that we can support with one common OS/runtime design?
• 15
Interactive systems
• How do we move HEC into interactive environments?
• What are interactive HEC apps?
• How do we do interactive debugging? Interactive tools? Interactive computational steering? Short shell commands? WS acceleration model? Visualization?
• 12
Hw support for OS
• Study which hw features are important to future scalable OS/runtime, so as to influence hw design, e.g.:
  • Protection
  • Reliable networks
  • Collective ops
  • Atomic memory ops
  • Transactional memory
• 16
Application requirements
• What OS calls are now used by high-performance apps?
• What requirements can we derive for OS/runtime in future systems from apps?
• Identify critical apps we care about
• 14
OS metrics
• What benchmarks and metrics do we use to measure success?
• 8
Programmatic
• How do we get organized to do research in scalable OS?
• Multiple approaches
• Extreme alternatives
• Vendor involvement 12
Vendors
• How can we use existing OS sw?
• Proprietary and/or open source
• 8
Testbeds
• How do we establish testbeds to support scalable OS/runtime research?
• Who funds them?
• What is a testbed? Architecture specific? Simulator?
• 15