
Performance Engineering Large Scale Computing Systems
SC07-APART Workshop on: Performance Analysis and Optimization of High-End Computing Systems
Dr. Frederica Darema, CISE/NSF


Presentation Transcript


  1. Performance Engineering Large Scale Computing Systems SC07-APART Workshop on: Performance Analysis and Optimization of High-End Computing Systems Dr. Frederica Darema CISE/NSF

  2. Outline • The BIG PICTURE • Applications Directions • Computing Platforms Directions • Research and Technology Directions • Examples of some advances • Future Challenges and Opportunities

  3. Science, Engineering, and “Commercial” Applications Environments: how are they shaping in the future What does it entail for:Large-Scale Computing and.. for Large-Scale High-End Computing

  4. Small-Scale and Large-Scale Systems – Increasing complexity of systems and applications … • Processing at multiple levels • Computation and data processing, both at the application and the instruments/sensors side • New Computational Units • Beyond commodity microprocessors / superscalar / (D)MT: MC-Ps, MT, FPGAs, GPUs/(GP)2Us, … • Populating high-end platforms, workstations, visualization servers, data servers, etc. … • Potentially: • MC-Ps, FPGAs, GPUs at the application side • MC-Ps, FPGAs, GPUs at the data acquisition side • One kind of processor EVERYWHERE??? • Or a mix of MC-Ps, FPGAs, GPUs??? • Pros & deficiencies in each – advances close gaps • Complexity persists and increases

  5. Platforms Directions Past • Vector Processors • SIMD MPPs • Distributed Memory MPs • Shared Memory MPs Present • Distributed Platforms, Heterogeneous Computers and Networks • Heterogeneity • architecture (computer & network) • node power (supernodes, MCP) Future • Latencies • variable (internode, intranode) • Bandwidths • different for different links • different based on traffic • GiBs Grids • Petaflops Platform (Grid-in-a-Box) [diagram: distributed platform of MPP, NOW, SP nodes with embedded elements – tac-com, fire cntl, alg accelerator, data base, SAR]

  6. Applications Directions Past • Computation Intensive • Batch • Hours/days • Mostly monolithic • Mostly one programming language Present / Future • Computation Intensive • Data Intensive • Real Time • Few minutes/hours • Visualization • Interactive Steering • Integrated Simulations & Experiments • Dynamic Data Driven Applications Systems • Multi-Modular • Multi-Language • Multi-Developers • Multi-Source Data

  7. Example of new applications and systems directions: Dynamic Data Driven Application Systems (DDDAS) (www.cise.nsf.gov/dddas & www.dddas.org) • Dynamic Integration of Computation & Measurements/Data (from the Real-Time to the High-End) • Unification of Computing Platforms & Sensors/Instruments • DDDAS guides sensor systems architectures [diagram: dynamic feedback & control loop linking Simulations (Math. Modeling, Phenomenology, Observation Modeling, Design), Theory (First Principles), Experiment Measurements, Field-Data (on-line/archival), and the User] DDDAS: ability to dynamically incorporate additional data into an executing application, and in reverse, ability of an application to dynamically steer the measurement process. Challenges: • Application Simulations Methods • Algorithmic Stability • Measurement/Instrumentation Methods • Computing Systems Software Support • Software Architecture Frameworks • Synergistic, Multidisciplinary Research
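
  The DDDAS feedback loop on this slide – a running simulation assimilating new measurements while steering the measurement process back toward where the model is most uncertain – can be sketched as below. This is an illustrative toy, not code from any DDDAS project: `sensor_reading`, the region names, and the Kalman-style update are all hypothetical stand-ins.

```python
import random

def sensor_reading(region, true_field):
    """Stand-in for a field measurement: the true value plus sensor noise."""
    return true_field[region] + random.gauss(0.0, 0.1)

def dddas_loop(true_field, steps=50, seed=1):
    random.seed(seed)
    regions = list(true_field)
    estimate = {r: 0.0 for r in regions}   # the model's current state
    variance = {r: 1.0 for r in regions}   # per-region model uncertainty
    for _ in range(steps):
        # steer the measurement process: sample where the model is least certain
        target = max(regions, key=lambda r: variance[r])
        z = sensor_reading(target, true_field)
        # assimilate the measurement: scalar Kalman-style update (noise R = 0.01)
        k = variance[target] / (variance[target] + 0.01)
        estimate[target] += k * (z - estimate[target])
        variance[target] *= (1.0 - k)
    return estimate

est = dddas_loop({"north": 2.0, "south": -1.0, "coast": 0.5})
```

  The point of the sketch is the two-way coupling: data flows into the executing application, and the application's uncertainty decides what gets measured next.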

  8. TeraGrid • A distributed system of unprecedented scale • 30+ TF, 1+ PB, 40 Gb/s net • Unified user environment across resources • User software environment • User support resources • Integrated new partners to introduce new capabilities • Additional computing, visualization capabilities • New types of resources: data collections, instruments • Created an initial community of over 500 users, 80 PIs • Created User Portal in collaboration with NMI (courtesy Charlie Catlett)

  9. DDDAS: Beyond Grid Computing “Extended Grid” – “SuperGRID”: the Application Platform is the computational & measurement system [diagram: applications spanning archival/stored data, computational platforms, instruments, and sensors – Measurement Grids coupled with Computational Grids] SuperGrids: Dynamically Coupled Networks of Data and Computations

  10. Examples of TeraGrid Applications • Aquaporin Mechanism – animation pointed to by the 2003 Nobel chemistry prize announcement (Schulten, UIUC) • Atmospheric Modeling (Droegemeier, OU) • Reservoir Modeling (Wheeler/UT Austin, Saltz/OSU, Parashar/Rutgers) • Lattice-Boltzmann Simulations (Coveney, UCL; Bruce Boghosian, Tufts) • Groundwater/Flood Modeling (Maidment, Wells, UT) Advanced Support for TeraGrid Applications: • TeraGrid staff are “embedded” with applications to create • Functionally distributed workflows • Remote data access, storage and visualization • Distributed data mining • Ensemble and parameter sweep run and data management (courtesy Charlie Catlett)

  11. To address the complexity of today’s and future systems, applications, and their environments, we need systematic modeling and analysis approaches for designing, supporting the runtime of, and managing such systems: Systems Performance Engineering

  12. Background • Systems Modeling and Analysis increasingly important for: • systems design cycle and runtime • measurements (static and runtime) • functional correctness of hw, hw and sw performance, dependability, reliability, power management, security, debugging, … • Traditionally/in the past (for example): • modeling of specific aspects/components, rather than the full system • architectural simulators trade speed for accuracy – full-system simulators trade accuracy for speed • Want modeling/simulation capabilities that: • are accurate – cycle-level resolution • completely model the entire system • simulate execution of real workloads (full applications or realistic benchmarks) on top of real OSes • allow users to probe features in the systems (hardware, systems software, application) • A number of research efforts are addressing such challenges, and more…

  13. System Modeling and Analysis Develop methods and tools for modeling, measuring, analyzing, evaluating, and predicting the performance, dependability, reliability, runtime management, debugging, security, etc., for design & runtime support of complex computing and communications systems • Hardware and Software modeling • methods, tools and measurements providing multimodal, hierarchical or multilevel modeling and analysis capabilities of such systems • methods that describe components of the system, but also the system as a whole, and enable assessment of the effects of individual hardware and software layers and components of these systems • ability to describe the system at multiple levels of detail (characteristics and time-scales) • combining different (hybrid) methods of describing components and layers – analytical, statistical, simulation, emulation, etc. • performance specification languages and compilers • testing & validation of developed methods and tools
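
  The hybrid-methods bullet – combining analytical, statistical, and simulation descriptions of the same component – can be made concrete with a minimal sketch (mine, not from the talk): an M/M/1 queue, where theory gives mean time-in-system T = 1/(μ − λ), cross-checked against a small discrete-event simulation of the same queue.

```python
import random

def mm1_analytic(lam, mu):
    """Analytical model: mean time in system for an M/M/1 queue."""
    return 1.0 / (mu - lam)

def mm1_simulate(lam, mu, n=200_000, seed=7):
    """Discrete-event simulation of the same single-server queue."""
    random.seed(seed)
    clock = depart = total = 0.0
    for _ in range(n):
        clock += random.expovariate(lam)         # next customer arrives
        start = max(clock, depart)               # waits if the server is busy
        depart = start + random.expovariate(mu)  # service completion time
        total += depart - clock                  # this customer's time in system
    return total / n

lam, mu = 0.5, 1.0
theory = mm1_analytic(lam, mu)   # 2.0 seconds mean time in system
sim = mm1_simulate(lam, mu)      # should agree with theory
```

  In a hybrid framework, the cheap analytical form would stand in for components where it is valid, with simulation reserved for the parts it cannot capture; agreement between the two is the validation step the slide calls for.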

  14. System Modeling and Analysis • Modeling and measurement approaches • capabilities to describe, analyze and predict the behavior of the components as well as the systems • analysis and prediction due to characteristics of, or changes in, the application, system software, hardware • multilevel and multi-modal approaches • Performance Frameworks • combine tools in “plug-and-play” fashion • multiple views of the system • Use of systems modeling and analysis methods and tools beyond the design cycle… that is: to support optimized application composition, mapping, and runtime with performance, dependability, fault-tolerance

  15. Systems Modeling and Analysis [layered architecture diagram: distributed applications and environments (collaboration, visualization, …) over services (scalable I/O, authentication/authorization, data management, fault recovery, archiving/retrieval) and distributed systems management, running on distributed, heterogeneous, dynamic, adaptive computing platforms and networks (CPU, memory, devices, down to technology); matched at each layer by models – application models, programming models/compilers/libraries/tools, file/IO models, OS scheduler models, architecture/network models, memory models – feeding performance frameworks]

  16. Multiple views of the system – The Operating System’s view [the layered diagram of slide 15 annotated from the OS perspective: application models, IO/file models, languages/compilers/libraries/tools, OS scheduler models, architecture/network models, and memory models over the distributed, heterogeneous, dynamic, adaptive computing platforms and networks; services include collaboration, visualization environments, scalable I/O, authentication/authorization, data management, archiving/retrieval, dependability services, other services, and distributed systems management]

  17. Technology for integrated feedback & control: Runtime Compiling System (RCS) and Dynamic Application Composition [diagram: the Application Model and a Dynamic Analysis Situation feed a Distributed Programming Model; the Application Program passes through a Compiler Front-End to an Application Intermediate Representation and then a Compiler Back-End; the system launches the application(s) and dynamically links & executes Application Components & Frameworks on Distributed Computing Resources (MPP, NOW, SP, and embedded elements such as tac-com, fire cntl, alg accelerator, data base, SAR), guided by Performance Measurements & Models – an Adaptable Computing Systems Infrastructure over a Distributed Platform]

  18. Great set of efforts that are developing systems modeling methods along these directions and leading to performance frameworks • Emphasis on Multidisciplinary Research (across sub-areas of CS) • Application-driven validation of research and technology advances • Collaborations with industry are fruitful • Projects can be found in the proceedings of the Next Generation Software Workshop Series, organized every year in conjunction with IPDPS

  19. GrADS Project & VGrADS PI: Ken Kennedy (& Dan Reed, Andrew Chien, Fran Berman, Dennis Gannon, Ian Foster, Jack Dongarra, et al.) Project Goals: develop program preparation system support for computational Grid applications, and technologies to support efficient run-time management of computational Grid resources and achieve reliable performance under varying load. [GrADSoft Architecture diagram: a Program Preparation System (source application, configurable object program, compiler, libraries, software components) and a Program Execution System (service negotiator, scheduler, Grid runtime system, dynamic optimizer, real-time performance monitor), linked by whole-program performance feedback and performance-problem negotiation] • Performance Contracts – at the heart of the GrADS model: • Fundamental mechanism for managing mapping and execution • What are they? • Mappings from resources to performance • Mechanisms for determining when to interrupt and reschedule • Abstract Definition • Random variable r(A, I, C, t0) with a probability distribution • A = app, I = input, C = configuration, t0 = time of initiation • Important statistics: lower and upper bounds (95% confidence) • Challenge • When should a contract be violated? • Strict adherence balanced against cost of reconfiguration
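
  The performance-contract idea on this slide – r(A, I, C, t0) with 95% bounds, and a violation policy weighed against reconfiguration cost – can be sketched as follows. This is an illustrative reading of the abstract definition, not code from GrADS; the function names, the runtime history, and the cost threshold policy are all hypothetical.

```python
import statistics

def contract_bounds(samples, z=1.96):
    """95% bounds on expected runtime from prior observations of r(A, I, C, t0)."""
    mean = statistics.mean(samples)
    half = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half, mean + half

def should_reschedule(measured, bounds, reconfig_cost):
    """Violate the contract only when the projected loss from staying put
    exceeds the cost of interrupting and rescheduling the application."""
    lo, hi = bounds
    if measured <= hi:
        return False             # within contract: keep running
    return (measured - hi) > reconfig_cost

history = [10.2, 9.8, 10.5, 10.1, 9.9]   # prior runtimes (s) for this (A, I, C)
bounds = contract_bounds(history)         # roughly (9.86, 10.34)
```

  With these numbers, a measured runtime of 10.4 s technically exceeds the upper bound but is not worth a 5 s reconfiguration, while a 30 s runtime is – which is the "strict adherence balanced against cost of reconfiguration" trade-off in the last bullet.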

  20. Dynamic Adaptive Systems Software for Robust and Dependable Large-Scale Systems {Adve & Sanders}

  21. Montage – An Integrated End-to-End Design and Development Framework for Wireless Networks PI: Rappaport (& Browne, Shakkottai, Ramakrishnan, Varadarajan) {UT Austin, VTech} • Project advanced the state-of-the-art in fast and efficient methods for simulating large-scale networks • Deliverables: • generated a wide range of analytical and simulation-based modeling methods • developed a wireless channel simulator (the Site Specific Software Simulator for Wireless – S^4W) • S^4W was used by the PIs to develop more powerful and efficient techniques for end-to-end improved network performance for users of both wired and wireless networks • S^4W has been used by several universities (in the US and Canada), industry (Boeing), NASA, and commercial business (Schlotzky’s deli) • Developed fast simulation capabilities for networks • fast hybrid network simulation using spatiotemporal dilations • FluNet: hybrid simulation-emulation environment, based on combined fluid models • Developed a scalable parallel discrete event simulator (Shakkottai, Ramakrishnan) • Open Network Emulator • highly scalable distributed direct code execution environment; supports both simulation and emulation in a single tool; novel method using the notion of Relativistic Time, so that the global virtual time is derived by dilating the real (wall-clock) time • Productivity with Performance through Components & Composition (Browne) • P-COM^2 environment: automated compile-time/runtime composition of parallel programs – applied here to performance modeling
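
  The fluid-model approach behind FluNet on this slide replaces per-packet events with mean traffic rates: a link queue evolves as dq/dt = a(t) − c while non-empty. A minimal sketch of that idea (my illustration, with made-up rates – not FluNet code):

```python
def fluid_queue(arrival_rate, capacity, dt=0.01, t_end=10.0):
    """Integrate queue length (packets) at one link under a mean-rate model."""
    q, t, trace = 0.0, 0.0, []
    while t < t_end:
        # dq/dt = a(t) - c, clamped at zero when the queue is empty
        q = max(0.0, q + (arrival_rate(t) - capacity) * dt)
        trace.append(q)
        t += dt
    return trace

# a traffic burst that exceeds the 100 pkt/s link capacity between t=2 and t=4
burst = lambda t: 150.0 if 2.0 <= t < 4.0 else 50.0
trace = fluid_queue(burst, capacity=100.0)
```

  The queue builds at 50 pkt/s during the burst (peaking near 100 packets) and drains back to empty afterward – one integration step per dt instead of one event per packet, which is where the scalability of hybrid fluid/packet simulation comes from.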

  22. A Fast, Cycle-Accurate Computer System Technology

  23. Fast and Accurate Simulation of Scalable Computer Systems {Falsafi & Hoe} ProtoFlex addresses full-system and scaling complexity for FPGA-based simulation in two ways. Hybrid emulation (a) avoids reconstruction of the entire system on FPGAs. Interleaved emulation (b) lets us decouple the size and complexity of the simulated system from that of the underlying FPGA host. (a) Hybrid Emulation (b) Multiple-context Interleaved Emulation

  24. Examples of Modeling & Analysis Efforts (Performance Modeling Frameworks) • FPGA Accelerated Simulation Technologies – functional simulator + timing model (implemented in FPGAs) for the fastest cycle-accurate, full-system simulation (within 1–3 orders of magnitude of real hw) • Fast and accurate simulation through sampling: checkpointing to capture the microarchitectural state, and performing cycle-accurate simulation in the selected sampled regions, to simulate full (unmodified) applications • Structural and composable performance simulation of complex systems: the effort constructs simulators from system descriptions and component libraries (e.g. produced in 11 wks an Itanium2 simulator accurate to 3% of actual hardware) • Real-time large-scale network simulation environment, through a hybrid of continuous and event-driven simulation paradigms: a fluid-model representation of the mean traffic and a packet-oriented simulation. The hybrid testbed will combine advantages of analytical models, simulation and emulation, and physical network testbeds. • Component-based software environment for simulation, emulation and synthesis of network protocols, integrating model-checking with event-driven simulations to allow performance evaluation and protocol validation in a unified way • End-to-end design and development framework for large-scale wireless networks – composed through capabilities developed under problem solving environments: application compile-time and runtime composition methods to compose the simulation and emulation systems for setting up experimental testbeds, performance engineering methods (of the POEMS project), the Weaves runtime and P-COM for parallel/distributed execution of discrete event simulations, and integration of low-level channel models with higher-level protocol layers and the relativistic-time temporal model developed under the collaboration.
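
  The sampling bullet above – cycle-accurate simulation only inside selected sample windows, fast-forwarding in between – can be sketched with stand-ins for the functional and timing models. Everything here is illustrative (a synthetic phased workload, not a real simulator); the point is that a tiny detailed fraction of the run can recover the whole-program CPI.

```python
def true_cpi(i):
    """Stand-in timing model: CPI alternates by program phase every 10k instructions."""
    return 1.2 if (i // 10_000) % 2 == 0 else 2.8

def sampled_cpi(n_instr, window=100, period=5_000):
    """Detailed-simulate only `window` instructions out of every `period`."""
    cycles = instrs = 0
    i = 0
    while i < n_instr:
        for j in range(i, min(i + window, n_instr)):   # cycle-accurate window
            cycles += true_cpi(j)
            instrs += 1
        i += period                                     # fast-forward the rest
    return cycles / instrs

est = sampled_cpi(100_000)                              # detailed on 2% of the run
full = sum(true_cpi(i) for i in range(100_000)) / 100_000
```

  Here the estimate matches the exhaustive average while timing only 2% of the instructions; real systems (per the bullet) add checkpointing so the microarchitectural state at each window is warmed up rather than assumed.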

  25. Examples of Modeling & Analysis Efforts (Application modeling, resource management, …) • Modeling system for enabling algorithm designers and programmers to develop, evaluate and compare application algorithms for CMP/CMT systems • Software tools to enable access to coordinated information collected through hardware-based profiling of local and remote memory accesses and of application computation and communication patterns • Dynamic profiling of application phases for optimizing power consumption under set performance constraints, for reconfigurable multi-core environments and data servers • Cross-platform performance estimation by partial execution of applications: capturing computation and communication parameters, and generalizing prediction to problem-scaling scenarios, in parallel and distributed platforms • Language support for continuous monitoring of distributed systems, grids and other data-centric and network systems • Adaptive resource sharing mechanisms autonomically matching resources to dynamically changing needs via statistical and stochastic approaches • Data-driven resource allocation in complex systems, through workload characterization, analytical models and policy development • Compiler-enabled model- and measurement-driven adaptation environment for dependability and performance (performability) • Engineering reliability at software design time by coupling software component architectural models with statistical methods to address uncertainties at the design stage • Tools for proactive runtime system health monitoring and enhancement for large-scale parallel systems: collecting data over extended periods of time and in real time, analyzing it through on-line models, filtering and correlating evolving failure data with respect to factors such as workload and operating temperature, and using this information to schedule or checkpoint jobs
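
  The partial-execution bullet above – run a few iterations on the target platform, capture cost parameters, extrapolate to full problem scale – can be sketched as a fit of a simple cost model t(n) = a + b·n. The timings below are synthetic placeholders (a real harness would use `time.perf_counter()` around kernel iterations); the linear model is the simplest instance of the idea, not the specific technique of that project.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# synthetic partial-execution timings: 0.5 s startup + 2 ms per iteration
sizes = [100, 200, 400, 800]
times = [0.5 + 0.002 * n for n in sizes]

a, b = fit_linear(sizes, times)       # recovers startup cost and per-iteration cost
predicted_full = a + b * 1_000_000    # extrapolate to the full problem size
```

  Separating the fixed startup term from the per-unit term is what lets a measurement on a small partial run generalize to problem-scaling scenarios; richer models would add communication terms per the bullet.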

  26. Summary Thoughts • Large-scale high-end systems cannot be treated as isolated platforms • Such systems demand enhanced and optimized computation, communication and data management capabilities, in the presence of resource heterogeneity, dynamicity, adaptivity • Need to advance the technologies that will automate the mapping of complex and dynamic applications onto complex platforms with multiple and heterogeneous levels of processors, memory, and networks • Modeling and analysis methods – performance engineering of systems – are crucial in enabling optimized design, runtime, and management of such systems

  27. Award 0406351: A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance William Sanders and Vikram Adve Develops compiler-controlled performance data monitoring together with performance models for adaptive and optimized runtime support, in environments where the underlying computational, communication, and storage resources may be changing, as well as environments where the application requirements may also be changing. Combines, and advances in novel directions, work on dynamic runtime compilation methods (LLVM) developed by Adve in 0093426 (CAREER) – NGS: Techniques and Applications of Dynamic Compilation – and system-level integrated performance methods developed by Sanders in 0228762 – Next Generation Software: An Integrated Framework for Performance Engineering and Resource-Aware Compilation. In addition to the multidisciplinary work from two sub-areas of computer science – compilers and performance modeling and analysis – the project includes collaboration with industry, specifically with two senior researchers from ATT Labs-Research, which provides resources such as production-level software to drive and validate the research methods, and also provides opportunities for student internships at the ATT Research Lab. Dynamic Adaptive Systems Software for Robust and Dependable Large-Scale Systems. Other technical impacts of the individual projects: • Möbius is a performance engineering framework and tool for the evaluation of distributed and parallel computing systems, accounting for system components including the application software itself, the operating system, and the underlying computing and communication hardware. The framework provides a means by which multiple, heterogeneous models can be composed together, each representing a different module (software or hardware), component, or view of the system. Möbius has made a significant worldwide impact in the research area of stochastic model analysis.
The impact spans both academic and commercial domains. In addition to being the principal tool used in the graduate-level system reliability courses at the University of Illinois, USA and the Univ. of Florence, Italy, Möbius has been licensed to over 150 university sites throughout the world for teaching and research purposes. Research groups from the Univ. of Twente, Dortmund University, the University of the Federal Armed Forces München, and Saarland University are partnering with the Möbius team to develop plug-in modules for the Möbius framework. The first International Möbius Developer’s Working Group meeting was held in Sept. 2004, further increasing the number of groups that use Möbius in their research. Möbius has also been licensed for commercial use to many companies, including Motorola, Iridium, Pioneer Hybrids, Windber Research Institute, General Dynamics and Boeing. For example, Möbius has been used for numerous telecommunications and computer system applications at Motorola and was designated one of three company-wide system availability modeling packages. Recently, researchers have begun to use Möbius for biological applications; over 25 universities, Pioneer Hybrid (the world's largest seed producer) and Windber Research Incorporated (a non-profit research organization with projects studying the disease progression of breast cancer) have licensed it for use with biological systems. • The LLVM compiler infrastructure has been publicly distributed since October 2003 and downloaded well over 2000 times since. It has attracted at least 40 serious users in academia (instructors and researchers) and industry (startups and established companies). • Apple Computer has adopted LLVM and set up an active group of developers working on incorporating LLVM in Apple’s products, such as the next release of MacOS due in Spring 2007. • A paper on Automatic Pool Allocation – novel methods developed under the project and incorporated in LLVM – won a Best Paper award at the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), the premier conference in the area of compilers.
