SmartApps: Application Centric Computing with STAPL

SmartApps: Application Centric Computing with STAPL Lawrence Rauchwerger http://parasol.tamu.edu/~rwerger Parasol Lab, Dept of Computer Science, Texas A&M N. Amato, B. Stroustrup, M. Adams

STAPLApplication DataBase Get Runtime Information (Sample input, system information, etc.) Predictor &Evaluator Execute Application Continuously monitor performance and adaptas necessary Predictor &Optimizer SmartApps Architecture Static STAPL Compiler Predictor &Optimizer Augmented withruntime techniques Compiled code + runtime hooks Smart Application Compute Optimal Application Predictor &Evaluator and RTS + OS Configuration Large adaptation(failure, phase change) advanced stages development stage Toolbox Recompute Application Configurer and/or Reconfigure RTS + OS Adaptive Software Adaptive RTS+ OS Runtime tuning (w/o recompile) Small adaptation (tuning)

Outline • SmartApps Concept • STAPL is vehicle we use to develop the SmartApps Framework • STAPL • Overview • High Level Adaptivity in STAPL • Consistency Models • FAST - Framework for Algorithm Selection & Tuning • RTS • Compiler (Pivot)

STAPL: A library of parallel, generic constructs based on the C++ Standard Template Library (STL). Components for Program Development pAlgorithms, pContainers, Views, pRange Portability and Optimization STAPL RTS and Adaptive Remote Method Invocation (ARMI) Communication Library Framework for Algorithm Selection and Tuning (FAST) STAPL: Standard Template Adaptive Parallel Library

Applications Using STAPL • Particle Transport - TAXI • Bioinformatics - Protein Folding • Geophysics - Seismic Ray Tracing • Aerospace - MHD • Seq. “Ctran” code (7K LOC) • STL (1.2K LOC) • STAPL (1.3K LOC)

Consistency Models Processor Consistency (default and currently supported model) • Accesses from a processor on another’s memory are sequential • Requires in-order processing of RMIs • Limited parallelism Object Consistency (future) • Accesses to different objects can happen out of order • Uncovers fine-grained parallelism • Accesses to different objects are concurrent • Potential gain in scalability • Can be made default for specific computational phases Mixed Consistency (future) • Use Object Consistency on select objects • Selection of objects fit for this model can be: • Elective – the application can specify that an object’s state does not depend on others’ states. • Detected – if it is possible to assert the absence of such dependencies • Use Processor Consistency on the rest

Algorithm Selection Problem Description: Given multiple implementations of anabstract operation with a specified execution environmentandinput data,choose one that maximizesperformance • Our Objective: Create general framework for parallelalgorithm selection • Applicable to any abstract operation: sorting, matrix multiplication, convex hull, … • Flexible specification of execution environment and input data parameters • Environment: #procs, memory interconnection, cache, thread management, OS policies • Input data: data type, layout, size, other properties (e.g., sortedness of input) • Generic modeling interface – interchange different approachesUse generic machine learning approaches • Our Strategy: FAST: Framework for Algorithms Selection & Tuning • Integrated approach within STAPL • Selection is transparent to end user • Library adaptively choose the best algorithm at run-timefrom a library of implementation choices

FAST Architecture

FAST Architecture • Problem Specification • User specifies parameters that may affect performance (e.g., num processors, input size, algorithm specific) and ranges (e.g, input size=100M..200M). • User supplies list of candidate implementations.

FAST Architecture • Metric Collection • User provides method to collect data about an implementation execution (e.g., presortedness measure).

FAST Architecture • Instance Selection • FAST selects instances to execute and use when making decision model.

FAST Architecture • Instance Execution • FAST invokes user interface to execute test instances and collect metrics and timings.

FAST Architecture • Database Insertion • FAST inserts results of training instances into database.

FAST Architecture • Model Generation • FAST uses training instances to generate decision model.

FAST Architecture • Selection Querying • FAST provides interface for user application to query generated model.

Parallel Sorting: Experimental Results • Attributes for Selection Model • Processor Count • Data Type • Input Size • Max Value (impacts radix sort) • Presortedness SGI Altix Selection Model SGI Altix Validation Set (V1) – 100% Accuracy N=120M Adaptive Performance Penalty

Parallel Sorting - Altix Relative Performance (V2) • Model obtains 99.7% of the possible performance. • Next best algorithm (sample) provides only 90.4%.

FAST: Future Work • Improve and refine algorithm selection framework • Refine sampling for training input selection • Algorithm developer specified models • Online feedback/model refinement (incremental learning) • Machine characterization & microbenchmarks • Expand employment in STAPL. • pContainers - locking strategies, data (re)distribution, consistency model • pAlgorithms - more algorithms, parameter tuning • pRange - work granularity and scheduling policies • ARMI Runtime System - dynamic aggregation.

RTS – Current state Comm. Thread RMI Thread Task Thread Smart Application Application Specific Parameters STAPL RTS ARMI Executor Memory Manager Advanced stage ARMI Executor Experimental stage: multithreading Custom scheduling K42 User-Level Dispatcher Kernel scheduling Kernel Scheduler (no custom scheduling, e.g. NPTL) Operating System

ARMI – Current State ARMI: Adaptive Remote Method Invocation • Abstraction of shared-memory and message passingcommunication layer (MPI, pThreads, OpenMP, mixed, Converse). • Programmer expresses fine-grain parallelism that ARMI adaptively coarsens to balance latency versus overhead. • Support for sync, async, point-to-point and group communication. • Automated (de)serialization of C++ classes. ARMI can be as easy/natural as shared memory and as efficient as message passing.

ARMI Communication Primitives Point to Point Communication armi_async - non-blocking: doesn’t wait for request arrival or completion. armi_sync - blocking and non-blocking versions. Collective Operations armi_broadcast, armi_reduce, etc. can adaptively set groups for communication. Synchronization armi_fence, armi_barrier - fence implements distributed termination algorithm to ensure that all requests sent, received, and serviced. armi_wait - blocks until at least at least one request is received and serviced. armi_flush - empties local send buffer, pushing outstanding to remote destinations.

RTS Current Work - Multithreading In ARMI • Specialized communication thread dedicated the emission and reception of messages • Reduces latency, in particular on SYNC requests • Specialized threads for the processing of RMIs • Uncovers additional parallelism (RMIs from different sources can be executed concurrently) • Provides a suitable framework for future work on relaxing the consistency model and on the speculative execution of RMIs In the Executor • Specialized threads for the execution of tasks • Concurrently execute ready tasks from the DDG (when all dependencies are satisfied)

TAXI Experimental Results • uBGL - 1024 node, 2048 processor unclassified Blue Gene cluster at LLNL. • Results shown for two machine configurations. • Virtual Mode - two independent computation processes per node. • Coprocessor - one processor per node for computation and one for communication.

TAXI Results on uBGL

Smart Application Virtualization Layer Scheduler Virtualization Provides a unified view of the RTS services M:N 1:1 Other RTS (E.g., Converse, Marcel …) Best Affinity Unsupported System RTS – The Next Generation Application Specific Parameters STAPL RTS implementation ARMI Communicator Memory Manager Executor Synchronization Server (fence, distributed locks …) Threading Model Smart User-Level Scheduler K42 User-Level Dispatcher Kernel Scheduler (no custom scheduling, e.g. NPTL) Supported System

RTS Virtualization Another level of portability • Allow porting the Smart Application to another runtime system • Achieving the best possible affinity between RTS and OS If another RTS is better for a specific system, use it! • Abstract the complexity of supporting a heterogeneous group of heterogeneous systems Avoid the special-case implementation paradigm • Providing support for multiple architectures is costly • In time and maintenance • In performance (a system supported as a special case will invariably present poorer performance than the principal targets). • Why emulate the wheel ?

RTS Threading Models 1:1 threading model : (1 user-level thread mapped onto 1 kernel thread) • Default kernel scheduling • Heavy kernel threads M:N threading model : (M user-level threads mapped onto N kernel threads) • Customized scheduling • Enables scheduler-based optimizations (e.g. priority scheduling, better support for relaxing the consistency model …) • Light user-level threads • Smaller threading cost • Can match N with the number of available hardware threads : no kernel-thread swapping, no preemption, no kernel over-scheduling … • User-level thread scheduling requires no kernel trap • Perfect and free load balancing within the node • User-level threads are cooperatively scheduled on the available kernel threads (they migrate freely).

RTS Executor Customized task scheduling • Executor maintains a ready queue (all tasks for which dependencies are satisfied in the DDG) • Order tasks from the ready queue based on a scheduling policy (e.g. round robin, static block or interleaved block scheduling, dynamic scheduling …) • The RTS decides the policy, but the user can also specify it himself • Policies can differ for every pRange Customized load balancing • Implement load balancing strategies (e.g. work stealing) • Allow the user to choose the strategy • K42 : generate a customized work migration manager

RTS Synchronization • Efficient implementation of synchronization primitives is crucial • One of the main performance bottlenecks in parallel computing • Common scalability limitation Fence • Efficient implementation using a novel Distributed Termination Detection algorithm Global Distributed Locks • Symmetrical implementation to avoid contention • Support for logically recursive locks (required by the compositional SmartApps framework) Group-based synchronization • Allows efficient usage of ad-hoc computation groups • Semantic equivalent of the global primitives • Scalability requirement for large-scale systems

The Pivot:Static Analysis of C++ Applications (lead: Bjarne Stroustrup) Object code C++ source Tool 1 Compiler Compiler Compiler IDL Tool 2 IPR C++ source Tool 3 Tool 4 Specialized representation (e.g. flow graph) XPR “information”

Context for the Pivot • Semantically Enhanced Library (Language) • Enhanced notation through libraries • Restrict semantics through tools • And take advantage of that semantics Domain Specific Library C++ Semantic Restrictions • Bell Labs Proverbs • Library design is language design • Language design is library design

Context for the Pivot Domain Specific Library C++ • Provide the advantages of specialized languages • Without introducing new “special purpose” languages • Without supporting special-purpose language tool chains • Avoiding the 99.?% language death rate • Provide general support for the SELL idea • Not just a specialized tool per application/library • The Pivot fits here Semantic Restrictions

Current and future work • Complete infrastructure • Complete EDG and GCC interfaces • Represent headers (modularity) directly • Complete type representation in XPR • Initial applications • Style analysis • including type safety and security • Analysis and transformation of STAPL programs • Build alliances • Currently: compiles Firefox • Inserts measurement & control for SmartApps

Conclusion • STAPL is a good platform for developing SmartApps • used on several large-scale, complex applications • We plan to release a version soon to friendly users • Intel TBB – The simple STAPL for multicores University research into commercial product • More info at http://parasol.tamu.edu/

SmartApps: Application Centric Computing with STAPL

SmartApps: Application Centric Computing with STAPL

Presentation Transcript

Net Centric Computing

Information Centric Super Computing

Introduction to Net Centric Computing

Advance Net-Centric Computing Technology

ISRC Future Technology Briefing: Human-Centric Computing

Semantically Rich Application-Centric Security in Android

Lecture 3: Introduction to Net-centric Computing

Application Centric Infrastructure

Web-centric Computing:

User-Centric Computing

NETWORK CENTRIC COMPUTING

Net-Centric Computing Overview

Net-Centric Computing Content Formats

CS 290B Java-centric Network Computing

Candidate Collaborative Projects for Net-Centric Application

Capital Computing Application

Information Centric Super Computing

Net Centric Computing

Web-centric Computing

Reliability Analysis in Net-Centric Computing

End User Computing An User Centric Process

Cloud Computing Application Development