High Performance Programming with C++

High Performance Programming with C++ Hafiza Rabbia Ibrahim July 25, 2011 R.Ibrahim (CE Master)

Outline
• Motivation
• Return Value Optimization (RVO)
• Inlining
• Standard Template Library (STL)
• Constructors and Destructors
• Virtual Functions
• Coding Optimization

Motivation

Return Value Optimization (RVO)

Why?
“An optimization often performed by compilers to speed up your source code by transforming it and eliminating object creation.”

For instance, let’s walk through a simple example with complex numbers. Without optimization, the compiler-generated code for Complex_Add() is:

The compiler can optimize Complex_Add() by eliminating the local object retVal and replacing it with __tempResult. This is RVO:
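The code listings from these slides are not part of the transcript. The transformation described above can be sketched as follows. This is a minimal illustration, not actual compiler output: the Complex class layout and the two function names Complex_Add_NoRVO and Complex_Add_RVO are assumptions, the out-parameter tempResult stands in for the hidden __tempResult, and a plain assignment stands in for the copy construction a real compiler would emit.

```cpp
// Hypothetical Complex class, assumed for illustration only.
class Complex {
public:
    Complex(double r = 0.0, double i = 0.0) : real(r), imag(i) {}
    double real;
    double imag;
};

// Conceptual shape of the code WITHOUT RVO: the result is built in a
// local object and then copied into storage supplied by the caller.
void Complex_Add_NoRVO(Complex& tempResult, const Complex& a, const Complex& b) {
    Complex retVal;                 // local object is constructed
    retVal.real = a.real + b.real;
    retVal.imag = a.imag + b.imag;
    tempResult = retVal;            // extra copy into the caller's storage
}                                   // retVal is destroyed here

// Conceptual shape WITH RVO: the local object is eliminated and the
// result is computed directly in the caller's storage.
void Complex_Add_RVO(Complex& tempResult, const Complex& a, const Complex& b) {
    tempResult.real = a.real + b.real;
    tempResult.imag = a.imag + b.imag;
}
```

Both versions produce the same value; the second one skips one object construction, one copy, and one destruction per call.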

Execution time comparison

Is it mandatory?
• NO!
• Whether RVO is applied is at the discretion of the compiler implementation. Consult your compiler documentation, or experiment, to find out if and when RVO is applied.

INLINING

What we are avoiding: Method Invocation Costs

Why Inline?
• Arguably the most significant performance enhancement technique available in C++.
Program’s Fast Path
• The portion of a program that supports the normal, error-free, common usage cases of the program’s execution.
• Typically less than 10% of the program’s code lies on this fast path.
Inlining and Fast Path
“Inlining allows us to remove calls from the fast path.”

Inlining Performance Story
• Performance of avoiding expensive method invocation
• Cross Call Optimization Performance

Performance gain of avoiding method invocation
• When outlined: 62 seconds execution time
• When inlined: 8 seconds execution time
• Here, inlining provided a performance gain of roughly 8x

Performance gain of Cross Call Optimization

Performance gain of Cross Call Optimization (cont.)
• If inlined: the compiler can simplify the code and the calculations across the call.
• If outlined: no cross-method optimization is possible; only intra-method optimization is possible.
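A minimal sketch of what cross-call optimization can look like, assuming two hypothetical functions, square and sumOfSquares, that are not taken from the slides. Whether a given compiler actually performs the folding described in the comments depends on the compiler and its optimization settings.

```cpp
// Small helper defined in the same translation unit, so the compiler
// can see its body at every call site.
inline int square(int x) { return x * x; }

int sumOfSquares() {
    // Once both calls are inlined, the expression becomes 3 * 3 + 4 * 4.
    // An optimizing compiler is then free to fold it to a constant.
    // If square() were outlined in another translation unit, the compiler
    // would typically see only two opaque calls here.
    return square(3) + square(4);
}
```

The result is the same in both cases; the difference is only in how much work is left for run time.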

Why not Inline?
If inlining is that good, why don’t you inline everything?

Issues with Inlining
• Size of the program’s generated code increases
• Storage issues
  • multiple instances -> each has a unique address
  • each instance takes up space in the cache -> less effective cache capacity
  • higher cache capacity miss rate
• Degenerative characteristics
  • exponential code growth

Inlining A, B, C, and D will increase the code size by more than 70K bytes, i.e., a 37x increase.

When should you inline?

How we are avoiding: Inlining Optimization Tricks
When INLINE is not defined, the .h file will not include the inlined methods; instead, these methods will be included in the .c file, and the inline directive will be stripped from the front of each method.
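The conditional-inlining trick described above can be sketched in a single listing. This is an illustration under assumptions: the Counter class, the file names counter.h, counter.inl, and counter.cpp, and the MAYBE_INLINE macro are all hypothetical, and the macro is just one possible way to strip the inline directive.

```cpp
// ---- counter.h (hypothetical) ----
// MAYBE_INLINE expands to "inline" only when INLINE is defined.
#if defined(INLINE)
#define MAYBE_INLINE inline
#else
#define MAYBE_INLINE
#endif

class Counter {
public:
    Counter() : value_(0) {}
    int next();
private:
    int value_;
};

// In a multi-file layout, counter.h would end with:
//   #if defined(INLINE)
//   #include "counter.inl"
//   #endif
// and counter.cpp would contain:
//   #if !defined(INLINE)
//   #include "counter.inl"
//   #endif

// ---- counter.inl (hypothetical) ----
// The method definitions live here exactly once; the build flag decides
// whether they are compiled as inline or as ordinary outlined methods.
MAYBE_INLINE int Counter::next() { return ++value_; }
```

Building with and without -DINLINE then allows the same source to be measured both ways before committing to either.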

Concluding words about Inlining
• Inlining “might” improve performance.
• Inlining may backfire, i.e., increase the size of the code.
Be sure about the real cost of calls on your system before using inlining!

Standard Template Library (STL)

Questions to be answered
• Faced with a given computational task, what containers should I use? Are some better than others for a given scenario?
• How good is the performance of the STL? Can I do better by rolling my own home-grown containers and algorithms?

Execution time Comparisons
INSERTING AT THE FRONT
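The chart for this slide is not part of the transcript. A comparison of this kind could be reproduced with a small timing helper such as the sketch below. The helper name time_front_inserts and the choice of containers to test (for example std::vector and std::list) are assumptions, not the author's benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <vector>

// Inserts n integers at the front of the container and returns the
// elapsed time in microseconds.
template <typename Container>
long long time_front_inserts(Container& c, std::size_t n) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        c.insert(c.begin(), static_cast<int>(i));
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling it once with a std::vector<int> and once with a std::list<int> for the same n gives two timings to compare. Front insertion into a vector has to shift every existing element, while a list only relinks nodes, so the gap is expected to grow with n.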

Execution time Comparisons (cont.)
DELETING ELEMENTS AT THE FRONT

Execution time Comparisons (cont.)
Container traversal speed

Can I do better?

Comparing STL speed to home-grown code

Conclusions about STL performance
• Outperforming the STL is possible.
• You have to bend over backwards to concoct scenarios in which a home-grown implementation outperforms the STL.
• To outperform the STL, a home-grown implementation must have something better that the STL does NOT have!

Constructors and Destructors

Why this analysis?
• The performance of constructors and destructors is often poor because an object's constructor (destructor) may call the constructors (destructors) of member objects and parent objects.
• This can result in constructors (destructors) that take a long time to execute, especially with objects in complex hierarchies or objects that contain several member objects.
• Hence a performance hit!

Connection between the cost of constructors/destructors and inheritance-based design
• Encounter: implementation of thread synchronization constructs
• In multithreaded applications, there should be thread synchronization to restrict concurrent access to shared resources
• Thread synchronization constructs can be any of:
  • Semaphore
  • Mutex
  • Critical Section

Strategy:
• Encapsulate the lock in an object, e.g. a MutexLock object
• Let the constructor obtain the lock
• The destructor will release the lock automatically (as it does for regular objects)
• The compiler inserts a call to the lock destructor prior to each return statement
• And the lock is always released!
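The strategy above can be sketched with the standard library's std::mutex. This is a minimal illustration, not the implementation from the slides: the MutexLock body, the global counterMutex and sharedCounter, and the function incrementIfBelow are all assumptions made for the example.

```cpp
#include <mutex>

// Hypothetical lock wrapper: the constructor obtains the lock and the
// destructor releases it.
class MutexLock {
public:
    explicit MutexLock(std::mutex& m) : mutex_(m) { mutex_.lock(); }
    ~MutexLock() { mutex_.unlock(); }
    MutexLock(const MutexLock&) = delete;
    MutexLock& operator=(const MutexLock&) = delete;
private:
    std::mutex& mutex_;
};

std::mutex counterMutex;   // protects sharedCounter
int sharedCounter = 0;

// Increments the counter unless it has reached the limit.
// The lock is released on both return paths without any explicit call.
int incrementIfBelow(int limit) {
    MutexLock guard(counterMutex);
    if (sharedCounter >= limit) {
        return sharedCounter;   // early return: destructor releases the lock
    }
    ++sharedCounter;
    return sharedCounter;       // normal return: destructor releases the lock
}
```

The standard library offers the same pattern ready-made as std::lock_guard; the hand-written class is shown only to make the constructor and destructor roles visible.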

Performance Comparison
Constructor/destructor behaviour with a mutex in the case of:
• Non-inherited object
• Inherited object

Lock class implementation

BaseLock class implementation
This class is intended as a root class for the various lock classes that are expected to be derived from it.

Subclass of BaseLock: MutexLock class implementation
The LogSource object is meant to capture the filename and source code line where the object was constructed.

MutexLock constructor

MutexLock Destructor

Non-inherited Mutex Object
A SimpleMutex object, from a class containing acquire() and release() methods

Inherited Mutex Object
Replace SimpleMutex with DerivedMutex (an object of a class derived from BaseMutex)
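The class listings for these slides are not part of the transcript. The two variants being compared might look roughly like the sketch below. Only the names SimpleMutex, DerivedMutex, BaseMutex, acquire(), and release() come from the slides; the class bodies, the use of std::mutex, and the helper lockedIncrements are assumptions made for illustration.

```cpp
#include <mutex>

// Stand-alone lock object: no base class, no virtual functions.
class SimpleMutex {
public:
    explicit SimpleMutex(std::mutex& m) : mutex_(m) { acquire(); }
    ~SimpleMutex() { release(); }
private:
    void acquire() { mutex_.lock(); }
    void release() { mutex_.unlock(); }
    std::mutex& mutex_;
};

// Root class of a lock hierarchy; its constructor and virtual destructor
// run in addition to those of every derived lock.
class BaseMutex {
public:
    explicit BaseMutex(std::mutex& m) : mutex_(m) {}
    virtual ~BaseMutex() {}
protected:
    std::mutex& mutex_;
};

class DerivedMutex : public BaseMutex {
public:
    explicit DerivedMutex(std::mutex& m) : BaseMutex(m) { mutex_.lock(); }
    ~DerivedMutex() override { mutex_.unlock(); }
};

// Takes and releases the lock n times around a counter increment.
// The Lock type is constructed and destroyed once per iteration.
template <typename Lock>
int lockedIncrements(int n) {
    std::mutex m;
    int counter = 0;
    for (int i = 0; i < n; ++i) {
        Lock guard(m);
        ++counter;
    }
    return counter;
}
```

Timing lockedIncrements<SimpleMutex> against lockedIncrements<DerivedMutex> for a large n is one way to observe the extra constructor and destructor work that the derived version carries.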

Execution Time comparison

Concluding Remarks
• Distinguish between overall computational cost, required cost, and computational penalty.
• Eliminate the part that is not needed, using some other mechanism.
• Overall cost increases with the size of the derivation tree.

VIRTUAL FUNCTIONS

Impact on performance
• A class with virtual functions gets a virtual function table (vtbl), and each object of that class is assigned a pointer to it (the vptr).
Virtual functions seem to inflict a performance cost in several ways:
• The vptr must be initialized in the constructor.
• VFs are called using pointer indirection, resulting in a few extra instructions per method invocation.
• Inlining is a compile-time decision. The compiler cannot inline VFs whose resolution takes place at run time.
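A minimal sketch of the difference between a direct and a virtual call, assuming three hypothetical types (Plain, WithVirtual, Doubled) and a helper callThroughBase that are not taken from the slides. On typical implementations sizeof(WithVirtual) is larger than sizeof(Plain) because of the vptr, but that layout is an implementation detail.

```cpp
// No virtual functions: no vptr, and get() can be resolved and inlined
// at compile time.
struct Plain {
    int value = 0;
    int get() const { return value; }
};

// Has virtual functions: each object carries a vptr that the
// constructor must initialize.
struct WithVirtual {
    int value = 0;
    virtual ~WithVirtual() {}
    virtual int get() const { return value; }
};

struct Doubled : WithVirtual {
    int get() const override { return 2 * value; }
};

// The dynamic type of obj is not known here, so the call is normally
// dispatched through the vtbl at run time.
int callThroughBase(const WithVirtual& obj) {
    return obj.get();
}
```

Passing a Doubled object to callThroughBase selects Doubled::get at run time, which is exactly the resolution step that prevents the compiler from inlining the call in the general case.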

Performance Comparison for virtual and Non-virtual methods
• Creating objects of a class with virtual functions costs more than creating objects of a class without them, because the vptr must be initialized.
• It also takes slightly longer to call virtual functions, because of the additional level of indirection.

Performance Comparison for virtual and Non-virtual methods (cont.)

• Construction/destruction shows the performance penalty of initializing the vptr.
• Virtual function invocation is slightly more expensive than invoking a function through a function pointer: memory overhead.
