developing high performance hp-ux applications on the Intel® Itanium® processor

Presentation Transcript


  1. developing high performance hp-ux applications on the Intel® Itanium® processor • Hewlett-Packard, June 2003

  2. Key Intel® Itanium® Processor Family Features • predication • speculation • support for modulo scheduling • rotating registers

  3. predication • allows instructions to be dynamically turned on or off using a predicate register value • example:
        cmp.eq p1, p2 = r1, r2 ;;
        (p1) add r1 = r2, r4
        (p2) ld8.sa r7 = [ r8 ], 8
     • if p1 is true, the add is performed, else it acts as a nop • if p2 is true, the ld8 is performed, else it acts as a nop
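
The effect of the predicated sequence above can be sketched in plain C. This is only a model of the semantics, not of how the hardware works: the `regs_t` structure and both function names are hypothetical, and the speculative `.sa` completer on the load is ignored. Each guarded statement stands for one predicated instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file for this sketch; not a real Itanium model. */
typedef struct {
    int64_t r1, r2, r4, r7;
    const int64_t *r8;              /* address register used by the load */
} regs_t;

/* Branching form: one of two paths is chosen by a conditional branch. */
static void step_branchy(regs_t *r)
{
    assert(r && r->r8);
    if (r->r1 == r->r2) {
        r->r1 = r->r2 + r->r4;      /* add r1 = r2, r4 */
    } else {
        r->r7 = *r->r8;             /* ld8 r7 = [r8], 8 */
        r->r8 += 1;                 /* post-increment by 8 bytes */
    }
}

/* Predicated form: the compare writes two complementary predicates,
   and each instruction is individually guarded by one of them. */
static void step_predicated(regs_t *r)
{
    assert(r && r->r8);
    int p1 = (r->r1 == r->r2);      /* cmp.eq p1, p2 = r1, r2 */
    int p2 = !p1;

    if (p1) r->r1 = r->r2 + r->r4;  /* (p1) add r1 = r2, r4 */
    if (p2) r->r7 = *r->r8;         /* (p2) ld8 r7 = [r8], 8 */
    if (p2) r->r8 += 1;             /* post-increment belongs to the load */
}
```

Both functions leave the register file in the same state for any input; the predicated form simply has no branch between the compare and the guarded instructions.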

  4. speculation
     • control speculation
       original:
         (p1) br.cond
         ld8 r1 = [ r2 ]
       transformed:
         ld8.s r1 = [ r2 ]
         . . .
         (p1) br.cond
         . . .
         chk.s r1, recovery
     • data speculation
       original:
         st4 [ r3 ] = r7
         ld8 r1 = [ r2 ]
       transformed:
         ld8.a r1 = [ r2 ]
         . . .
         st4 [ r3 ] = r7
         . . .
         chk.a r1, recovery
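
A rough C model of the data speculation transformation may help. It is a sketch under simplifying assumptions: both accesses use the same 8-byte type (the slide mixes st4 and ld8), the function names are hypothetical, and the address comparison stands in for the hardware check that `chk.a` performs.

```c
#include <assert.h>
#include <stdint.h>

/* Original order: the store comes first because the compiler cannot
   prove that the two addresses differ. */
static int64_t store_then_load(int64_t *st_addr, int64_t value,
                               const int64_t *ld_addr)
{
    assert(st_addr && ld_addr);
    *st_addr = value;               /* st  [r3] = r7 */
    return *ld_addr;                /* ld8 r1 = [r2] */
}

/* Transformed order: the load is hoisted above the store, then checked. */
static int64_t advanced_load(int64_t *st_addr, int64_t value,
                             const int64_t *ld_addr)
{
    assert(st_addr && ld_addr);
    int64_t r1 = *ld_addr;          /* ld8.a r1 = [r2]  (early load) */
    *st_addr = value;               /* st    [r3] = r7 */
    if (st_addr == ld_addr) {       /* chk.a r1, recovery */
        r1 = *ld_addr;              /* recovery: redo the load */
    }
    return r1;
}
```

When the addresses differ, the early load hides latency at no cost; when they collide, the recovery path restores the result of the original ordering.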

  5. modulo scheduling • overlapping execution of different loop iterations • [figure: three loop schedules compared: no software pipelining; traditional modulo scheduling (through unrolling); Itanium-based modulo scheduling (through register rotation and predication)]
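
The overlap of loop iterations can be illustrated with a hand-pipelined loop in C. This is a minimal sketch with hypothetical function names and an arbitrary three-stage loop body (load, multiply, store); a compiler would derive the schedule itself. The explicit prologue and epilogue shown here are what register rotation and predication let the Itanium-based form fold into the loop kernel.

```c
#include <assert.h>
#include <stddef.h>

/* No software pipelining: each iteration runs its stages back to back. */
static void scale_plain(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int loaded = in[i];          /* stage 1: load    */
        int scaled = loaded * 3;     /* stage 2: compute */
        out[i] = scaled;             /* stage 3: store   */
    }
}

/* Hand-written software pipeline: the kernel works on three
   different iterations at once, one per stage. */
static void scale_pipelined(const int *in, int *out, size_t n)
{
    if (n < 2) {                     /* too short to pipeline */
        scale_plain(in, out, n);
        return;
    }
    assert(in && out);

    /* prologue: fill the pipeline */
    int scaled = in[0] * 3;          /* stages 1-2 of iteration 0 */
    int loaded = in[1];              /* stage 1 of iteration 1    */

    /* kernel: steady state */
    for (size_t i = 2; i < n; i++) {
        out[i - 2] = scaled;         /* stage 3 of iteration i-2 */
        scaled = loaded * 3;         /* stage 2 of iteration i-1 */
        loaded = in[i];              /* stage 1 of iteration i   */
    }

    /* epilogue: drain the pipeline */
    out[n - 2] = scaled;
    out[n - 1] = loaded * 3;
}
```

Both functions produce identical output; only the order in which work from neighbouring iterations is interleaved changes.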

  6. hp-ux Itanium-based C++ development tools

  7. Overview of C++ Tools for HP-UX Open Source Third Party HP TogetherSoft ControlCenter Magic Draw Design Rational Rose Object Domain Eclipse VIM Visual SlickEdit CodeForge Edit Firebolt Softbench NetBeans XEMACS Bristol Tributary WindRiver SNiFF+ Compile C/ANSI C aC++ GCC G++ Build Make Make Clearmake Debug TotalView HP WDB GDB DDD Manage CVS RCS Clearcase Parasoft CodeWizard Parasoft Insure++ Analyze HP WDB Rational Purify Rational PureCoverage Optimize HP Prospect Rational Quantify HP Caliper Support DSPP

  8. performance with reliability and ease-of-use • testing strategies for bulletproof optimization: large apps, white box tests, random tests • hot code: aggressively optimized • debugger support for debug of optimized code • compiler quality designed in • cold code: lightly optimized, full debug support • caliper: transparent profiling, profile database

  9. hp Caliper 1.0 • hp Caliper is a suite of program analysis tools • release 1.0 contains three tools: • caliper/PMU: measure performance using Itanium-based PMU • caliper/PBO: generate feedback file for compiler PBO • caliper/gprof: get gprof-style information using PMU • developers can generate faster code with caliper/PBO and HP Itanium compilers • developers can measure performance on Itanium-based platforms with caliper/PMU and caliper/gprof

  10. hp Caliper 2.1 (for HP-UX 11i v1.6 and later) • Full support for measuring multi-process applications with output files for each process. • Identifying and selecting one, some or all processes for measurement. • Saving performance results to data files for report generation. • Support for attaching and detaching processes. • Limiting PMU measurements to specific code regions. • Improved cgprof accuracy. • Various reporting changes and improvements, including cumulative percentages in text reports and changes to report link-time addresses. • Improved shared memory handling, memory and performance savings.

  11. profile based optimization • a critical performance tool • application branching behavior is measured • this information is fed into the compiler to guide optimization • predication, speculation, code layout, code generation for switch statements, etc. • studies show that profiling nearly always pays off, even under slightly different workloads • new features for Itanium-based PBO • post-link & dynamic instrumentation • options and pragmas

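  12. collecting profile information • [diagram] PA-RISC or Itanium-based: compiler +I builds an instrumented application, which is run on training data sets to fill the profile database • Itanium-based: Caliper collects the profile from an application run on training data sets • compiler -O +P reads the profile database to build the optimized application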

  13. impact of profiling & optimization levels

  14. developer-specified profile information • #pragma estimated_frequency f • example:
        foo() {
          if( cond ) {
            #pragma estimated_frequency 0.8
            …
            for( …) {
              #pragma estimated_frequency 4.0
              …
            }
          }
        }
      • #pragma frequently_called symbol[,symbol]* • #pragma rarely_called symbol[,symbol]*

  15. Compiler Optimization

  16. levels of optimization • +O1 (default) • low-cost optimizations • instruction reordering • efficient instruction packing • no reordering of user-visible state updates • supports full debugging • performed under -g with no optimization explicitly specified • some limitations on where local variables may be modified from within the debugger

  17. levels of optimization • -O (+O2) performs intraprocedural optimization, plus user-directed inlining (C++ only) • +O3 performs interprocedural optimization within a source file • +O4 performs cross-module optimization within a load module • interprocedural optimization includes inlining and cross-module analysis • +Ofast provides a combination of options which are valid for most applications: -O +Olibcalls +Onolimit +Ofltacc=relaxed +FPD +DSnative +Oshortdata

  18. scope of compilation & optimization • nesting of scopes: application, load modules, compilation units, functions • -O, +O1, +O2: optimize functions, bind within compilation units • +O3: optimize & bind within compilation units • +O4: optimize & bind within load modules • compiler has no visibility across load modules • improve performance by restructuring: ensure that frequent call paths are within the same load module, and ideally within the same source file

  19. developer-guided optimization: inline assembly • semaphore operations _Asm_cmpxchg, _Asm_xchg, _Asm_fetchadd • memory management _Asm_lfetch, _Asm_fc • miscellaneous _Asm_popcnt, _Asm_mux1, _Asm_mux2 • plus many more • fully integrated at the source level • operand expressions and target lvalues • fully integrated into optimization phase
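
The slides name the semaphore intrinsics but do not give their signatures, so the sketch below does not call them. It uses standard C11 atomics as a stand-in to show the two operations that `_Asm_fetchadd` and `_Asm_cmpxchg` are named after (fetch-and-add and compare-and-exchange). The function names are hypothetical, and whether the HP intrinsics map onto these calls one-to-one is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>

/* Fetch-and-add: atomically bump a counter and return the old value. */
static long counter_next(atomic_long *counter)
{
    assert(counter);
    return atomic_fetch_add(counter, 1);
}

/* Compare-and-exchange: take the lock only if it is currently free (0).
   Returns nonzero on success. */
static int lock_try(atomic_int *lock)
{
    assert(lock);
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Release the lock with an atomic store. */
static void lock_release(atomic_int *lock)
{
    assert(lock);
    atomic_store(lock, 0);
}
```

Because such operations are expressed as ordinary source-level calls with operand expressions, the optimizer can schedule around them instead of treating them as opaque assembly.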

  20. developer-guided optimization: inline assembly • fences allow developer to constrain code motion • upward or downward • for specific instruction types (can specify more than one): • externally visible memory accesses • floating point operations • alu operations • system operations • call instructions • branch instructions • fences may be • specified as a standalone pseudo-assembly instruction • associated with an inline assembly instruction

  21. developer-guided optimization: if-conversion & loop unrolling • #pragma if_convert • example:
        foo() {
          for( ... ) {
            #pragma if_convert
            if( ... ) { ... }
            else if( ... ) { ... }
            else { ... }
          }
        }
      • #pragma unroll_factor

  22. developer-guided optimization • +O[no]store_ordering • preserves program order for stores to memory locations that are possibly visible to another thread • note that this does not imply strong ordering • appropriate when volatile semantics are not required • ensures that state is consistent on signals, context switch, etc

  23. overcoming performance limiters • the scope visible to the compiler determines the limits of optimization • the compiler must generally make conservative assumptions about • aliasing • which pointers may point to the same data • binding • in which load module a data or code reference will be resolved • exception behavior • floating point accuracy and precision requirements

  24. aliasing • make local copies while ( ... ) p->foo += ... • use high levels of optimization to increase compiler visibility
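
The "make local copies" advice can be sketched as follows. The structure and function names are hypothetical. In the first version the compiler may have to reload and store `p->foo` on every iteration, because a store through `p` could in principle change `data[]` and vice versa; the second version keeps the running value in a local variable that nothing else can alias.

```c
#include <assert.h>
#include <stddef.h>

struct acc { long foo; };

/* Updates p->foo through the pointer on every iteration. */
static void sum_through_pointer(struct acc *p, const long *data, size_t n)
{
    assert(p);
    for (size_t i = 0; i < n; i++)
        p->foo += data[i];
}

/* Same computation with a local copy: one load before the loop,
   one store after it. */
static void sum_with_local_copy(struct acc *p, const long *data, size_t n)
{
    assert(p);
    long foo = p->foo;
    for (size_t i = 0; i < n; i++)
        foo += data[i];
    p->foo = foo;
}
```

The two versions agree only when `data` does not overlap `p->foo`; making the local copy is the programmer's way of asserting that this is the case.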

  25. aliasing • +Otype_safety=[off|limited|ansi|strong] • asserts type safety within a compilation unit • pointers reference only their declared type except: • char * may point to anything (limited, ansi) • int fields of structs & unions may be referenced by an int * (limited, ansi) • unnamed objects are assumed to have unknown type (limited) • #pragma no_side_effects

  26. aliasing (cont) • +Onoparmsoverlap (Fortran-like semantics) • example:
        copy( char *s1, char *s2 ) {
          while ( *s2 != 0 )
            *s1++ = *s2++;
        }
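
At the source level, C99's `restrict` qualifier expresses a similar per-pointer promise that the parameters do not overlap. The sketch below is an illustration of that idea only; the slides do not say that the HP compilers treat `restrict` and +Onoparmsoverlap identically. The function name is hypothetical, and the terminating store is an addition that the slide's loop does not have.

```c
#include <assert.h>
#include <string.h>

/* The caller promises that the source and destination do not overlap,
   so each store through s1 cannot change what s2 will read next. */
static void copy_no_overlap(char *restrict s1, const char *restrict s2)
{
    assert(s1 && s2);
    while (*s2 != 0)
        *s1++ = *s2++;
    *s1 = 0;                /* terminate the destination (added here) */
}
```

Without such a promise the compiler has to assume that writing `*s1` may modify the string being read, which forces it to keep each load after the preceding store.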

  27. executables, shared libraries and symbol binding • shared libraries are fundamental to application architecture • but binding across shared libraries incurs additional cost • for any symbols not defined in the current compilation unit, the compiler must assume that they might be defined in a separate load module • data must be referenced through linkage table • function calls are indirect through the linkage table • and data pointer (gp) must be saved and restored around the call

  28. compiler binding classes • default • if defined in the same compilation unit • bind directly • otherwise • global data items indirect through the linkage table • gp saved around calls • direct call assumed; linker inserts stub if needed • extern (-Bextern, +Oextern) • always go through linkage table, even if defined locally (expected to be preempted) • import stub emitted inline • gp saved around calls

  29. compiler binding classes • protected (-Bprotected) • must be defined in the same load module (linker error if not) • global data items referenced directly • gp not saved around calls • direct call assumed • hidden (-Bhidden) • like protected, but not visible to other load modules

  30. compiler binding classes • specifying the target load module • -exec when building an executable (a.out) • specifying binding types • -B[no]extern, -Bprotected, -Bhidden • +dumpextern filename • example:
        % cc -Wl,+dumpextern extFile *.o
        % cc -exec -Bnoextern -Bextern:extFile *.o

  31. floating point accuracy: +Ofltacc • specifies the level of floating point accuracy required: • +Ofltacc=strict (also +Ofltacc): disallows any optimizations that may change result values • +Ofltacc=default: allows contractions • e.g. fused multiply-add ( a = b * c + d → a = fma(b,c,d) ) • +Ofltacc=limited (Itanium-based) allows optimizations which may affect the generation and propagation of NaNs and the sign of zero • e.g. x*0.0 → 0.0 • +Ofltacc=relaxed (also +Onofltacc): also allows optimizations (such as reordering of expressions) that may change rounding error • e.g. a = b * c * d * e → a = (b * c) * (d * e) • for C and C++, this option must be given to enable the sum reduction optimization
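
Why reordering needs an explicit opt-in can be seen with three additions. The sketch below assumes IEEE 754 double arithmetic evaluated in double precision and a compiler that does not reassociate on its own; the function names are hypothetical. It uses addition, as in a sum reduction, where the slide's example uses multiplication.

```c
#include <assert.h>

/* Evaluation order as written in the source. */
static double sum_in_source_order(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same three operands, reassociated. */
static double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}

/* Returns nonzero when the two orders give different results for
   operands of very different magnitude. */
static int reassociation_changes_result(void)
{
    const double big = 1.0e16, small = 1.0;
    double strict  = sum_in_source_order(big, -big, small);
    double relaxed = sum_reassociated(big, -big, small);

    assert(strict == 1.0);   /* (big - big) + 1 is exact */
    return strict != relaxed; /* small is lost when added to -big first */
}
```

A sum reduction splits one running total into several partial sums, which reassociates the additions in just this way, so it can only be applied when changes in rounding error are acceptable.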

  32. floating point exceptions & flags • by default, the compiler assumes that applications • do not rely on precise floating point exceptions • do not query the value of floating point flags • conservative behavior can be requested with +Ofenvaccess • #pragma FLOAT_TRAPS_ON is equivalent to +Ofenvaccess (the PA definition is a bit more conservative)

  33. other floating point options • +O[no]cxlimitedrange • equivalent to STDC CX_LIMITED_RANGE pragma • default is +Onocxlimitedrange • -fpeval=[float|double|extended] • -fpwidetypes • +O[no]libmerrno

  34. data and pointer sizes • hp-ux supports both 32-bit and 64-bit data models • on both PA-RISC and Itanium-based • +DD32, +DD64 • in general, the 32-bit data model is more efficient • see “64-Bit Application Development for PA-RISC & Itanium” under “64-Bit Computing” on http://devresource.hp.com/devresource/Topics/Porting/Port.html • smaller data structures are generally more efficient • reduce size of Booleans and enums within objects • on Itanium-based solutions, for applications with less than 4 MB of global data, the +Oshortdata option improves performance • default with +Ofast on Itanium-based
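
The advice to shrink Booleans and enums inside objects can be sketched with two layouts of the same record. All type and field names are hypothetical, and the exact sizes depend on the ABI, so the code compares sizes instead of assuming particular values.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, GREEN, BLUE };

/* Flags and the enum each occupy a full int-sized slot. */
struct wide_record {
    int        is_valid;
    enum color color;
    int        is_dirty;
    double     value;
};

/* The same information with one byte per small field. */
struct compact_record {
    unsigned char is_valid;
    unsigned char color;     /* holds an enum color value */
    unsigned char is_dirty;
    double        value;
};

/* Bytes saved when an array of n records uses the compact layout. */
static size_t bytes_saved(size_t n)
{
    assert(sizeof(struct compact_record) <= sizeof(struct wide_record));
    return n * (sizeof(struct wide_record) - sizeof(struct compact_record));
}
```

Smaller records mean more of them fit in each cache line, which is the main reason the compact layout tends to be faster for large arrays.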

  35. target-specific optimization • by default, the compiler will generate code which will run well on all current platforms • the +DS option specifies a target implementation • Itanium-based: +DSblended (default), +DSitanium, … • +DS options do not affect compatibility with older systems

  36. constants • in earlier PA-RISC compiler versions, constants and string literals were generally placed in process-private data • this is less efficient than placing them in read-only data, and prevents sharing • accommodates developers who modify string literals • in the latest PA-RISC C and C++ compiler releases, and on Itanium • +Olit=const is default for C • +Olit=all is default for C++ • the difference is limited to the treatment of string literals • for +Olit=const they are only placed in read-only data if they are in a context where const char * would be legal
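
Code that needs a modifiable string can stay compatible with read-only literals by copying the literal first. The sketch below uses hypothetical function names and an arbitrary literal; it shows the pattern only and does not depend on any particular +Olit setting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The literal is only ever reached through const char *, so the
   compiler is free to place it in shared read-only data. */
static const char *default_name(void)
{
    return "itanium";
}

/* Copy the literal into caller-owned writable storage before changing
   it, instead of writing through a pointer to the literal itself. */
static void capitalized_name(char *out, size_t out_size)
{
    const char *src = default_name();

    assert(out && out_size > strlen(src));
    strcpy(out, src);
    if (out[0] >= 'a' && out[0] <= 'z')
        out[0] = (char)(out[0] - 'a' + 'A');
}
```

Writing through a pointer to the literal itself would fail at run time once the literal lives in read-only data, which is why older compilers kept literals in process-private writable data.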

  37. volatile • the C compiler supports four new type qualifiers to modify the volatile qualifier • __unordered • strong ordering is not required • __side_effect_free • (e.g. not to I/O space): the compiler is free to remove redundant references and/or to speculate loads • __synchronous • the value is not updated by another thread • __non_sequential • sequentiality need not be maintained relative to other memory references • the most important is __unordered

  38. Compatibility

  39. Statement of Source Compatibility Between PA-RISC and Itanium-based • As of HP-UX Release 11i v1.6 (Itanium-based), ISV and customer applications* that are supported on HP-UX 11i v1 on PA-RISC will compile and execute correctly on Itanium with no changes to the source code. * Applications must be well behaved and free of explicit dependencies on the PA-RISC architecture. See the following URL for a definition of a well-behaved application: http://devresource.hp.com/STK/hpux11i/exceptions.html

  40. Ensuring a Smooth Migration to Itanium • The C++ compilers for PA-RISC and Itanium-based platforms share common front-end source code. • HP-UX header files and system APIs are based on shared source code for PA-RISC and Itanium. • HP-UX supports 32-bit applications on Itanium, so applications do not need to port to 64-bit. • All HP-UX compilers and libraries must undergo compatibility testing. • Incompatibilities are subject to a thorough review process, and are allowed only when necessary. Once allowed, they are documented as compatibility exceptions. • The Software Transition Kit (STK) can be used to scan an application's source code to look for portability problems. • The HP-UX operating system, its commands and libraries, and dozens of ISV applications, comprising tens of millions of lines of source code, have been compiled with the new Itanium-based compilers. Incompatibilities not already identified as exceptions have been treated as defects and fixed.

  41. Compatibility Exceptions • K&R C is no longer supported. (A legacy C compiler is provided to minimize the impact of this change.) • Convex parallelization pragmas and library functions are no longer supported. (OpenMP is supported on both platforms as a replacement.) • Architecture-specific code, options, and pragmas must be modified for Itanium. These include PA-RISC assembly code, inline assembly operations, options and pragmas used for tuning code for the PA-RISC architecture and runtime, and calls to any system APIs that are supported only on PA-RISC. • The #pragma HP_ALIGN is no longer supported. (#pragma pack, which is common to GNU and Sun compilers, should be used instead.) • Floating-point operations may produce slightly different results (Itanium-based will usually give greater accuracy), and applications may observe differences in the treatment of NaNs, denorms, infinities, signed zeroes, exceptions, and flush-to-zero. • The use of any third-party library is subject to the availability and support of that library on Itanium and HP-UX 11i v2. Native Itanium-based code and compatibility-mode PA-RISC code cannot be mixed within a single program.

  42. summary • hp-ux Itanium-based compilers provide • performance • reliability • usability
