Topics

The challenge of migration: desktop to handheld
Phil Atkin, Product Manager 3D Graphics
September 2004

Topics Overview • Definitions • What does ‘desktop’ mean? • What does ‘handheld’ mean? • Challenges • Management of 3D resources • Management of CPU resources • Case study • Realities of porting a desktop 3D framework to handheld • Demonstrations (Intel / Intrinsyc Carbonado) • Performance (PowerBook vs. Carbonado) • Conclusions

Desktop vs. handheld systems • Desktop system • CPU + GPU + 3D API • Powerful - 1GHz up to >3GHz CPU with SIMD floating-point • Big caches • Minimum ‘Free3D’ chipset • Maximum GeForce 6800 / Radeon X800 • OpenGL 1.5 transitioning to OpenGL 2.0 • Handheld system (PowerVR 3D) • CPU + GPU + 3D API • CPU ranges from 100MHz to 500+MHz • Small caches • CPU may or may not have FP capability • Minimum MBX Lite no VGP - 1M tris, 100M pixels • Maximum MBX VGP - 4M tris, 350M pixels, free AA • OpenGL ES 1.0 transitioning to OpenGL ES 1.1

Handheld 3D • Delivering accelerated handheld 3D is all about power management • All chip vendors have access to similar process technologies • Leads to similar power / MHz • Leads to similar performance / mW • All system vendors have access to similar battery technologies • Leads to similar ‘talk time / game-time’ per recharge • Some architectures have clear power/performance advantages • Tile-based rendering, on-die framebuffers - minimize data passing between chips • These factors lead to a relatively narrow spectrum of capabilities • Low-end and high-end systems only differ by 3-4x • Admittedly PowerVR sets a high baseline, but the generalization holds

Observations • Even low-end handheld 3D accelerators will offer excellent performance • On par with 2nd / 3rd generation desktop accelerators • Efficient API is in place and standardized • Hence the path from the driver to the hardware is sorted - but … • What about the path from the application to the driver? • How to structure application code to keep hardware busy? • Despite relatively narrow spectrum of 3D capabilities • Potential for extremely large disparity between systems • Floating point-less CPU, rasterizer-only 3D • Very high performance CPU / FPU, vertex-programmable 3D • How to develop or port with such a spread of computational capabilities?

The challenge • Management of 3D capabilities is not the challenge • The usual techniques learned in the desktop space can be used • Resolution / triangle count / texture filtering / AA quality • Management of CPU resources is the challenge • Lowering vertex counts to GPU will inherently lower CPU load • But the problem is far bigger in scope than just this • The data type float is essentially unavailable at the low end • Platform CPUs have such diverse capabilities - either • Stratify in software, code explicitly to each market stratum • Or code in a floating-point agnostic manner • The latter is achievable and allows a single code base across platforms

Why bother porting to an FPU-less platform? • Consider the following 3 likely classes of handheld device • Class A • High-performance CPU, FPU, GPU with vertex processing • Class B • High-performance CPU, GPU with vertex processing • Class C • CPU, rasterizer • Classes B and C will likely be smaller die, lower cost • Will likely ship in higher volumes • If so - • will offer more revenue opportunities for software vendors • yet platforms do not have floating-point capability • But a Class A device may win out • Software vendors must cover all the bases to guarantee success

Why not just make everything fixed point? • Because your desktop platform • Will be faster in floating-point • Does not have fixed-point OpenGL ES entrypoints! • If you really need • The same code base to run on desktop and handheld • High performance on all classes of handheld systems • You need to abstract out your numeric format • C++ class, build-time switchable from 16.16 to float
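A minimal sketch of the "abstract out your numeric format" idea, assuming a build flag and class names that are not in the slides (`REAL_IS_FIXED`, `FixedReal`, `FloatReal`, `Real`, `lerp` are all hypothetical). It is not the MSG implementation, just one way a build-time switch between 16.16 and float could look:

```cpp
#include <cstdint>

// 16.16 fixed-point flavour: 16 integer bits, 16 fractional bits.
class FixedReal {
public:
    FixedReal() : v_(0) {}
    explicit FixedReal(int i) : v_(static_cast<int32_t>(i) * 65536) {}
    explicit FixedReal(float f) : v_(static_cast<int32_t>(f * 65536.0f)) {}
    static FixedReal fromRaw(int32_t raw) { FixedReal r; r.v_ = raw; return r; }
    int32_t raw() const { return v_; }
    float toFloat() const { return static_cast<float>(v_) / 65536.0f; }

    FixedReal operator+(FixedReal o) const { return fromRaw(v_ + o.v_); }
    FixedReal operator-(FixedReal o) const { return fromRaw(v_ - o.v_); }
    // The 64-bit intermediate keeps the full product before rescaling.
    FixedReal operator*(FixedReal o) const {
        return fromRaw(static_cast<int32_t>((static_cast<int64_t>(v_) * o.v_) / 65536));
    }
    // Caller must guarantee a non-zero divisor.
    FixedReal operator/(FixedReal o) const {
        return fromRaw(static_cast<int32_t>((static_cast<int64_t>(v_) * 65536) / o.v_));
    }
private:
    int32_t v_;
};

// Float flavour with the same interface, for desktop builds.
class FloatReal {
public:
    FloatReal() : v_(0.0f) {}
    explicit FloatReal(int i) : v_(static_cast<float>(i)) {}
    explicit FloatReal(float f) : v_(f) {}
    float toFloat() const { return v_; }

    FloatReal operator+(FloatReal o) const { return FloatReal(v_ + o.v_); }
    FloatReal operator-(FloatReal o) const { return FloatReal(v_ - o.v_); }
    FloatReal operator*(FloatReal o) const { return FloatReal(v_ * o.v_); }
    FloatReal operator/(FloatReal o) const { return FloatReal(v_ / o.v_); }
private:
    float v_;
};

// Hypothetical build switch: define REAL_IS_FIXED for FPU-less targets.
#ifdef REAL_IS_FIXED
typedef FixedReal Real;
#else
typedef FloatReal Real;
#endif

// Application code written against Real compiles unchanged in either mode.
inline Real lerp(Real a, Real b, Real t) { return a + (b - a) * t; }
```

Because both flavours can be instantiated in the same desktop build, the fixed-point path can be unit-tested long before a handheld is involved.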

Porting desktop software - 4 step program • Observations • Debugging on a handheld is no fun • The porting process needs to be derisked as much as possible • Strive to get as close as possible to the handheld codebase without leaving the desktop • Code extremely defensively - make no assumptions regarding performance • ‘Portification’ • Yes, I know it’s not a real word… • The process of preparing for the port without actually executing on it • Step 1 - implement the abstracted real number class • Step 2 - portify 3D code • Step 3 - portify application code • Step 4 - do the port

Step 1 - implement real number class • C++ operators for +-*/ and type conversion • Note ARM does not have a divide instruction • Recommendation - normalize / reciprocate / multiply / denormalize • ARM does have a normalize instruction - CLZ • Functions for common but expensive operations • E.g. implement your own sqrt and trig • Why - because you may wish to sidestep glRotate() etc. • These functions will of course work in fixed or float • Hence testability on desktop is high and immediate

Step 2 - portify 3D code • Isolate your 3D code if not already done • Minimize #include <gl/gl.h> • Modify 3D code so it is OpenGL / OpenGL ES agnostic • Modify it so it is floating point / fixed point agnostic • And obviously modify your data too • Make your world representable by 16.16
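One way to check that world data is representable by 16.16 is to convert it on the desktop and fail loudly when a value falls outside the format. The sketch below is an illustration only: `Fixed16_16` is a stand-in for the 32-bit fixed-point type OpenGL ES uses, and `convertToFixed` is a hypothetical helper, not part of any API:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Stand-in for a 16.16 fixed-point value: range [-32768, 32768), step 1/65536.
typedef int32_t Fixed16_16;

// Converts float vertex data to 16.16.
// Returns false (leaving 'out' partially filled) if any value is out of range or NaN.
inline bool convertToFixed(const std::vector<float>& in, std::vector<Fixed16_16>& out) {
    out.clear();
    out.reserve(in.size());
    for (float v : in) {
        if (!(v >= -32768.0f && v < 32768.0f)) {
            return false;  // this coordinate cannot be stored in 16.16
        }
        out.push_back(static_cast<Fixed16_16>(std::lround(v * 65536.0f)));
    }
    return true;
}
```

Running such a check over every model at build time catches data that needs rescaling before it ever reaches the handheld. Intermediate results of transforms need headroom too, so a tighter range than the format limit may be appropriate.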

Step 3 - portify application code • Work out what maths absolutely must be floating-point • Replace everything else with real number class • But be really careful - for example • Really common case - distance between 2 points - Pythagoras • Squaring those numbers will blow up for almost all cases • Code defensively - implement a ‘radius’ function that will not blow up • OK, you could keep this example as floats • But floats are so very expensive without FPU • It’s a common operation, and it’s easy to get it right in fixed-point • Remember - conservation of CPU cycles is the challenge • The hardware developers and Khronos have taken care of the 3D • CPU cycles are precious, conserve them

Step 4 - port to the handheld platform • This step is really easy if the last 3 went well ... • Take cross-compiler • Turn on all the #ifdefs you prepared earlier • Type ‘make’ • Or under Embedded Visual C++ hit F7 • It will just work. Trust me, it will.

Case study - the Mobile Scene Graph • Framework for 3D applications • Initial implementation - desktop • Interactive landscape, architecture and garden design review • Straightforward design • Classic app + cull + draw, frustum culling • C++, STL, polymorphic, RTTI • Target platform PowerBook G3 500MHz / OpenGL / glut • Transitioned into • Desktop - interactive landscape, architecture and garden design review • Handheld - experimental testbed for OpenGL ES rendering • Target platforms • PowerBook G3 500MHz / OpenGL 1.4 / glut • Intel / Intrinsyc Carbonado / OpenGL ES 1.0 / egl • Great opportunity to take on a port • Aiming for 100% application source code compatibility • Aiming to deliver highest possible performance on desktop and handheld

MSG Implementation details • ‘MSGReal’ • Build-time switchable float or OpenGL ES 16.16 fixed point • C++ operators provide +-*/ and common type conversions • Functions provide trig, sqrt / recipsqrt • All expensive operations implemented by piecewise quadratics • Additional 4.12 ‘MSGShortFix’ type • Intermediate product fits into 32 bits, no double-length maths • Superbright unclamped colour accumulation • Reflection-mapping via quadratic approximation without overflow • Only 2 internal functions use floating-point • Plane fitter for frustum construction • Determinant calculation in matrix inverter
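The point about the 4.12 type can be illustrated with a short sketch. This is a guess at the idea, not the MSG code: `ShortFix`, `shortFixFromFloat` and `shortFixMul` are hypothetical names. Two 16-bit operands multiply to at most 2^30 in magnitude, so the intermediate product fits in an ordinary 32-bit register:

```cpp
#include <cstdint>

// 4.12 fixed point: range [-8, 8), step 1/4096, stored in 16 bits.
typedef int16_t ShortFix;

inline ShortFix shortFixFromFloat(float f) {
    return static_cast<ShortFix>(f * 4096.0f);
}

// Multiply two 4.12 values. The caller must ensure the result stays within [-8, 8).
inline ShortFix shortFixMul(ShortFix a, ShortFix b) {
    // 16-bit x 16-bit product: no 64-bit (double-length) arithmetic required.
    const int32_t product = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    return static_cast<ShortFix>(product / 4096);  // rescale from 8.24 back to 4.12
}
```

The three integer bits above 1.0 are presumably what allow colour values to accumulate past full brightness without clamping at each step.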

Porting realities - timescales • Approximately 3 man-months of portification • Difficult to measure accurately • Coding was in progress as portification began • Approximately 20,000 lines of code • Only 800 lines can see <gl/gl.h> • Just 8 #ifdefs in this module • i.e. if this is representative, the portification process is manageable • 2 evening porting sessions • Just 6 hours at the desk from ‘move code onto PC’ to ‘run on handheld’ • … and one evening should have been enough • Then performance tuning • Anticipated >30Hz was only 15-20Hz • Now tuned up to >40Hz with no change in geometric load

Porting realities - gotchas • Handheld specific • Performance not linear with clock for a variety of reasons • e.g. caching behaviour, driver behaviour, architectural • Limited container class and template support • Some C++ operations will hurt more than you expect • Very slow RTTI • STL list operations sort(), push_back(), pop_front() proved surprisingly expensive • 3D gotchas • Unanticipated differences in behaviour • E.g. multiple strips from single pointer setup – multiple TnL on Carbonado • Would benefit from glMultiDrawElements • Short tristrip performance • Would benefit from glMultiDrawElements!! • Best performance - glDrawElements(GL_TRIANGLES) • Fixed-point to integer conversion in OpenGL ES interface
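Since indexed triangle lists performed best, one plausible preparation step is to merge many short strips into a single index list on the desktop. The sketch below assumes 16-bit indices and a hypothetical helper name (`appendStripAsTriangles`). It illustrates the conversion only and makes no claim about how MSG builds its index buffers:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the triangles of one triangle strip to an indexed triangle list.
// Every second triangle has its first two indices swapped so that the
// winding order stays consistent, and degenerate triangles are dropped.
inline void appendStripAsTriangles(const std::vector<uint16_t>& strip,
                                   std::vector<uint16_t>& triangles) {
    for (std::size_t i = 2; i < strip.size(); ++i) {
        uint16_t a = strip[i - 2];
        uint16_t b = strip[i - 1];
        const uint16_t c = strip[i];
        if (a == b || b == c || a == c) {
            continue;  // degenerate triangle, contributes nothing
        }
        if (i & 1) {   // odd-numbered triangle in the strip: flip winding
            const uint16_t tmp = a;
            a = b;
            b = tmp;
        }
        triangles.push_back(a);
        triangles.push_back(b);
        triangles.push_back(c);
    }
}
```

Calling this once per strip with the same output vector yields one list that can be submitted with a single glDrawElements call in GL_TRIANGLES mode, at the cost of a larger index buffer.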

Demonstrations • MSGRefMap - arithmetic performance test • Single object, reflection mapped • Cull time virtually zero • Virtually all cycles spent in reflection-map code • This is fixed-point on all platforms • 16-bit skybox textures • MSGHurricane - frustum-culling test • 2048 objects in hierarchical terrain • unlit, 8-bit luminance texture • 7 animated aircraft • lit with 2 lights • 16-bit aircraft texture • 16-bit skybox textures

Performance
• MSGRefMap
  • PowerBook floating point
    • OpenGL renderer - 116 Hz
    • NULL renderer - 1360 Hz
  • PowerBook fixed point
    • NULL renderer - 1620 Hz
  • Carbonado fixed point
    • OpenGL ES renderer - 35.9 Hz
    • NULL renderer - 668.4 Hz
  • Carbonado floating point
    • NULL renderer - 101.2 Hz
• MSGHurricane
  • PowerBook floating point
    • OpenGL renderer - 122 Hz
    • NULL renderer - 1890 Hz
  • PowerBook fixed point
    • NULL renderer - 960 Hz
  • Carbonado fixed point
    • OpenGL ES renderer - 34.6 Hz
    • NULL renderer - 271.5 Hz
  • Carbonado floating point
    • NULL renderer - 46.25 Hz
• Fixed-point code averages 6x faster than FP emulation
  • Despite data structure traversal and other non-arithmetic code
  • Despite fixed point reflection-mapping code in floating point version
• This is a fast CPU, yet it is too slow in FP emulation running MSGHurricane

Last word on performance • The missing case - • Floating point application code • Fixed point framework / middleware • Estimated by isolating application cycles on Carbonado • Time spent in application = 11% of frame time (NULL renderer) • MSGHurricane • Fixed point frame time = 0.0037 sec • Floating point frame time = 0.021 sec • Mixed-mode frame = (89% * 0.0037) + (11% * 0.021) = 0.011 sec • Estimated 88Hz mixed-mode rate • Within 33 ms budget • But scale processor back to 150MHz and it becomes too slow again • And this is just a demo - just splines, no physics, no gameplay • Floating-point emulation is just too slow for even the simplest case

Conclusions • The software migration process can be relatively painless • Source code should be ‘portified’ - i.e. made • 3D API agnostic • Isolate and encapsulate your 3D API interactions • Structure desktop code to be OpenGL ES friendly • Floating point agnostic • Abstract out your real number format • At minimum in middleware layer • Ideally allow fixed-point from application down to hardware • You can do all this from the safety of your workstation • No handheld platform debugging until project is mature • MSG ported to Carbonado in 2 evenings with just printf • And if you get it right • It will just port and just work - but may require some tuning • Performance will be high across platforms • Resulting software will be highly portable and reusable
