This presentation by Philip Wells and Guri Sohi at HPCA 2008 explores the challenges and implications of serializing instructions (SIs) in system-intensive workloads, emphasizing their significant impact on operating system performance. The talk covers the complexities of SIs, their frequency across different instruction set architectures (ISAs), and introduces the concept of effectively useless (EU) writes, which allows processors to execute out-of-order (OoO) while managing dependencies. The findings reveal that SIs can reduce OS performance by up to 45%, but effective management can improve performance by up to 35%.
Serializing Instructions in System Intensive Workloads (Amdahl’s Law Strikes Again) Philip Wells and Guri Sohi {pwells, sohi}@cs.wisc.edu HPCA Feb, 2008
Serializing instructions overview • Serializing instructions (SIs) have complex dependencies • Difficult to execute OoO, often serialize the pipeline • E.g. writes to control registers • SIs frequent in OS code, across ISAs • Reduce OS performance by 8-45% • Values produced by SIs are often effectively useless (EU) • EU prediction allows consumers to proceed • May read stale value, but execute correctly • Improves OS performance by 6-35% Philip Wells - HPCA 2008
Talk outline • Serializing instructions • Description, implementation & performance • Characterization • Frequency across 3 ISAs • Useful consumption • Effectively useless prediction • Overview & operation • Performance results • Summary
What are SIs? • Talk focus: Writes to non-renamed control registers • e.g. explicit writes, exceptions & returns • Not renamed due to complex dependencies • Read by control logic at many pipeline stages • Difficult to execute OoO • Most processors serialize pipeline • Discussion of real implementations in paper [Figure: %pstate register fields (IG, PRIV, MG, CLE, TLE, MM, RED, PEF, AM, IE, AG) alongside the Fetch, Decode, Execute and Commit pipeline stages]
Effects of Amdahl’s law [Figure: execution of OS code (Ideal SPARC); fetch stall on SI (% of cycles)]
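The Amdahl's law relation behind this slide's title can be sketched as follows. This is an illustrative calculation, not taken from the talk: the function name `overall_speedup` and the 40% OS fraction in the example are assumptions chosen only to show how a speedup confined to OS code translates into a smaller whole-workload speedup.

```python
def overall_speedup(os_fraction: float, os_speedup: float) -> float:
    """Amdahl's law: only the OS fraction of execution time is sped up."""
    if not 0.0 <= os_fraction <= 1.0:
        raise ValueError("os_fraction must be in [0, 1]")
    if os_speedup <= 0.0:
        raise ValueError("os_speedup must be positive")
    # Time outside the OS is unchanged; time inside the OS shrinks by os_speedup.
    return 1.0 / ((1.0 - os_fraction) + os_fraction / os_speedup)


# Hypothetical workload: 40% of cycles in OS code, OS code runs 1.25x faster.
print(round(overall_speedup(0.40, 1.25), 3))
```

The larger the share of time a workload spends in the OS, the closer the overall gain gets to the OS-only gain, which is why SIs matter most for system-intensive applications.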
SI discussion • Have received little research attention • Mostly affect OS code • Largely absent in SPEC or short traces • Viewed as specific to a particular implementation • Our characterization shows that • SIs are important for system-intensive apps • Characterization similar across multiple ISAs • Implementations similar across multiple processors
Outline • Serializing instructions • Characterization • Frequency across 3 ISAs • Useful consumption • Effectively useless prediction • Summary
Characterization of SIs • Methodology • Several commercial workloads • SPARC, X86-64 & PowerPC platforms on Simics • ‘Normal’ SPARC: with register window and TLB traps • ‘Ideal’ SPARC: reg win traps removed & HW-fill TLB • Uniprocessor systems • Details in paper
SI frequency • Frequent across ISAs • Similar profile & dominated by register writes • Frequent exceptions in normal SPARC [Figure: SI frequency for Ideal SPARC, X86, PowerPC and ‘Normal’ SPARC]
Effectively useless (EU) writes • Many non-renamed register writes are EU • Produce a new value • Consumers read the value • But their execution is unaffected
EU characterization • Most values are quickly consumed, but not useful to the first consumers • 30% of writes are consumed by the next instruction • < 20% of writes are useful within 1023 instructions • Zeus on Ideal SPARC, implicit consumers only [Figure: plot with a "Dyn Dead" [Butts & Sohi ‘02] label]
Why effectively useless? • Control registers have many fields • SIs write entire register, decode stage must serialize • But often only update one field • Turn off interrupts (from Solaris 9):
    rdpr %pstate, %o5
    andn %o5, 2, %o4
    wrpr %o4, 0, %pstate    ! Serializing instr!
• EU subsumes both • Dynamically dead [Butts ‘02] & silent writes [Lepak ‘00]
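The read-modify-write on this slide can be mirrored in a few lines to show why a whole-register write often changes only one field, or nothing at all. This is a minimal sketch: the mask value 2 comes from the `andn` on the slide, while the name `PSTATE_IE`, the helper names and the sample register value are assumptions made for illustration.

```python
PSTATE_IE = 0x2  # mask used by `andn %o5, 2, %o4`; assumed to be the interrupt-enable field


def disable_interrupts(pstate: int) -> int:
    """Mirror of rdpr / andn / wrpr: clear one field, keep every other bit."""
    return pstate & ~PSTATE_IE


def changed_bits(old: int, new: int) -> int:
    """Bits that differ between the old and new register values."""
    return old ^ new


def is_silent(old: int, new: int) -> bool:
    """A silent write stores the value the register already holds."""
    return old == new


old = 0x16                      # hypothetical %pstate value with the IE bit set
new = disable_interrupts(old)   # the full register is rewritten...
print(hex(changed_bits(old, new)))                   # ...but only one field differs
print(is_silent(new, disable_interrupts(new)))       # repeating the write changes nothing
```

Because the hardware sees a write to the entire register, it cannot tell at decode that only this one field moved, which is the motivation for predicting effectively useless writes.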
Outline • Serializing instructions • Characterization • Effectively useless prediction • Overview & operation • Performance results • Summary
Effectively useless prediction • Goals • Allow EU writes and consumers to execute OoO • Few changes to pipeline & datapath • Easy test to ensure consumers execute correctly • Overview • Allow consumers to proceed under certain conditions • Guarantee non-faulting consumers execute correctly
EU prediction operation • EU Prediction Table • Was this write EU last time? • Outstanding Write Table • Status of writes to each control reg • SIs (decode): 1) Make EU prediction 2) Update status • Consumers (decode): 1) Check each control reg 2) Proceed if all writes are EU (may read stale value) • Consumer exception: 1) Squash if proceeded past EU write • SIs (commit): 1) Check for useful changes 2) Squash younger instr if useful cons 3) Update status & EU prediction table [Figure: pipeline stages Fetch, Decode, Issue, Execute, Write Back, Commit; table contents with a "Write PC" label, columns P, B, C, WritePtr and rows for pstate and fprs]
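The operation described on this slide can be sketched as a small software model. It is a simplification under stated assumptions, not the hardware design from the paper: the class `EUPredictor` and its method names are hypothetical, the prediction table is assumed to be indexed by the PC of the write, the outstanding write table is reduced to a per-register list of predictions, writes are assumed to commit in order, and a squash is signalled whenever a write predicted EU turns out useful (without tracking whether a consumer actually proceeded).

```python
from dataclasses import dataclass, field


@dataclass
class EUPredictor:
    # Write PC -> "was this write EU last time?" (stand-in for the EU prediction table).
    eu_table: dict = field(default_factory=dict)
    # Control register -> predictions for its in-flight writes, oldest first
    # (simplified stand-in for the outstanding write table).
    outstanding: dict = field(default_factory=dict)

    def decode_write(self, pc: int, reg: str) -> bool:
        """SI at decode: make an EU prediction and record the outstanding write."""
        predicted_eu = self.eu_table.get(pc, False)  # unseen writes are assumed useful
        self.outstanding.setdefault(reg, []).append(predicted_eu)
        return predicted_eu

    def consumer_may_proceed(self, regs) -> bool:
        """Consumer: proceed only if every outstanding write it depends on is predicted EU."""
        return all(eu for reg in regs for eu in self.outstanding.get(reg, []))

    def commit_write(self, pc: int, reg: str, useful_change: bool) -> bool:
        """SI at commit: retire the oldest write to reg and train the table.

        Returns True when younger instructions must be squashed.
        """
        predicted_eu = self.outstanding[reg].pop(0)  # assumes in-order commit
        self.eu_table[pc] = not useful_change
        return predicted_eu and useful_change


predictor = EUPredictor()
pc = 0x1000  # hypothetical PC of a wrpr to %pstate
predictor.decode_write(pc, "pstate")
print(predictor.consumer_may_proceed(["pstate"]))   # first encounter: consumers wait
predictor.commit_write(pc, "pstate", useful_change=False)
predictor.decode_write(pc, "pstate")
print(predictor.consumer_may_proceed(["pstate"]))   # trained: consumers may read a stale value
```

Defaulting unseen writes to "useful" keeps the first encounter as conservative as the serializing baseline; only writes that were EU last time let consumers run ahead.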
What are useful changes? • Useful unless: 1) The write is silent (~14%) 2) Change will only affect faulting instructions (~65%) • Setting FEF field of %fprs to one • Interrupt example earlier • Several other common cases • Overly conservative • But captures most common cases • Satisfies goal of simple test
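The test on this slide can be sketched as a bit-mask check. Everything named here is illustrative: the tables `BENIGN_SET` and `BENIGN_CLEAR` and the function `is_useful_change` are hypothetical, the FEF mask (0x4) is an assumed bit position, the mask 2 for the interrupt field is taken from the earlier `andn` example, and the "several other common cases" the slide mentions are not modeled.

```python
# Hypothetical per-register masks of transitions that, per the slide, only
# affect instructions that would fault when run with the stale value.
BENIGN_SET = {"fprs": 0x4}      # setting FEF to one (bit position assumed)
BENIGN_CLEAR = {"pstate": 0x2}  # turning interrupts off (mask from the andn example)


def is_useful_change(reg: str, old: int, new: int) -> bool:
    """Return False when a committed write can be treated as effectively useless."""
    if old == new:
        return False  # 1) silent write
    set_bits = new & ~old      # bits going 0 -> 1
    cleared_bits = old & ~new  # bits going 1 -> 0
    # 2) every changed bit must be a transition listed as benign for this register
    only_benign = (
        (set_bits & ~BENIGN_SET.get(reg, 0)) == 0
        and (cleared_bits & ~BENIGN_CLEAR.get(reg, 0)) == 0
    )
    return not only_benign


print(is_useful_change("pstate", 0x16, 0x14))  # interrupts turned off
print(is_useful_change("pstate", 0x14, 0x16))  # interrupts turned back on
print(is_useful_change("fprs", 0x0, 0x4))      # FEF set to one
```

Any change outside the listed transitions is treated as useful, which matches the slide's point that the test is overly conservative but simple.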
EU prediction methodology • OoO processor • 128-entry instr. window • 15 stage pipe • 32kB L1I/D, 1MB L2 • 265-cycle main mem • Simics MAI as a dynamic trace generator • Adapts to changes due to timing • Faithfully models wrong-path events • Ideal SPARC • Details in paper
EU prediction results [Figure: OS speedup and overall speedup]
Also in the paper • More characterization & results • Useless TLB writes • EU prediction accuracy • Large window processor • Two other ‘baseline’ implementations • Scoreboard • LateSquash • Discussion of SIs in real implementations: • Pentium M, Alpha 21264, PowerPC 750, UltraSPARC IIICu
Summary • Present first analysis of serializing instructions • Frequent across three ISAs • Limit OoO parallelism in OS code • Rival impact of L2 misses (8-45% for OS) • Many SI writes are effectively useless (EU) • Propose EU prediction • Predict writers and consumers can execute OoO • May read stale value, but execute properly anyway • 6-35% OS improvement (2-12% overall) • Not a panacea, but simple and works fairly well
Thank you! Questions, comments: pwells@cs.wisc.edu http://www.cs.wisc.edu/~pwells
Other SI implementations • Reminder: • Baseline blocks all younger instructions after SI • Technique 1: “Scoreboard” • Track outstanding SI writes (similar to OWT) • Determine which stage to block consumers • Identify independent instructions • Technique 2: “LateSquash” • Instructions following SI enter pipeline, execute OoO • Squashed just before SI executes
Why not value prediction? • Last value prediction for non-renamed registers • Can be modified to accurately predict many values • Can avoid serializing all non-renamed regs (not just EU) • Requires predicted value to be sent to every stage where it might be used • Avoiding this is the reason SIs exist in the first place
Explicit vs. implicit consumers • Explicit consumers • Name their operands & use them at execute stage • Implicit consumers • Don’t name them & use values at a variety of pipeline stages • Are the reason writes to non-renamed regs serialize
    rdpr %pstate, %o5        ! Explicit consumer of %pstate
    andn %o5, 2, %o4
    wrpr %o4, 0, %pstate     ! SI
    brnz %o1, 0x5ca8         ! Implicit consumers of %pstate
    sethi %hi(0x140), %o3
    …