
Serializing Instructions in System Intensive Workloads

(Amdahl’s Law Strikes Again)
Philip Wells and Guri Sohi, {pwells, sohi}@cs.wisc.edu
HPCA, Feb. 2008

Presentation Transcript


  1. Serializing Instructions in System Intensive Workloads (Amdahl’s Law Strikes Again)
     Philip Wells and Guri Sohi, {pwells, sohi}@cs.wisc.edu
     HPCA, Feb. 2008

  2. Serializing instructions overview
     • Serializing instructions (SIs) have complex dependencies
        • Difficult to execute out of order (OoO); they often serialize the pipeline
        • E.g., writes to control registers
     • SIs are frequent in OS code, across ISAs
        • They reduce OS performance by 8-45%
     • Values produced by SIs are often effectively useless (EU)
        • EU prediction allows consumers to proceed
        • Consumers may read a stale value, but still execute correctly
        • Improves OS performance by 6-35%
     Philip Wells - HPCA 2008

  3. Talk outline
     • Serializing instructions
        • Description, implementation & performance
     • Characterization
        • Frequency across 3 ISAs
        • Useful consumption
     • Effectively useless prediction
        • Overview & operation
        • Performance results
     • Summary

  4. What are SIs?
     • Talk focus: writes to non-renamed control registers
        • E.g., explicit writes, exceptions & returns
     • Not renamed due to complex dependencies
        • Read by control logic at many pipeline stages
        • Difficult to execute OoO
     • Most processors serialize the pipeline
        • Discussion of real implementations in paper
     [Figure: fields of the SPARC %pstate register (IG, PRIV, MG, CLE, TLE, MM, RED, PEF, AM, IE, AG), read at the Fetch, Decode, Execute and Commit stages]

  5. Effects of Amdahl’s law
     [Figure: execution of OS code (Ideal SPARC); fetch stall on SI (% of cycles)]
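As a rough worked illustration of the Amdahl's-law argument (standard formulas, not figures from the talk): if SI fetch stalls occupy a fraction $f_{\text{stall}}$ of OS cycles and the remaining cycles are assumed unaffected, removing the stalls entirely bounds the OS speedup. Any OS-only speedup $S_{\text{OS}}$ is then diluted by the fraction $f_{\text{OS}}$ of time spent in the OS. The numeric values below are hypothetical.

```latex
S_{\text{OS,max}} = \frac{1}{1 - f_{\text{stall}}}, \qquad
S_{\text{overall}} = \frac{1}{(1 - f_{\text{OS}}) + f_{\text{OS}} / S_{\text{OS}}}

% Hypothetical example: f_stall = 0.2 gives S_OS,max = 1/0.8 = 1.25;
% f_OS = 0.5 and S_OS = 1.2 give S_overall = 1/(0.5 + 0.4167) \approx 1.09.
```

This dilution is why an OS-only gain shows up as a smaller overall gain, as in the 6-35% OS versus 2-12% overall figures in the summary.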

  6. SI discussion
     • SIs have received little research attention
        • They mostly affect OS code
        • They are largely absent in SPEC or short traces
        • They are viewed as specific to a particular implementation
     • Our characterization shows that
        • SIs are important for system-intensive apps
        • The characterization is similar across multiple ISAs
        • Implementations are similar across multiple processors

  7. Outline
     • Serializing instructions
     • Characterization
        • Frequency across 3 ISAs
        • Useful consumption
     • Effectively useless prediction
     • Summary

  8. Characterization of SIs
     • Methodology
        • Several commercial workloads
        • SPARC, x86-64 & PowerPC platforms on Simics
        • ‘Normal’ SPARC: with register window and TLB traps
        • ‘Ideal’ SPARC: register window traps removed & hardware-filled TLB
        • Uniprocessor systems
        • Details in paper

  9. SI frequency
     • Frequent across ISAs
     • Similar profile, dominated by register writes
     • Frequent exceptions in ‘Normal’ SPARC
     [Figure panels: Ideal SPARC, x86, PowerPC, ‘Normal’ SPARC]

  10. Effectively useless (EU) writes
     • Many writes to non-renamed registers are EU
        • They produce a new value
        • Consumers read the value
        • But the consumers' execution is unaffected

  11. EU characterization
     • Most values are quickly consumed, but are not useful to the first consumers
        • 30% of writes are consumed by the next instruction
        • < 20% of writes are useful within 1023 instructions
     [Figure: Zeus on Ideal SPARC, implicit consumers only; compared with "Dyn Dead" [Butts & Sohi ’02]]
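The two statistics on this slide could be computed from an instruction trace roughly as follows. This is a hypothetical sketch: the function name, the trace format and the per-read "useful" flag are assumptions, and deciding whether a read was useful (for example, by checking whether the consumer behaves differently with the old value) is left to the trace generator.

```python
def characterize(trace, window=1023):
    """Return (fraction of control-register writes consumed by the very next
    instruction, fraction with a *useful* consumer within `window` instructions).

    Hypothetical trace format: one dict per dynamic instruction with
      "writes": set of control registers written, and
      "reads":  dict mapping each control register read -> True if the read
                was useful (the consumer's execution depended on the new value).
    """
    writes = consumed_next = useful_soon = 0
    for i, insn in enumerate(trace):
        for reg in insn["writes"]:
            writes += 1
            if i + 1 < len(trace) and reg in trace[i + 1]["reads"]:
                consumed_next += 1
            # Scan forward until a useful read, an overwrite, or the window ends.
            for later in trace[i + 1 : i + 1 + window]:
                if later["reads"].get(reg, False):
                    useful_soon += 1
                    break
                if reg in later["writes"]:
                    break  # overwritten before any useful consumption
    if writes == 0:
        return 0.0, 0.0
    return consumed_next / writes, useful_soon / writes
```

Stopping the scan at the next write to the same register reflects that later consumers read the newer value, not the one being characterized.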

  12. Why effectively useless?
     • Control registers have many fields
        • SIs write the entire register, so the decode stage must serialize
        • But they often update only one field
        • Turn off interrupts (from Solaris 9):
              rdpr %pstate, %o5
              andn %o5, 2, %o4
              wrpr %o4, 0, %pstate      ← serializing instruction!
     • EU subsumes both
        • Dynamically dead [Butts ’02] & silent writes [Lepak ’00]
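A minimal Python sketch of why the interrupt-disable write is often EU, assuming the SPARC V9 %pstate bit positions (IE = bit 1, PRIV = bit 2, AM = bit 3, PEF = bit 4). The helper names (`disable_interrupts`, `changed_fields`, `consumer_unaffected`) are hypothetical. The sequence rewrites the whole register but changes only IE, so a consumer that depends only on other fields behaves the same with the stale value.

```python
# Hypothetical field masks, assuming the SPARC V9 PSTATE bit layout
# (AG = bit 0, IE = bit 1, PRIV = bit 2, AM = bit 3, PEF = bit 4).
PSTATE_IE = 1 << 1
PSTATE_PRIV = 1 << 2
PSTATE_AM = 1 << 3
PSTATE_PEF = 1 << 4


def disable_interrupts(pstate: int) -> int:
    """Model the Solaris 9 sequence from the slide:
    rdpr %pstate, %o5 ; andn %o5, 2, %o4 ; wrpr %o4, 0, %pstate
    """
    o5 = pstate      # rdpr: explicit read of the whole register
    o4 = o5 & ~2     # andn with immediate 2 clears only the IE bit
    return o4 ^ 0    # wrpr writes rs1 XOR imm (imm = 0): the entire register


def changed_fields(old: int, new: int) -> int:
    """Bit mask of the fields a write actually modified."""
    return old ^ new


def consumer_unaffected(old: int, new: int, fields_read: int) -> bool:
    """True if a consumer that depends only on `fields_read` sees no change,
    i.e., reading the stale value is harmless for it."""
    return (changed_fields(old, new) & fields_read) == 0


old = PSTATE_PRIV | PSTATE_PEF | PSTATE_IE
new = disable_interrupts(old)
print(changed_fields(old, new) == PSTATE_IE)                   # True
print(consumer_unaffected(old, new, PSTATE_PRIV | PSTATE_AM))  # True
```

Running the same sequence again when IE is already clear produces a silent write (`disable_interrupts(new) == new`), the other EU case the slide mentions.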

  13. Outline
     • Serializing instructions
     • Characterization
     • Effectively useless prediction
        • Overview & operation
        • Performance results
     • Summary

  14. Effectively useless prediction
     • Goals
        • Allow EU writes and their consumers to execute OoO
        • Few changes to pipeline & datapath
        • Easy test to ensure consumers execute correctly
     • Overview
        • Allow consumers to proceed under certain conditions
        • Guarantee that non-faulting consumers execute correctly

  15. EU prediction operation
     [Figure: EU Prediction Table ("Was this write EU last time?") and Outstanding Write Table (status of writes to each control register; columns P, B, C, WritePtr; rows for %pstate and %fprs)]
     • At fetch/decode, SIs:
        1) Make an EU prediction
        2) Update their status in the Outstanding Write Table
     • At decode, consumers:
        1) Check each control register they read
        2) Proceed if all outstanding writes are predicted EU (they may read a stale value)
     • At execute, on a consumer exception:
        1) Squash if the consumer proceeded past an EU write
     • At write-back/commit, SIs:
        1) Check for useful changes
        2) Squash younger instructions if the change is useful
        3) Update their status & the EU prediction table
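The operation above can be sketched in software as follows. This is a hypothetical model, not the paper's hardware: the class and method names are invented, the EU prediction table is assumed to be a 1-bit "was this write EU last time?" entry per SI PC, and the Outstanding Write Table is reduced to a per-register count of in-flight writes predicted useful.

```python
class EUPredictor:
    """Hypothetical sketch of EU prediction (not the paper's exact design)."""

    def __init__(self, control_regs):
        # EU prediction table: SI PC -> was this write EU last time?
        self.eu_table = {}
        # Simplified Outstanding Write Table: per control register, the
        # number of in-flight writes that were predicted *useful* (not EU).
        self.owt = {reg: 0 for reg in control_regs}

    def predict(self, pc):
        # Unseen SIs are conservatively predicted useful (not EU).
        return self.eu_table.get(pc, False)

    def fetch_si(self, pc, reg):
        """SI at fetch/decode: make a prediction and update the OWT."""
        predicted_eu = self.predict(pc)
        if not predicted_eu:
            self.owt[reg] += 1
        return predicted_eu

    def consumer_may_proceed(self, regs_read):
        """Consumer at decode: proceed only if every outstanding write to each
        control register it reads is predicted EU (it may read a stale value)."""
        return all(self.owt[reg] == 0 for reg in regs_read)

    def commit_si(self, pc, reg, predicted_eu, useful):
        """SI at commit: train the predictor and return True if younger
        instructions must be squashed (they may have read a stale value)."""
        if not predicted_eu:
            self.owt[reg] -= 1
        self.eu_table[pc] = not useful
        return predicted_eu and useful
```

A short trace of use: the first instance of an SI at PC 0x100 is predicted useful, so consumers of %pstate wait; once it commits as EU, the next instance is predicted EU and consumers proceed, and if that instance turns out useful, `commit_si` requests a squash and retrains the entry.

```python
p = EUPredictor(["pstate", "fprs"])
eu = p.fetch_si(0x100, "pstate")              # False: first time, predicted useful
p.consumer_may_proceed(["pstate"])            # False: consumers wait
p.commit_si(0x100, "pstate", eu, useful=False)
eu = p.fetch_si(0x100, "pstate")              # True: predicted EU this time
p.consumer_may_proceed(["pstate"])            # True: consumers proceed
p.commit_si(0x100, "pstate", eu, useful=True) # True: squash younger instructions
```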

  16. What are useful changes?
     • A write is useful unless:
        1) The write is silent (~14%), or
        2) The change will only affect faulting instructions (~65%)
           • Setting the FEF field of %fprs to one
           • The interrupt example earlier
           • Several other common cases
     • Overly conservative
        • But captures the most common cases
        • Satisfies the goal of a simple test
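A sketch of such a conservative test, under stated assumptions: the bit positions follow the SPARC V9 layouts (%fprs.FEF = bit 2, %pstate.IE = bit 1), and the `FAULT_ONLY` rule table and `is_useful` name are hypothetical. Only the two cases named on the slide are encoded; the "several other common cases" are not, so any other changed field is treated as useful.

```python
# Hypothetical encodings, assuming SPARC V9 layouts.
FPRS_FEF = 1 << 2    # %fprs.FEF: floating-point enable
PSTATE_IE = 1 << 1   # %pstate.IE: interrupt enable

# (register, field mask) -> direction of change assumed to affect only
# instructions that would fault or be interrupted: enabling FP can only
# remove fp_disabled traps; clearing IE can only suppress interrupts.
FAULT_ONLY = {
    ("fprs", FPRS_FEF): "set",
    ("pstate", PSTATE_IE): "clear",
}


def is_useful(reg, old, new):
    """Conservative test: a write is useful unless it is silent or every
    field it changes matches a fault-only rule."""
    changed = old ^ new
    if changed == 0:
        return False  # 1) silent write
    for (rule_reg, mask), direction in FAULT_ONLY.items():
        if rule_reg != reg or not (changed & mask):
            continue
        went_set = bool(new & mask)
        if (direction == "set") == went_set:
            changed &= ~mask  # 2) this change only affects faulting instructions
    return changed != 0  # anything left over is conservatively useful
```

For example, `is_useful("fprs", 0, FPRS_FEF)` is `False` (setting FEF), while clearing FEF, or clearing IE together with another %pstate field, is reported as useful.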

  17. EU prediction methodology
     • OoO processor
        • 128-entry instruction window
        • 15-stage pipeline
        • 32 kB L1 I/D, 1 MB L2
        • 265-cycle main memory
     • Simics MAI as a dynamic trace generator
        • Adapts to changes due to timing
        • Faithfully models wrong-path events
     • Ideal SPARC
     • Details in paper

  18. EU prediction results
     [Figure: speedup, OS and overall]

  19. Also in the paper
     • More characterization & results
        • Useless TLB writes
        • EU prediction accuracy
        • Large-window processor
     • Two other ‘baseline’ implementations
        • Scoreboard
        • LateSquash
     • Discussion of SIs in real implementations:
        • Pentium M, Alpha 21264, PowerPC 750, UltraSPARC IIICu

  20. Summary
     • Present the first analysis of serializing instructions
        • Frequent across three ISAs
        • Limit OoO parallelism in OS code
        • Rival the impact of L2 misses (8-45% for OS)
     • Many SI writes are effectively useless (EU)
     • Propose EU prediction
        • Predict that writers and consumers can execute OoO
        • Consumers may read a stale value, but execute properly anyway
        • 6-35% OS improvement (2-12% overall)
        • Not a panacea, but simple and works fairly well

  21. Thank you! Questions, comments: pwells@cs.wisc.edu, http://www.cs.wisc.edu/~pwells

  22. Backup Slides

  23. Other SI implementations
     • Reminder: the baseline blocks all instructions younger than an SI
     • Technique 1: “Scoreboard”
        • Track outstanding SI writes (similar to the OWT)
        • Determine at which stage to block consumers
        • Identify independent instructions
     • Technique 2: “LateSquash”
        • Instructions following an SI enter the pipeline and execute OoO
        • They are squashed just before the SI executes
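A toy comparison of how many instructions the baseline and the Scoreboard technique hold back, under simplifying assumptions: every SI in the window is still in flight, and each instruction lists the control registers it reads, explicitly or implicitly. The function name and input format are hypothetical, and LateSquash is not modeled.

```python
def blocked_counts(window):
    """window: list of (writes, reads) sets of control registers, oldest first.
    Returns (blocked under baseline, blocked under scoreboard)."""
    baseline = scoreboard = 0
    pending = set()   # control registers with an outstanding SI write
    seen_si = False
    for writes, reads in window:
        if seen_si:
            baseline += 1     # baseline: every instruction younger than an SI waits
        if reads & pending:
            scoreboard += 1   # scoreboard: only dependent instructions wait
        if writes:
            seen_si = True
            pending |= writes
    return baseline, scoreboard
```

Because nearly every instruction implicitly reads %pstate, a scoreboard mainly helps for writes to registers with few consumers, such as %fprs in this toy input: `blocked_counts([({"fprs"}, set()), (set(), set()), (set(), {"fprs"}), (set(), {"pstate"})])` blocks three instructions under the baseline but only one under the scoreboard.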

  24. EU prediction results
     [Figure: speedup, OS and overall]

  25. Why not value prediction?
     • Last-value prediction for non-renamed registers
        • Can be modified to accurately predict many values
        • Can avoid serializing on all non-renamed registers (not just EU writes)
     • Requires the predicted value to be sent to every stage where it might be used
        • Avoiding this is the reason SIs exist in the first place
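For contrast with EU prediction, a last-value predictor for a non-renamed register might look like the minimal sketch below. The class and method names and the per-SI-PC indexing are assumptions. The data structure is simple; the cost named on the slide lies in delivering the predicted value to every pipeline stage that might read it.

```python
class LastValuePredictor:
    """Hypothetical per-SI-PC last-value predictor for a non-renamed register."""

    def __init__(self):
        self.last = {}   # SI PC -> value this SI wrote last time

    def predict(self, pc):
        # None means no prediction yet (the pipeline would serialize as usual).
        return self.last.get(pc)

    def train(self, pc, actual):
        """Record the committed value; return whether the prediction was correct."""
        correct = self.last.get(pc) == actual
        self.last[pc] = actual
        return correct
```

Unlike EU prediction, a correct prediction here lets consumers that genuinely depend on the new value proceed too, which is why it can cover all non-renamed writes rather than only EU ones.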

  26. Explicit vs. implicit consumers
     • Explicit consumers
        • Name their operands & use them at the execute stage
     • Implicit consumers
        • Don’t name them & use values at a variety of pipeline stages
        • Are the reason writes to non-renamed registers serialize
              rdpr %pstate, %o5         ← explicit consumer of %pstate
              andn %o5, 2, %o4
              wrpr %o4, 0, %pstate      ← SI
              brnz %o1, 0x5ca8          ← implicit consumers of %pstate (this and following)
              sethi %hi(0x140), %o3
              …
