Serializing Instructions in System Intensive Workloads

Serializing Instructions in System Intensive Workloads (Amdahl’s Law Strikes Again)
Philip Wells and Guri Sohi
{pwells, sohi}@cs.wisc.edu
HPCA, February 2008

Serializing instructions overview
• Serializing instructions (SIs) have complex dependencies
  • Difficult to execute out of order (OoO); often serialize the pipeline
  • E.g. writes to control registers
• SIs are frequent in OS code, across ISAs
  • Reduce OS performance by 8-45%
• Values produced by SIs are often effectively useless (EU)
• EU prediction allows consumers to proceed
  • May read a stale value, but execute correctly
  • Improves OS performance by 6-35%
Philip Wells - HPCA 2008

Talk outline
• Serializing instructions
  • Description, implementation & performance
• Characterization
  • Frequency across 3 ISAs
  • Useful consumption
• Effectively useless prediction
  • Overview & operation
  • Performance results
• Summary

What are SIs?
• Talk focus: writes to non-renamed control registers
  • E.g. explicit writes, exceptions & returns
• Not renamed due to complex dependencies
  • Read by control logic at many pipeline stages
• Difficult to execute OoO
  • Most processors serialize the pipeline
  • Discussion of real implementations in paper
[Figure: %pstate register fields (IG, PRIV, MG, CLE, TLE, MM, RED, PEF, AM, IE, AG) and pipeline stages Fetch, Decode, Execute, Commit]

Effects of Amdahl’s law
[Figure: execution of OS code (Ideal SPARC); axis: fetch stall on SI (% of cycles)]

SI discussion
• SIs have received little research attention
  • Mostly affect OS code
  • Largely absent in SPEC or short traces
  • Viewed as specific to a particular implementation
• Our characterization shows that
  • SIs are important for system-intensive apps
  • Characterization is similar across multiple ISAs
  • Implementations are similar across multiple processors

Outline
• Serializing instructions
• Characterization
  • Frequency across 3 ISAs
  • Useful consumption
• Effectively useless prediction
• Summary

Characterization of SIs
• Methodology
  • Several commercial workloads
  • SPARC, X86-64 & PowerPC platforms on Simics
    • ‘Normal’ SPARC: with register window and TLB traps
    • ‘Ideal’ SPARC: register window traps removed & HW-fill TLB
  • Uniprocessor systems
  • Details in paper

SI frequency
• Frequent across ISAs
• Similar profile & dominated by register writes
• Frequent exceptions in ‘Normal’ SPARC
[Figure: SI frequency; panels: Ideal SPARC, X86, PowerPC, ‘Normal’ SPARC]

Effectively useless (EU) writes
• Many non-renamed register writes are EU
  • Produce a new value
  • Consumers read the value
  • But their execution is unaffected

EU characterization
• Most values are quickly consumed, but not useful to the first consumers
  • 30% of writes are consumed by the next instruction
  • < 20% of writes are useful within 1023 instructions
• Zeus on Ideal SPARC, implicit consumers only
[Figure: EU characterization plot; series label: Dyn Dead [Butts & Sohi ‘02]]

Why effectively useless?
• Control registers have many fields
  • SIs write the entire register; decode stage must serialize
  • But often only update one field
• Turn off interrupts (from Solaris 9):
    rdpr  %pstate, %o5
    andn  %o5, 2, %o4
    wrpr  %o4, 0, %pstate    ! serializing instr!
• EU subsumes both
  • Dynamically dead [Butts ‘02] & silent writes [Lepak ‘00]
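The Solaris sequence above can be mirrored in a few lines of Python to show that the write replaces the whole register while changing at most one field. This is only an illustrative sketch: the field masks follow the SPARC V9 PSTATE layout as an assumption (the slides do not give bit positions), and the starting value `0x016` is hypothetical.

```python
# Field masks assumed from the SPARC V9 PSTATE layout (not given in the slides).
PSTATE_FIELDS = {
    "AG": 0x001, "IE": 0x002, "PRIV": 0x004, "AM": 0x008, "PEF": 0x010,
    "RED": 0x020, "MM": 0x0C0, "TLE": 0x100, "CLE": 0x200,
}


def disable_interrupts(pstate):
    """Mirror rdpr / andn ..., 2 / wrpr: clear mask 0x2, write back every bit."""
    return pstate & ~0x2


def changed_fields(old, new):
    """Names of the fields whose bits differ between two register values."""
    diff = old ^ new
    return [name for name, mask in PSTATE_FIELDS.items() if diff & mask]


old = 0x016  # hypothetical starting value: PEF | PRIV | IE
new = disable_interrupts(old)
print(changed_fields(old, new))  # fields touched by the first write
print(changed_fields(new, disable_interrupts(new)))  # repeating it is a silent write
```

Every consumer of a field other than the changed one would have behaved identically with the stale value, which is the intuition behind calling such a write effectively useless.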

Outline
• Serializing instructions
• Characterization
• Effectively useless prediction
  • Overview & operation
  • Performance results
• Summary

Effectively useless prediction
• Goals
  • Allow EU writes and consumers to execute OoO
  • Few changes to pipeline & datapath
  • Easy test to ensure consumers execute correctly
• Overview
  • Allow consumers to proceed under certain conditions
  • Guarantee non-faulting consumers execute correctly

EU prediction operation
• Structures
  • EU Prediction Table: was this write EU last time?
  • Outstanding Write Table (OWT): status of writes to each control reg
• SIs (early in the pipeline):
  1) Make EU prediction
  2) Update status
• Consumers:
  1) Check each control reg
  2) Proceed if all writes are EU (may read stale value)
• Consumer exception:
  1) Squash if proceeded past EU write
• SIs (late in the pipeline):
  1) Check for useful changes
  2) Squash younger instr. if useful cons.
  3) Update status & EU prediction table
[Figure: pipeline (Fetch, Decode, Issue, Execute, Write Back, Commit) with the EU Prediction Table, accessed by write PC, and the Outstanding Write Table; visible column labels: P, B, C, WritePtr; visible rows: pstate, fprs]
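The operation described above can be sketched as a small software model. This is a simplified illustration, not the hardware design from the paper: the class and field names (`EUPredictor`, `OwtEntry`, `passed`) are hypothetical, the prediction table is modeled as a last-outcome dictionary keyed by write PC, and unseen writes are assumed to default to "useful" so that they serialize.

```python
from dataclasses import dataclass


@dataclass
class OwtEntry:
    """Per-register status in a toy Outstanding Write Table (names are hypothetical)."""
    pending: int = 0           # writes decoded but not yet resolved
    predicted_useful: int = 0  # pending writes that were NOT predicted EU
    passed: bool = False       # a consumer proceeded past a pending write


class EUPredictor:
    def __init__(self, control_regs):
        self.last_eu = {}  # write PC -> was this write EU last time?
        self.owt = {reg: OwtEntry() for reg in control_regs}

    def decode_write(self, pc, reg):
        """SI early in the pipeline: make the EU prediction and update status."""
        predicted_eu = self.last_eu.get(pc, False)  # assumption: unseen writes serialize
        entry = self.owt[reg]
        entry.pending += 1
        if not predicted_eu:
            entry.predicted_useful += 1
        return predicted_eu

    def consumer_may_proceed(self, regs_read):
        """Consumer: proceed only if every outstanding write it depends on is predicted EU."""
        if any(self.owt[r].predicted_useful for r in regs_read):
            return False
        for r in regs_read:
            if self.owt[r].pending:
                self.owt[r].passed = True  # may have read a stale value
        return True

    def resolve_write(self, pc, reg, predicted_eu, useful):
        """SI late in the pipeline: check the change, decide on a squash, train the table."""
        entry = self.owt[reg]
        entry.pending -= 1
        if not predicted_eu:
            entry.predicted_useful -= 1
        squash = predicted_eu and useful and entry.passed
        if squash or entry.pending == 0:
            entry.passed = False
        self.last_eu[pc] = not useful
        return squash
```

In this model a first-time write blocks its consumers, a write that turned out EU lets them proceed the next time, and a misprediction is caught when the write resolves. The consumer-exception squash from the slide is left out to keep the sketch short.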

What are useful changes?
• Useful unless:
  1) The write is silent (~14%)
  2) Change will only affect faulting instructions (~65%)
    • Setting FEF field of %fprs to one
    • Interrupt example earlier
    • Several other common cases
• Overly conservative
  • But captures most common cases
  • Satisfies goal of simple test
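The "useful unless" test above can be written as a short predicate. This is a minimal sketch under stated assumptions: the `BENIGN` table holds only the two cases named on the slides, the bit masks (FEF = 0x4 in %fprs, IE = 0x2 in %pstate) are assumed from the SPARC V9 register layouts, and the function name is hypothetical. The "several other common cases" are not listed in the slides, so they are not modeled.

```python
# (register, field mask, value the field changes to) for changes that the slides
# describe as affecting only faulting instructions. Masks are assumptions.
BENIGN = {
    ("fprs", 0x4, 1),    # setting FEF to one
    ("pstate", 0x2, 0),  # clearing the interrupt-enable bit (earlier example)
}


def is_useful_change(reg, old, new):
    """Conservative test: a write is useful unless it is silent or a known benign change."""
    if old == new:
        return False  # silent write
    diff = old ^ new
    for benign_reg, mask, to_value in BENIGN:
        if benign_reg == reg and diff == mask and bool(new & mask) == bool(to_value):
            return False  # only the benign field changed, in the benign direction
    return True  # anything else is treated as useful (overly conservative)


print(is_useful_change("pstate", 0x016, 0x014))  # interrupts turned off
print(is_useful_change("pstate", 0x014, 0x016))  # interrupts turned back on
```

Anything the table does not recognize falls through to "useful", which matches the slide's point that the test is overly conservative but simple.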

EU prediction methodology
• OoO processor
  • 128-entry instr. window
  • 15-stage pipe
  • 32kB L1I/D, 1MB L2
  • 265-cycle main mem
• Simics MAI as a dynamic trace generator
  • Adapts to changes due to timing
  • Faithfully models wrong-path events
• Ideal SPARC
• Details in paper

EU prediction results
[Figure: speedup from EU prediction; panels: OS Speedup, Overall]

Also in the paper
• More characterization & results
  • Useless TLB writes
  • EU prediction accuracy
  • Large window processor
• Two other ‘baseline’ implementations
  • Scoreboard
  • LateSquash
• Discussion of SIs in real implementations:
  • Pentium M, Alpha 21264, PowerPC 750, UltraSPARC IIICu

Summary
• Present first analysis of serializing instructions
  • Frequent across three ISAs
  • Limit OoO parallelism in OS code
  • Rival impact of L2 misses (8-45% for OS)
  • Many SI writes are effectively useless (EU)
• Propose EU prediction
  • Predict writers and consumers can execute OoO
  • May read stale value, but execute properly anyway
  • 6-35% OS improvement (2-12% overall)
  • Not a panacea, but simple and works fairly well
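The gap between the OS improvement and the overall improvement is what Amdahl's law would lead one to expect: only the OS share of execution time gets faster. A rough sketch of that relationship follows. The OS fractions used here are hypothetical, since the slides do not state how much time each workload spends in the OS, and 1.35 is simply the upper end of the 6-35% OS range.

```python
def overall_speedup(os_fraction, os_speedup):
    """Amdahl's law: only the OS fraction of execution time is sped up."""
    return 1.0 / ((1.0 - os_fraction) + os_fraction / os_speedup)


# os_fraction values are hypothetical, chosen only to show the trend.
for os_fraction in (0.1, 0.4, 0.8):
    speedup = overall_speedup(os_fraction, 1.35)
    print(f"OS fraction {os_fraction:.0%}: overall speedup {speedup:.3f}x")
```

The smaller the share of time spent in OS code, the less of the OS speedup shows up in the overall number.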

Thank you!
Questions, comments: pwells@cs.wisc.edu
http://www.cs.wisc.edu/~pwells

Backup Slides

Other SI implementations
• Reminder:
  • Baseline blocks all younger instructions after SI
• Technique 1: “Scoreboard”
  • Track outstanding SI writes (similar to OWT)
  • Determine which stage to block consumers
  • Identify independent instructions
• Technique 2: “LateSquash”
  • Instructions following SI enter pipeline, execute OoO
  • Squashed just before SI executes

Why not value prediction?
• Last value prediction for non-renamed registers
  • Can be modified to accurately predict many values
  • Can avoid serializing all non-renamed regs (not just EU)
• Requires predicted value to be sent to every stage where it might be used
  • Avoiding this is the reason SIs exist in the first place

Explicit vs. implicit consumers
• Explicit consumers
  • Name their operands & use them at execute stage
• Implicit consumers
  • Don’t name them & use values at a variety of pipeline stages
  • Are the reason writes to non-renamed regs serialize
    rdpr  %pstate, %o5         ! explicit consumer of %pstate
    andn  %o5, 2, %o4
    wrpr  %o4, 0, %pstate      ! SI
    brnz  %o1, 0x5ca8          ! implicit consumers of %pstate from here on
    sethi %hi(0x140), %o3
    …
