
Faster unicores are still needed


Presentation Transcript


  1. Faster unicores are still needed André Seznec INRIA/IRISA

  2. DAL: Defying Amdahl's Law • ERC advanced grant to A. Seznec (2011-2016) • DAL objective: « Given that Amdahl's Law is forever, propose (and impact) the microarchitecture of the 2020 general-purpose manycore »

  3. Multicores are everywhere • Multicores in servers, desktops, laptops • 2-4-8-12 out-of-order cores • Multicores in smartphones, tablets • 2-4 (not that simple) cores • Manycores for niche markets • 48-80-100 simple cores • Tilera, Intel Phi

  4. Multicore/multithread for everyone • End user: improved usage comfort • Can surf the web while listening to MP3s • Parallel performance for the masses? • Very few (scalable) mainstream parallel apps • Graphics • Niche market segments

  5. No parallel software bonanza in the near future • Inheritance of sequential legacy codes • Parallelism is not cost-effective for most apps • Sequential programming will remain dominant

  6. Inheritance of sequential legacy codes • Software is more resilient than hardware • Apps survive and evolve for years, often decades • Very few parallel apps now • Redeveloping parallel apps from scratch is unlikely • Compute-intensive sections will be parallelized • But significant code sections will remain sequential

  7. Parallelism is not cost-effective for most apps • Why parallelism? Only for performance • But costly: difficult, time-consuming, error-prone • Poorly portable, in both functionality and performance

  8. Sequential programming will remain dominant • Just easier • The « Joe » programmer • Portability, maintenance, debugging • Plus compilers to parallelize • Plus parallel libraries • Plus software components (developed by experts)

  9. Looking backwards

  10. 2002: The End of the Uniprocessor Road • Power and temperature walls stopped the frequency increase • 2x transistors: 5%? 10%? performance gain (if any) • The economical logic would be to buy smaller chips! • But the IC industry needs to sell new (expensive) chips • Marketing: « You need hyperthreading, 2, 4, 8 cores »

  11. Marketing multicores to the masses, 2002-... • [Timeline: SMT, dual-core, SMT dual-core, quad-core SMT, ...] • GREAT!!

  12. And now? The end user is not such a fool...

  13. Following the trend: 2020 • For the same silicon area and power envelope: • ≈ 100 Nehalem-class cores, or • ≈ 1,000 simple cores (VLIW, in-order superscalar)

  14. Amdahl's Law • [Figure: execution time split into a sequential part and a parallel part] • « Cannot run faster than the sequential part »
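For reference, Amdahl's Law in its standard form, with s the sequential fraction and N the number of processors:

```latex
\mathrm{Speedup}(N) = \frac{1}{s + (1-s)/N} \;\xrightarrow[N \to \infty]{}\; \frac{1}{s}
```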

  15. OK, parallel applications do not scale • Our recent study on parallel application scaling: • In general, bp > -1: sublinear scaling • Sometimes, bs > 0: the sequential part increases • [Plot: execution time as a function of input set and processor number]
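The slide's parameters suggest a power-law fit of execution time against the processor count P for a fixed input set; the exact model is not spelled out here, so the following is only a plausible reading:

```latex
T(P) \approx t_s \, P^{\,b_s} + t_p \, P^{\,b_p}
```

Perfect scaling would give b_p = -1 and b_s = 0; b_p > -1 therefore means the parallel part scales sublinearly, and b_s > 0 means the sequential part grows with the processor count.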

  16. But let us use a naive (overoptimistic) model • A parallel application: • Parallel section: can use 1,000 processors, with linear speed-up • Sequential section: runs on a single processor • SEQ: constant fraction of sequential code • (a sketch of this model follows below)
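A minimal sketch of this naive model in Python (the function names are mine, not from the talk), with time normalized to 1.0 on a single simple core:

```python
def exec_time(seq, seq_speed, par_throughput):
    """Naive model: the sequential fraction `seq` runs on one core of
    relative speed `seq_speed`; the rest runs with linear speed-up on
    `par_throughput` simple-core equivalents."""
    return seq / seq_speed + (1.0 - seq) / par_throughput

def speedup(seq, seq_speed, par_throughput):
    # Speedup relative to a single simple core (time = 1.0).
    return 1.0 / exec_time(seq, seq_speed, par_throughput)

# Example: 1% sequential code on 1,000 simple cores.
print(speedup(0.01, 1.0, 1000))   # ~91x, far below the ideal 1000x
```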

  17. Complex cores against simple cores • CC: 100 complex cores vs. SC: 1,000 simple cores, with a complex core 2x faster than a simple one • If SEQ > 0.8%, then CC > SC (checked below)
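A quick check of the 0.8% figure under the naive model (my own script, not from the talk): CC runs the sequential part on a complex core at 2x and the parallel part on 100 x 2 = 200 simple-core equivalents; SC uses 1,000 simple cores.

```python
def t_cc(s):   # 100 complex cores, each 2x a simple core
    return s / 2 + (1 - s) / 200

def t_sc(s):   # 1,000 simple cores
    return s + (1 - s) / 1000

# Solve t_cc(s) = t_sc(s):  -s/2 + (1 - s) * (1/200 - 1/1000) = 0
s = 0.004 / 0.504
print(f"break-even SEQ = {s:.3%}")   # ~0.794%, the slide's 0.8%
```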

  18. And a hybrid SC + CC? CC_SC: • 50 complex cores • 500 simple cores • If SEQ > 0.2%, then CC_SC > SC
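Same exercise for the hybrid, assuming (my reading) that the sequential part runs on one complex core at 2x and the parallel section uses only the 500 simple cores; this reproduces the slide's 0.2%:

```python
def t_cc_sc(s):   # sequential on one complex core (2x),
    return s / 2 + (1 - s) / 500   # parallel on 500 simple cores

def t_sc(s):      # 1,000 simple cores
    return s + (1 - s) / 1000

# Solve t_cc_sc(s) = t_sc(s):  (1 - s)/1000 = s/2  =>  s = 2/1002
print(f"break-even SEQ = {2 / 1002:.3%}")   # ~0.200%
```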

  19. And if... • We used a huge amount of resources for a single core: • 10x the area of the complex core • 10x the power of the complex core • Use all the uniprocessor techniques: • Very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction • Invent new techniques • → Ultra complex cores

  20. DAL architecture proposition • Heterogeneous architecture: • A few ultra complex cores • to enable performance on sequential codes and/or critical sections • A « sea » of simple cores • for parallel sections

  21. For the naive model, « DAL » = UC_SC: 5 ultra complex cores + 500 simple cores • If SEQ > 0.13%, then « DAL » > SC • « DAL » is always better than UC, CC, CC_SC
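The slide does not state the ultra complex core's speed; assuming it runs sequential code 4x faster than a simple core (twice a complex core) while the parallel section again uses the 500 simple cores, the naive model reproduces the 0.13% figure:

```python
def t_dal(s):   # sequential on one ultra complex core (assumed 4x),
    return s / 4 + (1 - s) / 500   # parallel on 500 simple cores

def t_sc(s):    # 1,000 simple cores
    return s + (1 - s) / 1000

# Solve t_dal(s) = t_sc(s):  (1 - s)/1000 = 3s/4  =>  s = 4/3004
print(f"break-even SEQ = {4 / 3004:.3%}")   # ~0.133%
```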

  22. Need for research on faster unicores • Silicon area is a 2nd-order issue: • Can use the area of 10 complex cores • Power/energy is a 2nd-order issue: • Can use the power of 10 complex cores

  23. Ongoing work: Revisiting Value Prediction, with Arthur Pérais

  24. Value prediction? Lipasti et al., Gabbay and Mendelson, 1996 • Basic idea: eliminate (some) true data dependencies by predicting instruction results • [Diagram: a dependence chain through I0, I1, I3, I4, I5 collapses when results are predicted]

  25. Value Prediction • Large body of research, 1996-2002 • Quite efficient: surprisingly high number of predictable instructions • Not implemented so far: • High cost: is it still relevant now? • High misprediction penalty: don't lose all the benefit

  26. Last Value Predictor • Just predict the last produced value • Set-associative table indexed by the PC • Use confidence counters • Analogy with PC-based branch prediction (see the sketch below)
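A toy sketch of a last value predictor (my own illustration, not from the talk), using a direct-mapped table for brevity rather than the set-associative one on the slide:

```python
class LastValuePredictor:
    def __init__(self, size=4096, conf_max=7):
        self.entries = [{"last": 0, "conf": 0} for _ in range(size)]
        self.size, self.conf_max = size, conf_max

    def predict(self, pc):
        """Return the predicted result, or None if confidence is low."""
        e = self.entries[pc % self.size]
        return e["last"] if e["conf"] == self.conf_max else None

    def update(self, pc, value):
        """Train on the committed result of the instruction at `pc`."""
        e = self.entries[pc % self.size]
        if value == e["last"]:
            e["conf"] = min(e["conf"] + 1, self.conf_max)
        else:
            e["last"], e["conf"] = value, 0   # new value: reset confidence
```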

  27. Stride value predictor • Predict last value + (last observed difference), indexed by the PC • Analogy with a stride prefetcher, but also with a loop predictor
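The same toy framework extended with a stride (again illustrative): each entry keeps the last value and the last difference, and predicts last + stride.

```python
class StridePredictor:
    def __init__(self, size=4096, conf_max=7):
        self.entries = [{"last": 0, "stride": 0, "conf": 0}
                        for _ in range(size)]
        self.size, self.conf_max = size, conf_max

    def predict(self, pc):
        e = self.entries[pc % self.size]
        if e["conf"] == self.conf_max:
            return e["last"] + e["stride"]   # e.g. a loop counter i, i+1, ...
        return None

    def update(self, pc, value):
        e = self.entries[pc % self.size]
        stride = value - e["last"]
        if stride == e["stride"]:
            e["conf"] = min(e["conf"] + 1, self.conf_max)
        else:
            e["stride"], e["conf"] = stride, 0
        e["last"] = value
```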

  28. Finite Context Method predictors • Use the history of the last values produced by the instruction at a given PC • Analogy with a local-history branch predictor
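An order-2 FCM sketch in the same style (illustrative; real proposals hash the local history into a fixed-size second-level table):

```python
class FCMPredictor:
    """Order-2 finite context method: the instruction's last two
    values select the entry that remembers what came next."""
    def __init__(self, size=4096):
        self.history = [(0, 0)] * size   # per-PC local value history
        self.values = {}                 # context -> value that followed
        self.size = size

    def _context(self, pc):
        v1, v2 = self.history[pc % self.size]
        return (pc % self.size, v1, v2)

    def predict(self, pc):
        return self.values.get(self._context(pc))   # None if unseen

    def update(self, pc, value):
        self.values[self._context(pc)] = value       # train second level
        v1, v2 = self.history[pc % self.size]
        self.history[pc % self.size] = (v2, value)   # shift in new value
```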

  29. And a global value history? • Just makes no sense! • It would need the values of the last executed instructions: too late!! • But global branch history!?! • ITTAGE is the state-of-the-art indirect branch predictor!! • And it predicts values (branch targets)!

  30. VTAGE, derived from ITTAGE • [Diagram: a tagless base predictor indexed by the pc, plus tagged components indexed by hashes of the pc and global branch histories h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select among the 32-bit predictions] • The longest matching component provides the prediction
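A heavily simplified sketch of the longest-match idea behind VTAGE (illustrative only; the real predictor uses hashed indices and partial tags, confidence and usefulness counters, and selective allocation):

```python
class TinyVTAGE:
    def __init__(self, hist_lens=(4, 8, 16), base_size=1024):
        self.base = [0] * base_size                    # tagless, pc-indexed
        self.components = [dict() for _ in hist_lens]  # tagged components
        self.hist_lens = hist_lens
        self.base_size = base_size

    def _key(self, pc, ghist, length):
        # Stand-in for hashing the pc with `length` bits of global
        # branch history h[0:L].
        return (pc, tuple(ghist[-length:]))

    def predict(self, pc, ghist):
        """The matching component with the longest history wins;
        otherwise fall back to the tagless base predictor."""
        for i in reversed(range(len(self.hist_lens))):
            key = self._key(pc, ghist, self.hist_lens[i])
            if key in self.components[i]:
                return self.components[i][key]
        return self.base[pc % self.base_size]

    def update(self, pc, ghist, value):
        self.base[pc % self.base_size] = value
        for i, L in enumerate(self.hist_lens):
            self.components[i][self._key(pc, ghist, L)] = value
```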

  31. The repair issue on a misprediction • [Diagram: instructions I0, I1, I3, I4, I5 in flight when a value misprediction is detected]

  32. Pipeline squash • Acts as on an exception or a branch misprediction: everything after the mispredicted instruction is flushed • Very high penalty • [Diagram: I0, I1, I3, I4, I5 all squashed]

  33. Selective replay • Cancel all dependent instructions, but save the others • Very complex to implement: • Unbounded dependence chains • [Diagram: only the consumers of the mispredicted value among I0, I1, I3, I4, I5 are replayed]

  34. Critical path • The predicted value is needed late in the pipeline: • Dispatch time is sufficient • Except that:

  35. An FCM implementation issue • The predictor must use the last local values, including those still in flight: a speculative window tracks them • [Diagram: PC indexing through the speculative window] • This might be a critical path

  36. Critical path on the stride value predictor • The prediction (last value + stride, indexed by the PC) can be reused as the speculative last value on the next cycle • But the speculative last value must be high confidence • [Diagram: speculative window]

  37. Experiments • 8-way superscalar, deep pipeline • Use a prediction only on high confidence: • 3-bit saturating confidence counters • Reset on a misprediction

  38. Squashing

  39. Selective replay

  40. High confidence through probabilistic counters • Need for very high confidence: • 95% accuracy is insufficient, >> 99% is needed • TRADING ACCURACY AGAINST COVERAGE • The counter reaches saturation only through very-low-probability increments • e.g. 1/32, 1/256
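A sketch of such a probabilistic confidence counter (illustrative): incrementing only with probability p means a saturated counter implies on the order of conf_max/p consecutive correct predictions, pushing effective accuracy well beyond 99% at the cost of coverage.

```python
import random

class ProbabilisticConfidence:
    def __init__(self, p=1 / 32, conf_max=7):
        self.p, self.conf_max, self.conf = p, conf_max, 0

    def is_high_confidence(self):
        return self.conf == self.conf_max

    def on_correct(self):
        # Increment only with low probability: reaching saturation
        # then requires ~ conf_max / p correct predictions on average.
        if self.conf < self.conf_max and random.random() < self.p:
            self.conf += 1

    def on_wrong(self):
        self.conf = 0   # any misprediction resets confidence
```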

  41. Squashing

  42. And hybrids

  43. Current status • All value predictors are amenable to very high confidence • No complex selective repair needed • No need for local value prediction, hence no complex critical path in the value predictor

  44. Ongoing work: Selective Prediction of Predicated Instructions, with Nathanael Prémillieu

  45. Who cares about predicated instructions? • CMOV is in all ISAs • On ARM and Itanium, all instructions are predicated • For out-of-order execution: just a nightmare

  46. The multiple definition problem • Before renaming (mapping table): • I1: R1 ← R2, R3 (p) • I2: R4 ← R1, R2 • After renaming: • I1: P1 ← P15, P22 (p) • I2: P13 ← ???, P15 • If p is false, I1 does not write R1, so at rename time the mapping table cannot tell I2 which physical register holds R1

  47. Expansion/Serialization • After renaming: • I1a: P1 ← P15, P22 • I1b: P27 ← (p) ? P1 : P11 • I2: P13 ← P27, P15 • Create an extra instruction that selects between the new and the old value • Force the I1b → I2 dependency

  48. Aggressive serialization • I1: P18 ← (p) ? (op P15, P22) : P23 • I2: P13 ← P18, P15 • No expansion, but an extra operand on I1: • Complexity in the register file, issue logic, and bypass network • Force the I1 → I2 dependency

  49. Predicting the predicates • Use branch history, or branch + predicate history, to predict the predicates • Eliminates the multiple definitions • Predicate mispredictions become branch-like mispredictions
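A sketch of the idea (my own illustration, patterned on a gshare-style branch predictor): predict each predicate from the PC and a global branch + predicate history, rename as if the prediction were true, and repair like a branch misprediction when it is not.

```python
class PredicatePredictor:
    """Gshare-style: (PC xor global history) indexes a 2-bit counter."""
    def __init__(self, size=4096, hist_bits=16):
        self.counters = [1] * size   # 2-bit saturating counters
        self.ghist = 0               # global branch + predicate history
        self.size, self.hist_mask = size, (1 << hist_bits) - 1

    def _index(self, pc):
        return (pc ^ self.ghist) % self.size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # predicate true?

    def update(self, pc, outcome):
        i = self._index(pc)
        if outcome:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
        self.ghist = ((self.ghist << 1) | int(outcome)) & self.hist_mask
```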

  50. Not that convincing!
