This document explores architectural challenges in warehouse-scale computing as applications increasingly share resources. It emphasizes the impact of memory subsystem sharing, including the Last Level Cache (LLC) and memory bandwidth, on performance and quality of service (QoS). The discussion includes co-locating applications to improve resource utilization without sacrificing performance, addressing the inefficiencies present in traditional processors for scale-out workloads. Various scenarios demonstrate how thread-to-core mapping and memory usage affect overall performance in data center environments.
Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture
Overview • Motivation • Explore architectural issues as computing moves toward the cloud • Impact of sharing memory-subsystem resources (LLC, memory bandwidth, …) • Maximize resource utilization by co-locating applications without hurting QoS • Inefficiencies of traditional processors when running scale-out workloads
Impact of memory subsystem sharing • Motivation & problem definition • Machines are multi-core and multi-socket • For better utilization, applications should share the Last Level Cache (LLC) / Front Side Bus (FSB) → It is important to understand the memory-sharing interaction between (datacenter) applications
Impact of thread-to-core mapping • Sharing cache, separate FSBs (XX..XX..) • Sharing cache, sharing FSBs (XXXX....) • Separate caches, separate FSBs (X.X.X.X.)
Impact of thread-to-core mapping • Performance varies by up to 20% • Each application has a different trend • TTC behavior changes depending on the co-located application <CONTENT ANALYZER co-located with other applications>
Observation • Performance can swing significantly based purely on how application threads are mapped to cores • The best TTC mapping changes depending on the co-located program • Application characteristics that impact performance: memory bus usage, cache line sharing, cache footprint • Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → works better if it does not share the LLC and FSB • STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB
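The three mappings from the earlier slide can be written down as explicit core assignments. A minimal sketch, assuming an illustrative 8-core, dual-socket topology (adjacent core pairs 0-1, 2-3, … share a cache; cores 0-3 sit behind one socket's FSB and 4-7 behind the other) — the core numbering is an assumption for illustration, not taken from the paper:

```python
# Sketch of the three thread-to-core (TTC) mappings from the slides.
# Assumed topology (illustrative): 8 cores, adjacent pairs share a cache,
# cores 0-3 share one socket's FSB, cores 4-7 the other.
def ttc_mapping(policy: str) -> list[int]:
    """Return the cores assigned to four application threads."""
    mappings = {
        "share_cache_separate_fsb":    [0, 1, 4, 5],  # XX..XX..
        "share_cache_share_fsb":       [0, 1, 2, 3],  # XXXX....
        "separate_cache_separate_fsb": [0, 2, 4, 6],  # X.X.X.X.
    }
    return mappings[policy]
```

On Linux, each thread could then be pinned with `os.sched_setaffinity(0, {core})` before starting work; the up-to-20% performance swings come purely from changing this assignment.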
Increasing Utilization in Warehouse-Scale Computers via Co-location
Increasing Utilization via Co-location • Motivation • Cloud computing aims for higher resource utilization • However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization → Need precise prediction of shared-resource interference for better utilization without violating QoS <Google's web search QoS when co-located with other products>
Bubble-Up methodology • QoS sensitivity curve • Measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem • Bubble score • Measure the amount of pressure the application puts on a reporter <sensitivity curve for Bigtable> <sensitivity curve> <pressure score>
Better Utilization • Now we know • how QoS changes with bubble size (QoS sensitivity curve) • how much the application affects others (bubble score) • Can co-locate applications by estimating the resulting change in QoS <utilization improvement with search-render under each QoS target>
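The prediction step above can be sketched in Python: interpolate the victim's measured sensitivity curve at the co-runner's bubble score, and admit the co-location only if predicted QoS stays above the target. The curve values below are illustrative, not Google's measurements:

```python
import bisect

def predict_qos(sensitivity_curve, bubble_score):
    """sensitivity_curve: sorted list of (bubble_size, normalized_qos).
    Linearly interpolate QoS at the co-runner's bubble score."""
    sizes = [s for s, _ in sensitivity_curve]
    qos = [q for _, q in sensitivity_curve]
    if bubble_score <= sizes[0]:
        return qos[0]
    if bubble_score >= sizes[-1]:
        return qos[-1]
    i = bisect.bisect_left(sizes, bubble_score)
    x0, x1 = sizes[i - 1], sizes[i]
    y0, y1 = qos[i - 1], qos[i]
    return y0 + (y1 - y0) * (bubble_score - x0) / (x1 - x0)

def can_colocate(sensitivity_curve, bubble_score, qos_threshold=0.98):
    """Admit the co-location only if predicted QoS meets the target."""
    return predict_qos(sensitivity_curve, bubble_score) >= qos_threshold
```

A batch job with a small bubble score can safely share the machine with a latency-sensitive service; one with a large score gets rejected, which is how utilization rises without violating QoS.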
Scale-out workloads • Examples: • Data Serving • MapReduce • Media Streaming • SAT Solver • Web Frontend • Web Search
Execution-time breakdown • A major part of execution time is spent waiting for cache misses → a clear micro-architectural mismatch
Frontend inefficiencies • Cores sit idle due to high instruction-cache miss rates • L2 caches increase average I-fetch latency • Excessive LLC capacity leads to long I-fetch latency • How to improve? • Bring instructions closer to the cores
Core inefficiencies • Low instruction-level parallelism precludes effective use of the full core width • Low memory-level parallelism underutilizes reorder buffers and load-store queues • How to improve? • Run many threads together: a multi-threaded, multi-core architecture
Data-access inefficiencies • A large LLC consumes area but does not improve performance • Simple data prefetchers are ineffective • How to improve? • Shrink the LLC and use the area for more cores
Bandwidth inefficiencies • Lack of data sharing makes coherence and connectivity over-provisioned • Off-chip bandwidth exceeds the workloads' needs by an order of magnitude • How to improve? • Scale back the on-chip interconnect and off-chip memory bus to make room for more cores
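A back-of-envelope sketch of why off-chip bandwidth is over-provisioned: demand scales with the LLC miss rate, and for plausible scale-out numbers it lands an order of magnitude below what a multi-channel memory system supplies. All numbers below (MPKI, IPC, clock, core count) are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope off-chip bandwidth demand (illustrative numbers only).
def offchip_bandwidth_gbs(mpki: float, ipc: float, ghz: float,
                          cores: int, line_bytes: int = 64) -> float:
    """GB/s of memory traffic generated by LLC misses."""
    instrs_per_sec = ipc * ghz * 1e9 * cores
    misses_per_sec = instrs_per_sec * mpki / 1000.0
    return misses_per_sec * line_bytes / 1e9

# e.g. 5 MPKI, IPC 1.0, 2 GHz, 16 cores -> roughly 10 GB/s of demand,
# well below the tens-of-GB/s a modern multi-channel DDR system provides
demand = offchip_bandwidth_gbs(mpki=5, ipc=1.0, ghz=2.0, cores=16)
```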
Scale-out processors • So: the LLC, interconnect, and memory bus are all oversized, while there are not enough cores • Scale-out processors rebalance the chip and improve throughput by 5x-6.5x!
Datacenter Applications - Google's production applications
Key takeaways • TTC behavior is mostly determined by • Memory bus usage (for FSB sharing) • Data sharing: cache line sharing • Cache footprint: use last-level-cache misses to estimate footprint size • Example • CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → works better if it does not share the LLC and FSB • STITCH uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share the FSB with STITCH
1% prediction error on average <prediction accuracy for pairwise co-locations of Google applications>