This document explores architectural challenges in warehouse-scale computing as applications increasingly share resources. It emphasizes the impact of memory subsystem sharing, including the Last Level Cache (LLC) and memory bandwidth, on performance and quality of service (QoS). The discussion includes co-locating applications to improve resource utilization without sacrificing performance, addressing the inefficiencies present in traditional processors for scale-out workloads. Various scenarios demonstrate how thread-to-core mapping and memory usage affect overall performance in data center environments.
Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture
Overview • Motivation • Explore architectural issues as computing moves toward the cloud • Impact of sharing memory-subsystem resources (LLC, memory bandwidth, …) • Maximize resource utilization by co-locating applications without hurting QoS • Inefficiencies of traditional processors when running scale-out workloads
Impact of memory subsystem sharing • Motivation & problem definition • Machines are multi-core and multi-socket • For better utilization, applications should share the Last Level Cache (LLC) / Front Side Bus (FSB) → It is important to understand the memory-sharing interaction between (datacenter) applications
Impact of thread-to-core mapping • Sharing cache, separate FSBs (XX..XX..) • Sharing cache, sharing FSBs (XXXX....) • Separate caches, separate FSBs (X.X.X.X.)
Impact of thread-to-core mapping • Performance varies by up to 20% • Each application has a different trend • TTC behavior changes depending on the co-located application <CONTENT ANALYZER co-located with other applications>
Observation • Performance can swing significantly based purely on how application threads are mapped to cores • The best TTC mapping changes depending on the co-located program • Application characteristics that impact performance: memory bus usage, cache line sharing, cache footprint • Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → works better if it does not share the LLC and FSB • STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB
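The three mappings from the earlier slide can be written down as explicit core assignments. A minimal sketch, assuming an illustrative 8-core, dual-socket topology (adjacent core pairs 0-1, 2-3, … share a cache; cores 0-3 sit behind one socket's FSB and 4-7 behind the other) — the core numbering is an assumption for illustration, not taken from the paper:

```python
# Sketch of the three thread-to-core (TTC) mappings from the slides.
# Assumed topology (illustrative): 8 cores, adjacent pairs share a cache,
# cores 0-3 share one socket's FSB, cores 4-7 the other.
def ttc_mapping(policy: str) -> list[int]:
    """Return the cores assigned to four application threads."""
    mappings = {
        "share_cache_separate_fsb":    [0, 1, 4, 5],  # XX..XX..
        "share_cache_share_fsb":       [0, 1, 2, 3],  # XXXX....
        "separate_cache_separate_fsb": [0, 2, 4, 6],  # X.X.X.X.
    }
    return mappings[policy]
```

On Linux, each thread could then be pinned with `os.sched_setaffinity(0, {core})` before starting work; the up-to-20% performance swings come purely from changing this assignment.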
Increasing Utilization in Warehouse-Scale Computers via Co-location
Increasing Utilization via Co-location • Motivation • Cloud computing aims for higher resource utilization • However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization → Need precise prediction of shared-resource interference for better utilization without violating QoS <Google's web search QoS when co-located with other products>
Bubble-Up methodology • QoS sensitivity curve • Measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem • Bubble score • Measure the amount of pressure the application puts on a reporter <sensitivity curve for Bigtable> <sensitivity curve> <pressure score>
Better Utilization • Now we know • how QoS changes with bubble size (QoS sensitivity curve) • how much the application affects others (bubble score) • Can co-locate applications by estimating the resulting change in QoS <utilization improvement with search-render under each QoS target>
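The prediction step above can be sketched in Python: interpolate the victim's measured sensitivity curve at the co-runner's bubble score, and admit the co-location only if predicted QoS stays above the target. The curve values below are illustrative, not Google's measurements:

```python
import bisect

def predict_qos(sensitivity_curve, bubble_score):
    """sensitivity_curve: sorted list of (bubble_size, normalized_qos).
    Linearly interpolate QoS at the co-runner's bubble score."""
    sizes = [s for s, _ in sensitivity_curve]
    qos = [q for _, q in sensitivity_curve]
    if bubble_score <= sizes[0]:
        return qos[0]
    if bubble_score >= sizes[-1]:
        return qos[-1]
    i = bisect.bisect_left(sizes, bubble_score)
    x0, x1 = sizes[i - 1], sizes[i]
    y0, y1 = qos[i - 1], qos[i]
    return y0 + (y1 - y0) * (bubble_score - x0) / (x1 - x0)

def can_colocate(sensitivity_curve, bubble_score, qos_threshold=0.98):
    """Admit the co-location only if predicted QoS meets the target."""
    return predict_qos(sensitivity_curve, bubble_score) >= qos_threshold
```

A batch job with a small bubble score can safely share the machine with a latency-sensitive service; one with a large score gets rejected, which is how utilization rises without violating QoS.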
Scale-out workloads • Examples: • Data Serving • MapReduce • Media Streaming • SAT Solver • Web Frontend • Web Search
Execution-time breakdown • A major part of execution time is spent waiting for cache misses → a clear micro-architectural mismatch
Frontend inefficiencies • Cores sit idle due to high instruction-cache miss rates • L2 caches increase average I-fetch latency • Excessive LLC capacity leads to long I-fetch latency • How to improve? • Bring instructions closer to the cores
Core inefficiencies • Low instruction-level parallelism precludes effective use of the full core width • Low memory-level parallelism underutilizes reorder buffers and load-store queues • How to improve? • Run many threads together: a multi-threaded, multi-core architecture
Data-access inefficiencies • A large LLC consumes area but does not improve performance • Simple data prefetchers are ineffective • How to improve? • Shrink the LLC and use the area for more cores
Bandwidth inefficiencies • Lack of data sharing makes coherence and connectivity over-provisioned • Off-chip bandwidth exceeds the workloads' needs by an order of magnitude • How to improve? • Scale back the on-chip interconnect and off-chip memory bus to make room for more cores
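A back-of-envelope sketch of why off-chip bandwidth is over-provisioned: demand scales with the LLC miss rate, and for plausible scale-out numbers it lands an order of magnitude below what a multi-channel memory system supplies. All numbers below (MPKI, IPC, clock, core count) are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope off-chip bandwidth demand (illustrative numbers only).
def offchip_bandwidth_gbs(mpki: float, ipc: float, ghz: float,
                          cores: int, line_bytes: int = 64) -> float:
    """GB/s of memory traffic generated by LLC misses."""
    instrs_per_sec = ipc * ghz * 1e9 * cores
    misses_per_sec = instrs_per_sec * mpki / 1000.0
    return misses_per_sec * line_bytes / 1e9

# e.g. 5 MPKI, IPC 1.0, 2 GHz, 16 cores -> roughly 10 GB/s of demand,
# well below the tens-of-GB/s a modern multi-channel DDR system provides
demand = offchip_bandwidth_gbs(mpki=5, ipc=1.0, ghz=2.0, cores=16)
```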
Scale-out processors • So: the LLC, interconnect, and memory bus are all oversized, while there are not enough cores • Scale-out processors rebalance the chip and improve throughput by 5x-6.5x!
Datacenter Applications - Google's production applications
Key takeaways • TTC behavior is mostly determined by • Memory bus usage (for FSB sharing) • Data sharing: cache line sharing • Cache footprint: use last-level-cache misses to estimate footprint size • Example • CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → works better if it does not share the LLC and FSB • STITCH uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share the FSB with STITCH
1% prediction error on average <prediction accuracy for pairwise co-locations of Google applications>