Overview of Datacenter Architecture and Challenges in Google and Microsoft Environments

These lecture notes (CS 142) survey datacenter architecture: how servers are organized into racks and rows/clusters, and how DRAM and disk capacity and access latency change at each scale, from a single machine to an entire cluster. They also include slides on Sun, Google, and Microsoft containers, and list the failures a new cluster typically sees in its first year, including overheating, power distribution unit (PDU) failures, and network problems, with their approximate frequency and recovery time.

Presentation Transcript


  1. Google Datacenter CS 142 Lecture Notes: Datacenters

  2. Datacenter Organization
     Single server:
     • 4-8 cores
     • DRAM: 4-16 GB @ 100 ns
     • Disk: 2 TB @ 10 ms
     Rack:
     • 50 machines
     • DRAM: 200-800 GB @ 300 µs
     • Disk: 100 TB @ 10 ms
     Row/cluster:
     • 30+ racks
     • DRAM: 6-24 TB @ 500 µs
     • Disk: 3 PB @ 10 ms
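The rack and row/cluster figures on this slide follow from the per-server figures by multiplication. The sketch below reproduces that arithmetic. The variable names are illustrative, the cluster size is taken as exactly 30 racks (the slide says "30+", so the cluster totals are lower bounds), and decimal units (1 TB = 1000 GB) are assumed.

```python
# Per-server figures from the slide; names are illustrative, not from the notes.
SERVERS_PER_RACK = 50
RACKS_PER_CLUSTER = 30        # slide says "30+", so treat results as lower bounds
SERVER_DRAM_GB = (4, 16)      # low and high end of the per-server range
SERVER_DISK_TB = 2

# Rack totals: multiply per-server capacity by the number of machines.
rack_dram_gb = tuple(gb * SERVERS_PER_RACK for gb in SERVER_DRAM_GB)
rack_disk_tb = SERVER_DISK_TB * SERVERS_PER_RACK

# Cluster totals: multiply rack capacity by the number of racks
# and convert to the next unit up (assuming 1000 GB per TB, 1000 TB per PB).
cluster_dram_tb = tuple(gb * RACKS_PER_CLUSTER / 1000 for gb in rack_dram_gb)
cluster_disk_pb = rack_disk_tb * RACKS_PER_CLUSTER / 1000

# Access latencies from the slide, in nanoseconds.
LOCAL_DRAM_NS = 100
RACK_DRAM_NS = 300_000        # 300 µs
DISK_NS = 10_000_000          # 10 ms

# How much slower DRAM elsewhere in the rack is than local DRAM,
# and how much faster it still is than disk.
rack_dram_slowdown = RACK_DRAM_NS // LOCAL_DRAM_NS
disk_vs_rack_dram = DISK_NS / RACK_DRAM_NS

print("Rack DRAM (GB):", rack_dram_gb)
print("Rack disk (TB):", rack_disk_tb)
print("Cluster DRAM (TB):", cluster_dram_tb)
print("Cluster disk (PB):", cluster_disk_pb)
print("Rack DRAM vs local DRAM:", rack_dram_slowdown, "x slower")
print("Disk vs rack DRAM:", round(disk_vs_rack_dram, 1), "x slower")
```

The latency comparison shows that DRAM on another machine in the rack is far slower than local DRAM, yet still well over an order of magnitude faster than disk.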

  3. Sun Containers

  4. Sun Containers, cont'd

  5. Google Containers

  6. Microsoft Containers

  7. Microsoft Containers, cont'd

  8. Failures are Frequent
     Typical first year for a new cluster (Jeff Dean, Google):
     • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
     • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
     • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
     • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
     • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
     • ~5 racks go wonky (40-80 machines see 50% packet loss)
     • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
     • ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
     • ~3 router failures (have to immediately pull traffic for an hour)
     • ~dozens of minor 30-second blips for DNS
     • ~1000 individual machine failures
     • ~thousands of hard drive failures
     • Slow disks, bad memory, misconfigured machines, flaky machines, etc.
     • Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
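To get a feel for what these yearly counts mean in day-to-day operation, the sketch below converts a few of them into an average interval between events and a rough range of machine-hours lost. It assumes events are spread evenly over a 365-day year, which the slide does not claim, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24

# Approximate first-year event counts copied from the slide.
events_per_year = {
    "individual machine failure": 1000,
    "rack failure": 20,
    "router reload": 12,
    "network maintenance": 8,
    "router failure": 3,
}

def mean_hours_between(count):
    """Average gap between events, assuming they are spread evenly (an assumption)."""
    return HOURS_PER_YEAR / count

def machine_hours_lost(events, machines, hours):
    """Low and high estimate of machine-hours lost for one event type.

    machines and hours are (low, high) ranges taken from the slide.
    """
    low = events * machines[0] * hours[0]
    high = events * machines[1] * hours[1]
    return low, high

for name, count in events_per_year.items():
    print(f"{name}: one every ~{mean_hours_between(count):.1f} hours")

# ~20 rack failures, 40-80 machines each, 1-6 hours to get back.
rack_loss = machine_hours_lost(20, (40, 80), (1, 6))
# ~1 PDU failure, ~500-1000 machines, ~6 hours to come back.
pdu_loss = machine_hours_lost(1, (500, 1000), (6, 6))

print("Rack failures, machine-hours lost:", rack_loss)
print("PDU failure, machine-hours lost:", pdu_loss)
```

Under these assumptions an individual machine fails roughly every nine hours, which is why software running at this scale has to treat failure as a routine event rather than an exception.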

  9. How Many Datacenters?
     • 1-10 datacenter servers/human?
     • 100,000 servers/datacenter
     • 80-90% of general-purpose computing in datacenters?
     RAMCloud
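The slide's question can be turned into a back-of-the-envelope estimate by combining its two numeric bullets. The slide gives no population figure, so the 7 billion used below is an assumption added for illustration, as are the names in the code.

```python
# Assumption (not from the slide): a world population of about 7 billion.
WORLD_POPULATION = 7_000_000_000
# From the slide: 100,000 servers per datacenter.
SERVERS_PER_DATACENTER = 100_000

def datacenters_needed(servers_per_human):
    """Datacenters required if every person is served by this many servers."""
    total_servers = servers_per_human * WORLD_POPULATION
    return total_servers // SERVERS_PER_DATACENTER

# The slide suggests a range of 1-10 datacenter servers per human.
low = datacenters_needed(1)
high = datacenters_needed(10)

print(f"Estimated datacenters: {low:,} to {high:,}")
```

With these inputs the estimate spans one order of magnitude, mirroring the 1-10 servers/human range on the slide.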
