
Data Center Networks



  1. Data Center Networks CS 401/601 Computer Network Systems Mehmet Gunes Slides modified from: Mohammad Alizadeh, Albert Greenberg, Changhoon Kim, Srinivasan Seshan

  2. What are Data Centers? • Large facilities with 10s of thousands of networked servers • Compute, storage, and networking working in concert • “Warehouse-Scale Computers” • Huge investment: ~$0.5 billion for a large datacenter

  3. Data Center Costs • Source: Greenberg, Hamilton, Maltz, Patel. “The Cost of a Cloud: Research Problems in Data Center Networks.” SIGCOMM CCR, 2009. • *3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money

  4. Server Costs 30% utilization considered “good” in most data centers! • Uneven application fit • Each server has CPU, memory, disk: • most applications exhaust one resource, stranding the others • Uncertainty in demand • Demand for a new service can spike quickly • Risk management • Not having spare servers to meet demand brings failure just when success is at hand

  5. Goal: Agility – Any service, Any Server • Turn the servers into a single large fungible pool • Dynamically expand and contract service footprint as needed • Benefits • Lower cost (higher utilization) • Increase developer productivity • Achieve high performance and reliability

  6. Datacenter Networks • Provide the illusion of “One Big Switch” with 10,000s of ports • Interconnects compute and storage (disk, flash, …)

  7. Datacenter Traffic Growth • Today: Petabits/s of traffic inside one DC • More than the core of the Internet! • Source: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, SIGCOMM 2015.

  8. Latency is King • 1 user request → 1000s of messages over the DC network • Microseconds of latency matter • Even at the tail (e.g., 99.9th percentile) • Figure (omitted): a traditional application runs on a single machine with app logic next to its data structures (<< 1 µs latency); a large-scale web application (e.g., building a page that answers “Who does she know? What has she done?”) splits into an app tier and a data tier connected by the data center fabric (10 µs–1 ms latency) • Based on slide by John Ousterhout (Stanford)

  9. Datacenter Arms Race • Amazon, Google, Microsoft, Yahoo!, … race to build next-gen mega-datacenters • Industrial-scale Information Technology • 100,000+ servers • Located where land, water, fiber-optic connectivity, and cheap power are available

  10. Computers + Net + Storage + Power + Cooling

  11. DC Networks • L2 pros, cons? L3 pros, cons? • Figure (omitted): Internet → core routers (CR) → access routers (AR) at DC Layer 3, then Ethernet switches (S) at DC Layer 2 down to racks of app servers (A) • Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers • ~1,000 servers/pod == IP subnet • Reference: “Data Center: Load Balancing Data Center Services”, Cisco 2004

  12. Reminder: Layer 2 vs. Layer 3 • Ethernet switching (layer 2) • Fixed IP addresses and auto-configuration (plug & play) • Seamless mobility, migration, and failover • Broadcast limits scale (ARP) • No multipath (Spanning Tree Protocol) • IP routing (layer 3) • Scalability through hierarchical addressing • Multipath routing through equal-cost multipath • Can’t migrate w/o changing IP address • Complex configuration
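The equal-cost multipath point above is easy to picture as hash-based path selection. Below is a minimal, illustrative sketch: the uplink names are made up, and real switches hash the flow 5-tuple in hardware rather than with SHA-256.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow 5-tuple.

    Hashing keeps every packet of a flow on the same path (no reordering),
    while different flows spread across the available equal-cost paths.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Example: two flows to the same destination may take different uplinks.
paths = ["aggr-switch-1", "aggr-switch-2", "aggr-switch-3", "aggr-switch-4"]
print(ecmp_next_hop("10.0.1.5", "10.0.9.7", 52311, 80, "tcp", paths))
print(ecmp_next_hop("10.0.1.6", "10.0.9.7", 41200, 80, "tcp", paths))
```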

  13. Layer 2 vs. Layer 3 for Data Centers

  14. Data center networks • Load balancer: application-layer routing • Receives external client requests • Directs workload within the data center • Returns results to the external client (hiding data center internals from the client) • Figure (omitted): border router and access router at the Internet edge, load balancers in front of tier-1 switches, then tier-2 switches, TOR switches, and numbered server racks
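To make the load balancer’s role concrete, here is a toy round-robin dispatch sketch. The backend names and the rotation policy are illustrative assumptions, not anything specified in the slides.

```python
import itertools

class LoadBalancer:
    """Toy application-layer load balancer: rotate requests across backends."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)  # round-robin iterator

    def handle_request(self, request):
        backend = next(self._cycle)  # pick the next server in rotation
        # In a real data center the reply also flows back through the load
        # balancer, so the client never sees internal server addresses.
        return f"forwarded {request!r} to {backend}"

lb = LoadBalancer(["rack1-srv07", "rack2-srv03", "rack3-srv11"])
for req in ["GET /index", "GET /search?q=dc", "POST /cart"]:
    print(lb.handle_request(req))
```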

  15. Scaling a LAN • Self-learning Ethernet switches work great at small scales, but buckle at larger scales • Broadcast overhead of self-learning is linear in the total number of interfaces • Broadcast storms possible in non-tree topologies • Goals • Scalability to a very large number of machines • Isolation of unwanted traffic from unrelated subnets • Ability to accommodate general types of workloads (Web, database, MapReduce, scientific computing, etc.)

  16. Data center networks • Rich interconnection among switches and racks: • Increased throughput between racks (multiple routing paths possible) • Increased reliability via redundancy • Figure (omitted): tier-1 switches, tier-2 switches, TOR switches, and numbered server racks, with multiple paths between tiers

  17. Broad questions • How are massive numbers of commodity machines networked inside a data center? • Virtualization: How to effectively manage physical machine resources across client virtual machines? • Operational costs: • Server equipment • Power and cooling

  18. Data Center Network

  19.–22. Hierarchical Addresses (figure-only slide sequence; the addressing diagrams are not included in the transcript)

  23. PortLand: Location Discovery Protocol • Location Discovery Messages (LDMs) exchanged between neighboring switches • Switches self-discover location on boot up
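A rough sketch of what a Location Discovery Message and a bootstrapping check might look like. The field names and the edge-detection rule below are illustrative simplifications in the spirit of PortLand, not its exact message format or algorithm.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LDM:
    """Illustrative Location Discovery Message; the real PortLand LDM also
    carries an up/down direction and is exchanged periodically on every port."""
    switch_id: str
    pod: Optional[int]       # None until discovered
    position: Optional[int]  # position within the pod
    level: Optional[int]     # 0 = edge, 1 = aggregation, 2 = core

def infer_is_edge(num_ports: int, ldms_by_port: dict) -> bool:
    """Illustrative rule: hosts never send LDMs, so a switch that hears LDMs
    on at most half of its ports concludes its silent ports face hosts and
    that it is an edge switch. This mirrors the spirit of PortLand's
    bootstrapping, not its exact algorithm."""
    return len(ldms_by_port) <= num_ports // 2

# A freshly booted k = 4 switch hears LDMs only on its two uplink ports:
heard = {2: LDM("sw-agg-0", None, None, None),
         3: LDM("sw-agg-1", None, None, None)}
print(infer_is_edge(num_ports=4, ldms_by_port=heard))  # -> True (edge switch)
```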

  24. Data Center Packet Transport • Large purpose-built DCs • Huge investment: • R&D • business • Transport inside the DC • TCP rules • 99.9% of traffic

  25. TCP in the Data Center • TCP does not meet the demands of apps • Suffers from bursty packet drops, incast, … • Builds up large queues: • Adds significant latency • Wastes precious buffers, esp. bad with shallow-buffered switches • Operators work around TCP problems • Ad-hoc, inefficient, often expensive solutions • No solid understanding of consequences, tradeoffs

  26. Partition/Aggregate Application Structure • Time is money • Strict deadlines (SLAs) • Missed deadline → lower-quality result • Figure (omitted): a top-level aggregator (TLA, deadline = 250 ms) fans a “Picasso” query out to mid-level aggregators (MLAs, deadline = 50 ms), which fan out to worker nodes (deadline = 10 ms); the workers return partial lists of Picasso quotations (“Art is a lie that makes us realize the truth”, “The chief enemy of creativity is good sense”, …) that are merged on the way back up

  27. Generality of Partition/Aggregate • The foundation for many large-scale web applications • Web search, social network composition, ad selection, etc. • Example: Facebook • Partition/Aggregate ~ Multiget • Aggregators: web servers • Workers: memcached servers • Figure (omitted): clients on the Internet hit the web servers, which fan out over the memcached protocol to the memcached servers
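A minimal scatter-gather sketch of the partition/aggregate pattern, assuming in-process worker functions and a single per-level deadline. Everything here is illustrative; a real deployment would issue memcached multigets over the network.

```python
from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                TimeoutError as FuturesTimeout)

def worker(shard_id: int, query: str) -> str:
    """Stand-in for a worker/memcached lookup against one data shard."""
    return f"shard{shard_id}: partial result for {query!r}"

def aggregate(query: str, num_shards: int = 8, deadline_s: float = 0.05) -> list:
    """Fan the query out to every shard, then merge whatever returns in time.
    Responses that miss the deadline are dropped (a lower-quality result)."""
    pool = ThreadPoolExecutor(max_workers=num_shards)
    futures = [pool.submit(worker, i, query) for i in range(num_shards)]
    results = []
    try:
        for fut in as_completed(futures, timeout=deadline_s):
            results.append(fut.result())
    except FuturesTimeout:
        pass  # deadline hit: return only the on-time partial results
    pool.shutdown(wait=False, cancel_futures=True)  # don't wait for stragglers
    return results

print(len(aggregate("picasso quotes")), "of 8 shards answered before the deadline")
```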

  28. Workloads • Partition/Aggregate (query traffic) — delay-sensitive • Short messages [50 KB–1 MB] (coordination, control state) — delay-sensitive • Large flows [1 MB–50 MB] (data update) — throughput-sensitive

  29. Tension Between Requirements • Goals: high throughput, low latency, high burst tolerance • Deep buffers: queuing delays increase latency • Shallow buffers: bad for bursts & throughput • AQM (RED): average queue not fast enough for incast • Reduced RTOmin: doesn’t help latency • Objective: low queue occupancy & high throughput

  30. Review: The TCP/ECN Control Loop • ECN = Explicit Congestion Notification • Figure (omitted): Sender 1 and Sender 2 share a switch queue toward the Receiver; the switch signals congestion with a 1-bit ECN mark that the senders react to

  31. Two Key Ideas • React in proportion to the extent of congestion, not its presence • Reduces variance in sending rates, lowering queuing requirements • Mark based on instantaneous queue length. • Fast feedback to better deal with bursts.
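These two ideas are what DCTCP (shown on the next slides) builds on. The sketch below shows the DCTCP control law from the DCTCP paper in simplified form: the switch marks on instantaneous queue length above a threshold K, and the sender cuts its window in proportion to the smoothed fraction of marked packets. The specific values of K and g here are illustrative.

```python
# Simplified sketch of DCTCP marking and window adjustment.
# Control law from the DCTCP paper:
#   alpha <- (1 - g) * alpha + g * F        (F = fraction of marked packets)
#   cwnd  <- cwnd * (1 - alpha / 2)          when marks were seen
K = 30 * 1024    # switch marking threshold in bytes (instantaneous queue)
g = 1.0 / 16     # EWMA gain for the marked-fraction estimate

def switch_should_mark(queue_bytes: int) -> bool:
    """Mark based on instantaneous queue length, not an average."""
    return queue_bytes > K

class DctcpSender:
    def __init__(self, cwnd: float):
        self.cwnd = cwnd    # congestion window (packets)
        self.alpha = 0.0    # running estimate of the fraction of marked packets

    def on_window_acked(self, acked: int, marked: int):
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - g) * self.alpha + g * frac   # smooth the estimate
        if marked:
            # React in proportion to the extent of congestion, not its presence:
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1  # standard additive increase per window

s = DctcpSender(cwnd=100)
s.on_window_acked(acked=100, marked=20)   # 20% of the window was ECN-marked
print(round(s.cwnd, 1), round(s.alpha, 3))
```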

  32. DCTCP in Action • Figure (omitted): queue length in KBytes over time • Setup: Windows 7 hosts, Broadcom 1 Gbps switch • Scenario: 2 long-lived flows, K = 30 KB

  33. Why it Works • High Burst Tolerance • Large buffer headroom → bursts fit • Aggressive marking → sources react before packets are dropped • Low Latency • Small buffer occupancies → low queuing delay • High Throughput • ECN averaging → smooth rate adjustments, low variance

  34. Current solutions for increasing data center network bandwidth • Examples: FatTree, BCube • Drawbacks: 1. Hard to construct 2. Hard to expand

  35. Fat-Tree • Inter-connect racks (of servers) using a fat-tree topology • Fat-tree: a special type of Clos network (after C. Clos) • K-ary fat tree: three-layer topology (edge, aggregation, and core) • Each pod consists of (k/2)^2 servers & 2 layers of k/2 k-port switches • Each edge switch connects to k/2 servers & k/2 aggregation switches • Each aggregation switch connects to k/2 edge & k/2 core switches • (k/2)^2 core switches: each connects to k pods
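A quick sketch that turns these counts into code so the scaling relationships are easy to check. This is pure arithmetic from the slide; nothing here is specific to any real deployment.

```python
def fat_tree_sizes(k: int) -> dict:
    """Counts for a k-ary fat tree built from k-port switches (k must be even)."""
    assert k % 2 == 0, "k-ary fat trees use k/2 splits, so k must be even"
    half = k // 2
    return {
        "pods": k,
        "servers_per_pod": half * half,     # (k/2)^2
        "edge_switches": k * half,          # k/2 per pod
        "aggregation_switches": k * half,   # k/2 per pod
        "core_switches": half * half,       # (k/2)^2
        "total_servers": k ** 3 // 4,       # k^3 / 4
    }

print(fat_tree_sizes(4))   # small example: 16 servers (as on slide 36)
print(fat_tree_sizes(6))   # 54 servers, the example on slide 37
print(fat_tree_sizes(48))  # commodity 48-port switches: 27,648 servers
```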

  36. Fat-Tree Fat-tree with K=4

  37. Why Fat-Tree? • Fat tree has identical bandwidth at any bisection • Each layer has the same aggregate bandwidth • Can be built using cheap devices with uniform capacity • Each port supports the same speed as an end host • All devices can transmit at line speed if packets are distributed uniformly along available paths • Great scalability: k-port switch supports k^3/4 servers • Figure (omitted): fat-tree network with k = 6 supporting 54 hosts
