Introduction

Readings • The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, L. Barroso and U. Hölzle

Introduction • Increasingly, applications are moving from the PC to the Internet, e.g., • Email – Gmail, Yahoo • Photo management – Picasa, Kodak, Sutterbug • Word processing – Google Apps • Why? • Less work for the user • Potentially lower cost for the user

Introduction • Supporting this move from the PC to the “Internet” requires a large number of servers, storage, network support, etc. • Companies like Amazon, Google, and eBay run data centers with tens of thousands of machines • For users to trust these systems, a number of issues must be addressed, e.g., failure handling

Architecture

Architecture • Common elements include • Low-end servers, typically in a blade enclosure within a rack • Servers within a rack are interconnected by a local Ethernet switch (rack switch) • The rack switch has a number of uplink connections to one or more cluster-level (data-center-level) Ethernet switches

Storage • Disks can be connected directly to each server and managed by a global distributed file system (e.g., Google’s GFS); or • Disks can be part of Network Attached Storage (NAS) devices that are connected directly to the cluster-level switch

Storage • NAS • Reliability is provided by the device through replication and error-correcting codes • Disks attached to server nodes • Requires a fault-tolerant file system at the cluster level, which is not trivial to implement • Writes are slower • Potentially lower cost than NAS • Disks can be the same kind found in a PC

Storage Hierarchy

Networking Fabric • Tradeoffs between speed, scale, and cost • Intra-rack connectivity is relatively inexpensive to achieve • Network switches with high port counts have a different price structure than switches used for rack connectivity • Much more expensive • Network switches with few ports require programmers to be aware of the scarce bandwidth

Latency, Bandwidth, Capacity • It is much faster for an application to retrieve data from local disks than from off-rack disks, but • Applications often need more storage than a local disk provides (e.g., Google search) • How is this dealt with efficiently?

Power Usage • Peak power usage measured at one of Google’s data centers: • Networking 5% • CPUs 33% • Disks 10% • DRAM 30% • Other 22%

Handling Failures • The high number of components almost guarantees failures • Disk drives can exhibit annualized failure rates higher than 4% • Many restarts are needed • This issue has received a good deal of attention
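To see why failures are almost guaranteed at this scale, here is a rough back-of-the-envelope sketch. It assumes independent disk failures at a constant 4% annualized failure rate; the fleet size (10,000 servers with 4 disks each) and the function name are illustrative assumptions, not figures from the reading.

```python
def expected_disk_failures(num_disks: int, afr: float, days: float = 365.0) -> float:
    """Expected number of disk failures over `days`.

    Assumes failures are independent and occur at a constant
    annualized failure rate (AFR), which is a simplification.
    """
    return num_disks * afr * (days / 365.0)


# Assumed fleet for illustration only: 10,000 servers, 4 disks each.
disks = 10_000 * 4
per_year = expected_disk_failures(disks, 0.04)
per_day = expected_disk_failures(disks, 0.04, days=1)
print(f"{per_year:.0f} expected failures per year, about {per_day:.1f} per day")
```

Under these assumptions the fleet sees roughly 1,600 disk failures a year, or several every day, so failure handling has to be routine and automated rather than exceptional.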

Request Handling • With so many disks, how is data placed so that it can be found? • Let’s look at Amazon • Partition the data so that groups of servers handle just a part of the inventory (or any other data) • The router needs to be able to extract keys from each request • Hashing is one strategy for doing this • Based on the key, the router then determines which server handles the request
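The hash-based routing described above can be sketched as follows. This is a minimal illustration, assuming a fixed list of servers and string keys; the server names and the `route` function are hypothetical and do not describe Amazon's actual implementation.

```python
import hashlib

# Hypothetical server groups; each one handles one partition of the data.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]


def route(key: str, servers=SERVERS) -> str:
    """Map a request key to the server responsible for it."""
    # Use a stable hash: Python's built-in hash() for strings is salted
    # per process, so it would route differently after a restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# The same key always lands on the same server.
print(route("item-42"))
```

One caveat with plain modulo hashing is that changing the number of servers remaps most keys. Schemes such as consistent hashing are a common way to reduce that movement.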

Online Evolution • Internet-time implies constant change • Need acceptable quality • Three approaches to managing upgrades • Fast reboot: cluster at a time • Minimize yield impact • Rolling upgrade: node at a time • Versions must be compatible • Big flip: half the cluster at a time • Reserved for complex changes • Either way: use a staging area and be prepared to revert
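The three upgrade approaches above differ mainly in how many nodes are offline at once. The sketch below models only that aspect; the function names and the 8-node cluster are illustrative assumptions, and it ignores version compatibility, staging, and rollback.

```python
def upgrade_batches(nodes, strategy):
    """Return the batches of nodes taken offline in turn for an upgrade."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("cluster must contain at least one node")
    if strategy == "fast_reboot":
        # Whole cluster at once.
        return [nodes]
    if strategy == "rolling":
        # One node at a time.
        return [[n] for n in nodes]
    if strategy == "big_flip":
        # Half the cluster at a time.
        half = len(nodes) // 2
        return [nodes[:half], nodes[half:]]
    raise ValueError(f"unknown strategy: {strategy}")


def min_capacity(nodes, strategy):
    """Smallest fraction of nodes still serving at any point in the upgrade."""
    nodes = list(nodes)
    largest = max(len(batch) for batch in upgrade_batches(nodes, strategy))
    return 1 - largest / len(nodes)


cluster = [f"node-{i}" for i in range(8)]
for strategy in ("fast_reboot", "rolling", "big_flip"):
    print(strategy, len(upgrade_batches(cluster, strategy)), min_capacity(cluster, strategy))
```

In this model a rolling upgrade keeps the most capacity online but takes the most steps and runs old and new versions side by side, which is why the versions must be compatible.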

Summary • We have briefly discussed a high-level view of data centers • In this course we will discuss how Google, Amazon, etc. deal with some of the implications of these architectures
