
Data Management in the Cloud



Presentation Transcript


  1. Data Management in the Cloud Paul Szerlip

  2. The rise of data • Think about this • For the past two decades, the largest generator of data was humans; now it's our devices • Cheap sensors • Cellphones are packed with sensors • Images, video, audio, etc. • Expensive sensors • DZero, a high-energy physics experiment, generates 1 TB a day • How do you deal with that much data? [1,2]

  3. Data in the cloud • Storing the data • Bigtable, S3, NoSQL, etc • Processing the data • MapReduce, Hadoop, etc

  4. Good data management in the cloud • Availability • Accessible in cases of partial network failure or datacenter failure • Scalability • Support for massive database sizes - spread across many servers • Elasticity • Scaling up and scaling down • Performance • Efficient system storage utilization • Multitenancy • Many applications on the same hardware

  5. Good data management (continued) • Load and Tenant Balancing • Moving load between servers • Fault Tolerance • Tolerating network or hardware failures • Running in a heterogeneous environment • Dealing with hardware degradation • Flexible query interface • Supporting both SQL and non-SQL query languages

  6. Overarching Themes • Frustration with ACID on the cloud • (Atomicity, consistency, isolation, durability) • Hard to maintain ACID guarantees with data replication over large geographic distances [1] • CAP: Consistency, Availability, Partition tolerance; choose 2 • Rise of NoSQL (a misnomer) [2] • Eventual consistency can be acceptable; some ACID properties are relaxed or left to application developers

  7. Investigating 3 Systems • Bigtable (Google) • And a quick look at MapReduce • Amazon: S3 and SimpleDB • Open source NoSQL alternatives: • Cassandra (key-value) • MongoDB (document)

  8. Bigtable • Distributed storage designed to scale to petabyte-size databases spread across thousands of servers [1] • Used extensively by Google • Not fully relational • "Sparse, distributed, persistent multidimensional sorted map" [1] • Uses Google File System (GFS) under the hood • Indexed by row key • Tablet = range of row keys, used for load balancing
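
The "sparse, distributed, persistent multidimensional sorted map" on this slide can be pictured with a small in-memory sketch. This is only an illustration: `ToyTable`, its method names, and the fixed `rows_per_tablet` split are assumptions made for the example, not Bigtable's API, and nothing here is distributed or persistent.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for a sparse map sorted by row key."""

    def __init__(self):
        # (row_key, column) -> list of (timestamp, value), newest first.
        # Cells that were never written simply do not exist (sparse).
        self._cells = defaultdict(list)
        self._row_keys = []  # kept sorted so row ranges are easy to cut

    def put(self, row_key, column, timestamp, value):
        i = bisect.bisect_left(self._row_keys, row_key)
        if i == len(self._row_keys) or self._row_keys[i] != row_key:
            self._row_keys.insert(i, row_key)
        versions = self._cells[(row_key, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if it was never set."""
        versions = self._cells.get((row_key, column))
        return versions[0][1] if versions else None

    def tablets(self, rows_per_tablet):
        """Cut the sorted row keys into contiguous (first, last) ranges."""
        keys = self._row_keys
        return [
            (keys[i], keys[min(i + rows_per_tablet, len(keys)) - 1])
            for i in range(0, len(keys), rows_per_tablet)
        ]


table = ToyTable()
table.put("com.cnn.www", "contents:", 1, "<html>v1")
table.put("com.cnn.www", "contents:", 2, "<html>v2")
table.put("com.example", "anchor:cnn", 1, "CNN")
table.put("org.apache", "contents:", 1, "<html>")
print(table.get("com.cnn.www", "contents:"))
print(table.tablets(rows_per_tablet=2))
```

Because the row keys stay sorted, each tablet is a contiguous key range, which is what makes it a convenient unit to move between servers for load balancing.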

  9. Bigtable Diagram [2]

  10. Bigtable • GFS • Stores the SSTable files • SSTable • Provides a persistent, immutable, ordered map • Chubby provides the locking mechanism • Ensures a single master • Stores the location of Bigtable data • Stores schema information and access control lists • Each Bigtable cluster has one master and many tablet servers • Master assigns tablets to tablet servers dynamically, based on server load • Tablet servers handle reads and writes
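
One way to picture a master assigning tablets according to server load is a simple greedy heuristic. The sketch below is an assumption for illustration only: the function name `assign_tablets`, the numeric load figures, and the greedy policy itself are hypothetical and are not taken from Bigtable.

```python
import heapq


def assign_tablets(tablet_loads, servers):
    """Greedy sketch: place the heaviest tablet on the least-loaded server."""
    # Min-heap of (total load so far, server name).
    heap = [(0, server) for server in sorted(servers)]
    heapq.heapify(heap)
    assignment = {server: [] for server in servers}
    # Heaviest tablets first; ties broken by tablet name for determinism.
    ordered = sorted(tablet_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for tablet, load in ordered:
        total, server = heapq.heappop(heap)
        assignment[server].append(tablet)
        heapq.heappush(heap, (total + load, server))
    return assignment


loads = {"t1": 50, "t2": 30, "t3": 20, "t4": 10}  # hypothetical load units
plan = assign_tablets(loads, ["ts-a", "ts-b"])
print(plan)
```

A real master would also have to react to servers joining, failing, or becoming overloaded over time; this sketch only computes a single static placement.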

  11. MapReduce • Introduced by Google in 2004 [1] • Often used to operate on Bigtable data [1] • A means to process large amounts of data in a distributed environment in a highly parallelized manner

  12. MapReduce Steps • Input files are split into M pieces, and multiple copies of the program are started on the cluster • One copy is the master; it assigns M map tasks and R reduce tasks to idle workers • A map worker reads its split and passes the contents to the map function; results are buffered in memory • Buffered results are written to local disk periodically, partitioned into R regions by a partitioning function; the locations are passed to the master

  13. MapReduce (continued) • A reduce worker is notified of these locations, reads the buffered data from the map workers, and sorts it so that identical keys are grouped together • The reduce worker passes each key and its intermediate values to the Reduce function; the output is appended to a final output file • After all map and reduce tasks have completed, the master wakes up the user program
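
The steps above can be sketched as a single-process word count. This is a simulation under stated assumptions, not Google's implementation: there is no master, no worker pool, and no disk. The names `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are hypothetical, and the byte-sum partitioner is a deterministic stand-in for a hash of the key modulo R.

```python
from collections import defaultdict


def map_fn(_name, text):
    """Map: emit (word, 1) for every word in one input split."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: combine all intermediate values for one key."""
    return key, sum(values)


def partition(key, R):
    """Partitioning function: choose one of R regions for a key."""
    return sum(key.encode()) % R


def run_mapreduce(splits, R):
    # Map phase: one map task per split; intermediate pairs are
    # partitioned into R regions (dicts stand in for local disk files).
    regions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for key, value in map_fn(name, text):
            regions[partition(key, R)][key].append(value)

    # Reduce phase: each reduce task sorts its region by key so equal
    # keys are grouped, then calls the reduce function once per key.
    output = []
    for region in regions:
        for key in sorted(region):
            output.append(reduce_fn(key, region[key]))
    return output


splits = {
    "a.txt": "the cloud stores data",
    "b.txt": "the cloud processes the data",
}
print(run_mapreduce(splits, R=2))
```

Every occurrence of a given key lands in the same region, so each reduce task sees all values for the keys it owns, which is the property the partitioning function exists to guarantee.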

  14. S3 - Simple Storage Service • "Infinite" store for objects of variable size [1] • Organized in 2 levels • Buckets • Like folders, you can save any number of objects in them • Objects • Byte container (up to 5 GB) and metadata (up to 2 KB) • Limited search • Within a single bucket, by object name only
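
The two-level bucket/object layout can be modeled in a few lines. The class below is a toy, in-memory sketch and not the S3 API: `ToyObjectStore` and its method names are hypothetical, the size limits are simply the figures quoted on the slide, and matching names by prefix is an assumption about what "name only" search looks like.

```python
MAX_OBJECT_BYTES = 5 * 1024**3  # 5 GB, the figure quoted on the slide
MAX_METADATA_BYTES = 2 * 1024   # 2 KB, the figure quoted on the slide


class ToyObjectStore:
    """Hypothetical in-memory model: buckets hold named objects."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {object name: (data, metadata)}

    def create_bucket(self, bucket):
        self._buckets.setdefault(bucket, {})

    def put_object(self, bucket, name, data, metadata=None):
        metadata = metadata or {}
        meta_size = sum(
            len(k.encode()) + len(v.encode()) for k, v in metadata.items()
        )
        if len(data) > MAX_OBJECT_BYTES:
            raise ValueError("object too large")
        if meta_size > MAX_METADATA_BYTES:
            raise ValueError("metadata too large")
        self._buckets[bucket][name] = (bytes(data), dict(metadata))

    def get_object(self, bucket, name):
        return self._buckets[bucket][name]

    def list_objects(self, bucket, prefix=""):
        """Search is limited to one bucket and to object names."""
        return sorted(n for n in self._buckets[bucket] if n.startswith(prefix))


store = ToyObjectStore()
store.create_bucket("sensor-data")
store.put_object("sensor-data", "logs/2011-01.txt", b"...", {"owner": "lab"})
store.put_object("sensor-data", "logs/2011-02.txt", b"...")
store.put_object("sensor-data", "img/cat.png", b"\x89PNG")
print(store.list_objects("sensor-data", prefix="logs/"))
```

There is no way in this model to search by content or metadata, which mirrors the "limited search" point: anything richer has to be indexed by the application.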

  15. SimpleDB • Organized into domains (tables) where you can insert data, get data, or run queries [1] • Each domain has items which are described by attribute name/value pairs • No schema • API access • CreateDomain, DeleteDomain, PutAttributes, DeleteAttributes, GetAttributes, and Select • Meant for fast reads • Keeps multiple copies of the domains
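
The domain/item/attribute model can be sketched in memory as follows. This is not the SimpleDB client API: `ToySimpleDB` is a hypothetical class whose methods only echo the operation names listed on the slide, and `select` takes equality filters as keyword arguments rather than a query string, which is a simplification made for this example.

```python
from collections import defaultdict


class ToySimpleDB:
    """Hypothetical model: domains hold items; items are attribute dicts."""

    def __init__(self):
        self._domains = {}

    def create_domain(self, domain):
        self._domains.setdefault(domain, defaultdict(dict))

    def delete_domain(self, domain):
        self._domains.pop(domain, None)

    def put_attributes(self, domain, item, attributes):
        # No schema: any item may carry any attribute names.
        self._domains[domain][item].update(attributes)

    def get_attributes(self, domain, item):
        return dict(self._domains[domain].get(item, {}))

    def delete_attributes(self, domain, item, names):
        attrs = self._domains[domain].get(item, {})
        for name in names:
            attrs.pop(name, None)

    def select(self, domain, **equals):
        """Return names of items whose attributes match every filter."""
        return sorted(
            item
            for item, attrs in self._domains[domain].items()
            if all(attrs.get(k) == v for k, v in equals.items())
        )


db = ToySimpleDB()
db.create_domain("products")
db.put_attributes("products", "item1", {"color": "red", "size": "M"})
db.put_attributes("products", "item2", {"color": "red", "weight": "2kg"})
db.put_attributes("products", "item3", {"color": "blue"})
print(db.select("products", color="red"))
```

Note that `item1` and `item2` carry different attribute names within the same domain, which is what "no schema" means in practice.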

  16. NoSQL • What does this mean? • More about relaxing ACID than being "No" SQL [2] • Lots of open source NoSQL systems • Zynga was big on NoSQL • Why use them? • Excellent elasticity • Flexible data models, often schema-less • Cheap relative to an RDBMS (if you have lots of frequent, small writes)

  17. Types of NoSQL • Key-value • Redis, Cassandra, etc. • Document store • CouchDB, MongoDB, etc. • Graph databases, object stores • Won't go into these much

  18. Cassandra • Highly scalable, eventually consistent, distributed, structured, key-value store [1] • Open sourced by Facebook (2008) [1] • ColumnFamily based • Column is a tuple of {key, value, timestamp} • ColumnFamilies contain many columns, all referenced by row-key • Kind of like a hybrid of Dynamo and Bigtable [1]
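
The `{key, value, timestamp}` column tuple suggests how replicas that have drifted apart can be brought back together: for each column, keep the version with the newest timestamp. The sketch below illustrates that last-write-wins idea only; `Column` and `reconcile` are hypothetical names, and this is not Cassandra's client API or its internal read-repair code.

```python
from typing import NamedTuple


class Column(NamedTuple):
    key: str        # column name
    value: str
    timestamp: int  # supplied by the writer


def reconcile(*replicas):
    """Merge copies of one row; per column, the highest timestamp wins."""
    merged = {}
    for replica in replicas:
        for col in replica:
            current = merged.get(col.key)
            if current is None or col.timestamp > current.timestamp:
                merged[col.key] = col
    return merged


# Two replicas of the same row: replica_b received a later write to "email".
replica_a = [Column("email", "old@example.com", 1), Column("name", "Ann", 1)]
replica_b = [Column("email", "new@example.com", 2)]

row = reconcile(replica_a, replica_b)
print({name: col.value for name, col in row.items()})
```

Until such a merge happens, a read served by `replica_a` would still return the old email, which is the window that "eventually consistent" refers to.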

  19. MongoDB • Document-oriented • High read/write throughput • High availability • Scalability • Flexible query language

  20. References • [1] Sakr, S., Liu, A., Batista, D.M., Alomari, M. A Survey of Large Scale Data Management Approaches in Cloud Environments. IEEE Communications Surveys & Tutorials, 2011. • [2] Cloud Computing: Theory and Practice (our lecture notes). • [3] MongoDB documentation, Introduction. http://www.mongodb.org/display/DOCS/Introduction
