
Data-Intensive Cloud Computing

Randal E. Bryant, Carnegie Mellon University. http://www.cs.cmu.edu/~bryant. "I've got terabytes of data. Tell me what they mean." Very large, shared data repository; complex analysis; data-intensive scalable computing (DISC).




Presentation Transcript


  1. Data-Intensive Cloud Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant

  2. Varieties of Clouds • Data-intensive scalable computing (DISC): "I've got terabytes of data. Tell me what they mean." • Very large, shared data repository • Complex analysis • Hosted services: "I don't want to be a system administrator. You handle my data & applications." • Documents, web-based email, etc. • Can access from anywhere • Easy sharing and collaboration

  3. Examples of Big Data Sources • Wal-Mart • 267 million items/day, sold across 6,000 stores • HP building them a 4 PB data warehouse • Mine data to manage supply chain, understand market trends, formulate pricing strategies • Sloan Digital Sky Survey • New Mexico telescope captures 200 GB of image data / day • Latest dataset release: 10 TB, 287 million celestial objects • SkyServer provides SQL access • Next-generation LSST even bigger

  4. Role of Computers in Scientific Research • Simulation • Given: Hypothetical model of system • Determine: What behaviors does it predict • Requires: Lots of computer cycles • e.g., supercomputers (and perhaps grids) • Analysis • Given: Large amounts of data • Measured, or generated by simulations • Determine: What phenomena are present • Requires: Lots of colocated processing & storage • e.g., DISC (and perhaps clouds)

  5. Our Data-Driven World • Science • Data bases from astronomy, genomics, natural languages, seismic modeling, … • Humanities • Scanned books, historic documents, … • Commerce • Corporate sales, stock market transactions, census, airline traffic, … • Entertainment • Internet images, Hollywood movies, MP3 files, … • Medicine • MRI & CT scans, patient records, …

  6. Why So Much Data? • We Can Get It • Automation + Internet • We Can Keep It • Seagate 1 TB Barracuda @ $199 (20¢ / GB) • We Can Use It • Scientific breakthroughs • Business process efficiencies • Realistic special effects • Better health care • Could We Do More? • Apply more computing power to this data

  7. Oceans of Data, Skinny Pipes • 1 Terabyte • Easy to store • Hard to move
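The "easy to store, hard to move" point can be made concrete with a back-of-envelope transfer-time calculation. The link speeds below are illustrative assumptions, not figures from the slides:

```python
# Back-of-envelope: time to move 1 TB over network links of various speeds.
# Link speeds are assumed, representative values for illustration only.
TB = 10**12  # bytes in a terabyte (decimal)

links_mbps = {
    "home DSL (~10 Mb/s)": 10,
    "fast Ethernet (100 Mb/s)": 100,
    "gigabit Ethernet (1 Gb/s)": 1000,
}

for name, mbps in links_mbps.items():
    seconds = TB * 8 / (mbps * 10**6)  # total bits / (bits per second)
    print(f"{name}: {seconds / 3600:.1f} hours")
```

Even over gigabit Ethernet, a full terabyte takes over two hours of sustained line-rate transfer; over typical wide-area links it takes days, which is what makes the pipes "skinny" relative to local disk capacity.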

  8. Data-Intensive System Challenge • For Computation That Accesses 1 TB in 5 minutes • Data distributed over 100+ disks • Assuming uniform data partitioning • Compute using 100+ processors • Connected by gigabit Ethernet (or equivalent) • System Requirements • Lots of disks • Lots of processors • Located in close proximity • Within reach of fast, local-area network
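The "100+ disks" figure on this slide follows from the aggregate bandwidth the computation needs, divided by what one commodity disk can sustain. The 30 MB/s per-disk rate below is an assumed circa-2008 sequential figure, not one stated in the slides:

```python
# Why "100+ disks": aggregate scan bandwidth needed for 1 TB in 5 minutes,
# divided by the sustained rate of a single commodity disk.
# The 30 MB/s per-disk figure is an assumed circa-2008 sequential rate.
data_bytes = 10**12          # 1 TB
window_s = 5 * 60            # 5-minute window
per_disk_bps = 30 * 10**6    # 30 MB/s sustained per disk (assumption)

aggregate_bps = data_bytes / window_s        # bytes/s the system must deliver
disks_needed = aggregate_bps / per_disk_bps  # assumes uniform data partitioning
print(f"aggregate: {aggregate_bps / 10**9:.2f} GB/s, disks: {disks_needed:.0f}")
```

Roughly 3.3 GB/s of aggregate bandwidth, or about 110 disks, which also explains the matching "100+ processors" requirement: the CPUs must sit next to the disks to consume data at that rate.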

  9. DISC Distinguishing Features • Active Data Management • System maintains up-to-date copy of data • Colocate processing and storage • High-Level Programming Model • Express parallel computation without details of hardware • Interactive Access • Support simple queries up to demanding computations • Flexible Error Detection & Recovery • Consider hardware and software failures as part of normal operation • Runtime system hides effects from user

  10. Using Clouds for Data-Intensive Computing • Goal • Get researchers & students active in DISC • Without investing in lots of hardware • Hardware: Rent from Amazon • Elastic Compute Cloud (EC2) • Generic Linux cycles for $0.10 / hour ($877 / yr) • Simple Storage Service (S3) • Network-accessible storage for $0.15 / GB / month ($1800/TB/yr) • Software • Hadoop Project • Open source project providing file system and MapReduce programming model • Supported and used by Yahoo • Prototype on single machine, map onto cluster
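The "prototype on single machine, map onto cluster" workflow can be sketched with a minimal word count in the MapReduce style. Hadoop's native API is Java; the sketch below follows the Hadoop Streaming convention, where mapper and reducer are plain functions over text that can be tested locally before running on a cluster. The sample input is hypothetical:

```python
# A minimal MapReduce word count, local-simulation sketch.
# The map -> shuffle/sort -> reduce structure mirrors what Hadoop runs
# across a cluster; here all three phases execute in one process.
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum counts per word. Input must be sorted by key,
    which Hadoop's shuffle guarantees (simulated here with sorted())."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"]  # hypothetical input
    mapped = sorted(mapper(sample))                   # shuffle/sort phase
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```

Because the mapper and reducer are pure functions of their input streams, the same logic runs unchanged whether the "shuffle" is a local `sorted()` call or Hadoop's distributed sort across 100+ machines.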

  11. Access to Cloud Resources • Google setting up dedicated cluster for university use • Loaded with open-source software • Including Hadoop • IBM providing additional software support • NSF will determine how the facility should be used

  12. More Cloud Resources • Yahoo: Major supporter of Hadoop • Yahoo plans to work with other universities

  13. Using Clouds for Data-Intensive Computing • Build Clusters for Important Data Sets • E.g., sky survey, protein database • System collects and manages data • Remote users execute programs on cluster • Motivation • Avoid duplicate efforts at data collection & stewardship • Easier to move programs to data than data to programs • Issues • Who has what kinds of access? • Not just read/write file permissions • What metadata to create and maintain? • How to allocate computation & storage?

  14. Data-Intensive Cloud Hurdles • Colocating Data & Computation • Commodity machines → limited bandwidth • Across & within data centers • Critical Mass of Users • Must have desire to share single large data set • Some examples • Shared scientific data • Crawled web pages • Data within a corporation

  15. More Information • “Data-Intensive Supercomputing: The case for DISC” • Tech Report: CMU-CS-07-128 • Available from http://www.cs.cmu.edu/~bryant
