
Teaching HDFS/MapReduce Systems Concepts to Undergraduates

Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon*
* School of Computing, Clemson University
** Clemson Computing and Information Technology, Clemson University


Presentation Transcript


  1. Teaching HDFS/MapReduce Systems Concepts to Undergraduates Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon* * School of Computing, Clemson University ** Clemson Computing and Information Technology, Clemson University

  2. Contents • Introduction and Learning Objectives • Challenges • Hadoop computing platform – Options and Solution • Module Content – Lectures, Assignments, Data • Student Feedback • Module Content – Project • Ongoing and Future Work

  3. Introduction and Learning Objectives • Hadoop/MapReduce is an important current technology in the area of data-intensive computing • Learning objectives: • Understand the challenges of data-intensive computing • Become familiar with the Hadoop Distributed File System (HDFS), the underlying driver of MapReduce • Understand the MapReduce (MR) programming model • Understand the scalability and performance of MR programs on HDFS

  4. Challenges • Provide students with a high performance, stable, and robust Hadoop computing platform • Balance lecture and hands-on lab hours • Demonstrate the technical relationship between MapReduce and HDFS

  5. Computing Platform Options • MapReduce parallel programming interface • WebMapReduce is an example • Enables study of MR programming model at beginning level • Does not enable the study of HDFS for advanced students • Dedicated shared Hadoop cluster with individual accounts • Multiple student programs compete for resources • Individual errors affect other students • Dedicated cluster that supports multiple virtual Hadoop clusters • Not supported by Clemson’s supercomputer configuration

  6. Computing Platform Solution • Modification of SDSC’s myHadoop • Individual Hadoop platform deployment for each student in the class • First setup: • Medium amount of editing needed to set up • Numerous errors due to typos or failed configuration • Second setup: • Minimal amount of editing needed (one line) • Only a few students encountered errors due to typos

  7. Lecture and Hands-on Labs • Fall 2012: 5 class hours • 1 MR lecture, 1 lab, 1 HDFS lecture, 1 lab, 1 advanced MR optimization • Lab time not sufficient due to problems with Hadoop computing platforms • Spring 2013: 5 class hours • Lab time still not sufficient, due to errors in modifying myHadoop scripts • Fall 2013: 7 class hours • 1 MR lecture, 2 labs, 1 HDFS lecture, 2 labs, 1 HBase/Hive lecture

  8. Module Content: Lectures • Reused available online material with additional clarification • Slides from UMD, Jimmy Lin • Strong emphasis on the following points: • MR is a programming model that handles data parallelization • The HDFS infrastructure provides a robust and resilient way to distribute big data evenly across a cluster • The MR library takes advantage of HDFS and the MR programming model to let programmers write applications that handle big data conveniently and transparently • Data locality is the big theme in working with big data
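The map, shuffle, and reduce steps behind the MR programming model can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not Hadoop's Java API; the function names (map_words, shuffle, reduce_counts, run_job) are hypothetical and chosen only for this illustration. On a real cluster, the framework would run many map and reduce calls in parallel and perform the shuffle across the network.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data on HDFS", "big data with MapReduce"]))
```

Because each map call sees only one line and each reduce call sees only one key, the same two functions could be applied independently to separate blocks of a large file, which is the data parallelization the lecture emphasizes.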

  9. Module Content: Lectures [Diagram: HDFS and MapReduce daemons on a cluster] • HDFS abstractions: directories/files (File 01, File 02, File 03) are stored as HDFS blocks, which appear in the physical view of the Linux FS as blk_xxx files • NameNode: block metadata lives in memory; DataNodes report block information to the NameNode • JobTracker: detailed job progress lives in memory; it provides the NameNode with file/directory paths and receives block-level information • The NameNode and JobTracker could be the same machine • HDFS: DataNode daemons controlling block location • MapReduce: TaskTracker daemons executing tasks on blocks • TaskTrackers report progress to the JobTracker • The JobTracker assigns work and facilitates map/reduce on TaskTrackers based on block location information from the NameNode • Worker machines (each with RAM, HDD, CPU) are labeled with both a DataNode and a TaskTracker
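The interaction in the diagram, where the JobTracker uses block locations from the NameNode to place map tasks, can be sketched as a toy simulation. This is a simplified illustration under stated assumptions, not Hadoop's actual scheduler: the round-robin replica placement, the 64 MB block size, the replication factor of 2, and the names place_blocks and assign_map_tasks are all hypothetical. Real TaskTrackers request work through heartbeats rather than being assigned tasks in a single loop.

```python
BLOCK_SIZE_MB = 64  # assumed block size for this illustration
REPLICATION = 2     # kept small here so the example stays readable


def place_blocks(file_size_mb, datanodes):
    """NameNode-style metadata: map each block id to the DataNodes holding a replica."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    block_map = {}
    for b in range(n_blocks):
        # Hypothetical round-robin placement of replicas on consecutive nodes.
        block_map[f"blk_{b}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return block_map


def assign_map_tasks(block_map, slots_per_node):
    """JobTracker-style assignment: prefer a node that stores a replica of the block."""
    free = dict(slots_per_node)
    assignment = {}
    for block, holders in block_map.items():
        local = [node for node in holders if free.get(node, 0) > 0]
        if local:
            node = local[0]  # data-local task: no network read needed
        else:
            candidates = [n for n, slots in free.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node = candidates[0]  # remote task: the block must be read over the network
        free[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    blocks = place_blocks(200, nodes)
    for block, (node, is_local) in assign_map_tasks(blocks, {n: 2 for n in nodes}).items():
        print(block, "->", node, "(local)" if is_local else "(remote)")
```

The second element of each assignment records whether the task runs where its block is stored, which is the data-locality property the lectures stress.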

  10. Module Content: Assignments and Data • Assignments • One MR programming assignment, based on existing code, that familiarizes students with the MR API and programming flow • One MR/HDFS programming assignment that requires students to write an MR program and deploy it to run on a Hadoop computing platform • Data • Strive to be realistic • Big enough, but not too big • Airline Traffic Data (12 GB), Google Trace (200 GB), Yahoo Music Rating (10 GB), Movie Rating (250 MB)
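One way to reason about "big enough, but not too big" is to estimate how many HDFS blocks, and therefore roughly how many map tasks, each data set produces. The sketch below assumes a 64 MB block size (the Hadoop 1.x default), binary units (1 GB = 1024 MB), and one file per data set; the slides do not state the block size actually used in the course, and the helper name block_count is hypothetical. Data sets split across many files would produce at least this many blocks.

```python
import math


def block_count(size_mb, block_size_mb=64):
    """Number of HDFS blocks needed to hold a single file of size_mb megabytes."""
    return math.ceil(size_mb / block_size_mb)


# Sizes taken from the slide, converted to MB under the 1 GB = 1024 MB assumption.
datasets_mb = {
    "Airline Traffic Data": 12 * 1024,
    "Google Trace": 200 * 1024,
    "Yahoo Music Rating": 10 * 1024,
    "Movie Rating": 250,
}

if __name__ == "__main__":
    for name, size_mb in datasets_mb.items():
        print(f"{name}: about {block_count(size_mb)} blocks")
```

Under these assumptions the smallest data set fits in a handful of blocks, which suits a first lab, while the largest spans thousands of blocks and exercises the whole cluster.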

  11. Student Feedback • Voluntary in-class surveys encourage all students to participate (compared with out-of-class online surveys) • IRB approval for survey • Questions addressing: • Improvements in technical skills • Improvements in understanding about Hadoop/MR • Time taken to complete Hadoop/MR assignments • Time taken to set up Hadoop on Palmetto • Usefulness of guides/lectures/labs • Relevancy of Hadoop/MR topics • Appropriate level to begin teaching Hadoop/MR

  12. Student Feedback

  13. Student Feedback Primary student requests: • Fall 2013 • More labs • More details in HDFS guide • Spring 2014 • FAQ to address common configuration errors/interpretation of MR compilation errors • More time for projects • Reduced dependency between two Hadoop/MR assignments

  14. Module Content: Project • Was added to the course in Spring 2014 • Project in place of assignments • Three categories: • Data Analytics • Big data set • Interesting analytic problem relating to data • Performance Comparison • Big data set • Comparison between Hadoop MapReduce and MPI • System implementation • Augmenting myHadoop with additional software modules: Spark, HBase, or Hadoop 2.0 • Required IEEE two-column conference format for reports

  15. Module Content: Project • Data Sets: • Airline Traffic Data (12 GB) • NOAA Global Daily Weather Data (15-20 GB) • Amazon Food Reviews (354 MB, hundreds of thousands of entries) • Amazon Movie Reviews (8.7 GB, millions of entries) • Meme Trackers (53 GB, text) • Million Song Dataset (196 GB, HDF5 compressed) • Google Trace Data (~171 GB)

  16. Module Content: Project • Comparing performance between Hadoop and MPI-MR (Sandia) using Amazon Movie Reviews • Configuration and installation of Hadoop 2.0 on myHadoop • Amazon Crawler using iterative implementation of Hadoop MR • Performance comparison between Hadoop/MPI/MPI-IO on NOAA data • Performance comparison between Hadoop/MPI/MPI-IO on Google Trace data
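Performance comparisons such as those listed above are commonly summarized as speedup and parallel efficiency relative to the smallest run. The following is a minimal sketch of that arithmetic; the function name scaling_report is hypothetical, and the runtimes in the usage example are placeholder values for illustration only, not measurements from the student projects.

```python
def scaling_report(runtimes):
    """Given {node_count: seconds}, return {node_count: (speedup, efficiency)}.

    Speedup is measured against the run with the fewest nodes; efficiency
    divides the speedup by the factor by which the node count grew.
    """
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    report = {}
    for nodes in sorted(runtimes):
        speedup = base_time / runtimes[nodes]
        efficiency = speedup / (nodes / base_nodes)
        report[nodes] = (speedup, efficiency)
    return report


if __name__ == "__main__":
    # Placeholder runtimes in seconds, invented for this illustration.
    example = {1: 100.0, 2: 60.0, 4: 40.0}
    for nodes, (speedup, efficiency) in scaling_report(example).items():
        print(f"{nodes} nodes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Applying the same report to runs of each framework on the same data set and node counts would put a Hadoop versus MPI comparison on a common footing.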

  17. Module Content: Project • Positive Evaluation • Appropriateness of scope: 8.17/10 • Appropriateness of difficulty: 7.74/10 • Applicability of Hadoop/MR: 8.94/10 • Student Feedback • An integral element of the module/course • More time is needed • Start the project earlier in the semester • Fewer assignments, more project work

  18. Ongoing Work • Transition to Hadoop 2.0 • Inclusion of other current distributed and data-intensive technologies: • Spark/Shark for in-memory computing • Cascade/Tez for workflow computing • Swift? • Inclusion of additional real world data and problems in student projects

  19. Questions? Fall 2012: https://sites.google.com/a/g.clemson.edu/cp-cs-362/ Spring 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-sp2013/ Fall 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-fa2013/ Spring 2014: https://sites.google.com/a/g.clemson.edu/cpsc3620-sp2014/
