
Big Data & Hadoop








  1. Hannah Jones presents Big Data & Hadoop

  2. Agenda • What is Big Data? • What is Hadoop? • Hadoop vs. Traditional Database • Who uses Hadoop? • Brief History • How does Hadoop work? • Storage • Processing • Other Projects • Presentation Layer

  3. What is Big Data? • Collections of large and complex data sets • Difficult to process • Challenging to capture, store, search, share, transfer, analyze, and visualize • Volume, Velocity, Variety

  4. New types of Data • Structured • Unstructured • Tweets • Images • Audio • Text Messages • Click Trails

  5. Big Data = Big ROI • Healthcare: 20% decrease in patient mortality by analyzing streaming patient data • Telco: 92% decrease in processing time by analyzing networking and call data • Utilities: 99% improved accuracy in placing power generation resources by analyzing 2.8 petabytes of untapped data

  6. Software Available • Amazon DynamoDB • Couchbase 2.0 • Apache Cassandra • Apache Hadoop

  7. Hadoop

  8. HADOOP! The… Database…? • Massively scalable storage and batch data processing system • Replaces some functions of traditional databases • Simultaneously ingests, processes, and delivers/exports large volumes of data • Absorbs any type of data from any source

  9. Traditional DB • Databases (MySQL, SQL Server, Oracle, etc.) • Transactional systems, reporting, and archiving • Reads & writes for “reasonable” data sets (< 1B rows) • Real-time or batch processing

  10. Hadoop • Overcomes the traditional limitations of storage and computing • Provides linear scalability from 1 to 4,000 servers • Low-cost, open-source software • Leverages inexpensive commodity hardware as its platform

  11. Hadoop • Analysis of highly granular data • Structured and Unstructured data • Fast integration of multiple data sources • Responds well to rapidly changing business requirements

  12. Hadoop vs. Databases • Scale-Out instead of Scale-Up • Key/value pairs instead of relational tables • Functional programming (MapReduce) instead of declarative queries (SQL)  • Offline batch processing instead of online transactions • More complex security
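The "functional programming instead of declarative queries" contrast above can be sketched in a few lines of plain Python (an illustration of the idea only; real Hadoop jobs are written against the MapReduce API or Hadoop Streaming):

```python
# What SQL expresses declaratively -- SELECT word, COUNT(*) FROM words GROUP BY word;
# -- MapReduce expresses as explicit map and reduce functions over key/value pairs.

from collections import defaultdict

def map_phase(records):
    """Emit a (key, value) pair for every input record."""
    for word in records:
        yield (word, 1)

def shuffle(pairs):
    """Group all values by key (the framework does this in real Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine each key's list of values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

words = ["hadoop", "sql", "hadoop"]
print(reduce_phase(shuffle(map_phase(words))))  # {'hadoop': 2, 'sql': 1}
```

The database engine plans the GROUP BY for you; in MapReduce you supply the two functions and the framework handles distribution, sorting, and grouping.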

  13. Who uses Hadoop?

  14. History

  15. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  16. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  17.–22. HDFS – Gettysburg Address Example (six image-only walkthrough slides; no transcript text)
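Although the example slides are image-only, HDFS's storage model can be approximated with a toy simulation. The block size, node names, and round-robin placement below are assumptions for illustration; classic HDFS uses 64 MB (later 128 MB) blocks and a default replication factor of 3, with the NameNode choosing replica locations:

```python
# Toy simulation of HDFS storage: split a file into fixed-size blocks,
# then place each block on several DataNodes.

BLOCK_SIZE = 16   # bytes here; ~64 MB in classic HDFS
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin here;
    the real NameNode also considers rack topology)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

data = b"Four score and seven years ago our fathers brought forth"
blocks = split_into_blocks(data)
print(len(blocks), place_replicas(blocks, ["node1", "node2", "node3", "node4"]))
```

The NameNode keeps only this block-to-node mapping in memory; the blocks themselves live on the DataNodes, which is why losing every replica of a block loses that data.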

  23. NameNode Limitations • Lose all 3 machines => lose all of the data • Very rare • Single point of failure • Doesn't fail very often • Secondary NameNode • A backup • Failover is not automatic

  24. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  25. Map Reduce • Break a massive task into smaller chunks • Process in parallel • Key/Value Pairs vs. Columns/Rows -> map() function

  26. Steps to understanding MapReduce • Think in terms of keys and values • Write a Mapper • Write a Reducer
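The three steps above map directly onto a Hadoop Streaming word-count job, which pipes text through a mapper script and a reducer script. The sketch below writes them as functions over iterables so the logic is testable without a cluster; in a real job each would read stdin and write stdout:

```python
# Step 1: think in keys and values -- the mapper emits "word\t1" per word.
# Step 2/3: Hadoop sorts mapper output by key before it reaches the reducer,
# so the reducer can sum each key's run of values with groupby.

from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word in key-sorted mapper output."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Locally we simulate Hadoop's sort-by-key shuffle with sorted().
    mapped = sorted(mapper(["four score and seven", "and four"]))
    print(list(reducer(mapped)))
```

On a cluster, the same two scripts would be handed to the streaming jar (roughly `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`); the framework supplies the splitting, sorting, and distribution.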

  27. Job and Task Trackers • JobTracker • MapReduce coordinator • 1 JobTracker for an entire cluster • Accepts users' jobs • Divides each job into tasks • Assigns tasks to individual TaskTrackers • Each TaskTracker reports status as it runs • Notices if a TaskTracker disappears due to failure • Reassigns that TaskTracker's tasks to another TaskTracker
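The JobTracker's bookkeeping can be sketched as a small simulation (hypothetical class and method names; the real JobTracker/TaskTracker protocol uses periodic heartbeats over RPC, and failed tasks are rescheduled individually):

```python
# Toy model of the coordinator: divide a job into tasks, assign them,
# and reassign everything from a TaskTracker that stops responding.

class JobTracker:
    def __init__(self, task_trackers):
        self.assignments = {tt: [] for tt in task_trackers}

    def submit_job(self, tasks):
        """Divide a job into tasks and assign them round-robin."""
        trackers = list(self.assignments)
        for i, task in enumerate(tasks):
            self.assignments[trackers[i % len(trackers)]].append(task)

    def on_tracker_lost(self, dead, replacement):
        """A TaskTracker stopped heartbeating: move its tasks elsewhere."""
        self.assignments[replacement].extend(self.assignments.pop(dead))

jt = JobTracker(["tt1", "tt2"])
jt.submit_job(["map-0", "map-1", "map-2"])
jt.on_tracker_lost("tt1", "tt2")
print(jt.assignments)  # all three tasks end up on tt2
```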

  28. MapReduce – Gettysburg Address Example • Hello World :: Regular Programming • Word Count :: MapReduce Programming

  29. Word Count Example Map

  30. Word Count Example

  31. Word Count Example Shuffle <Four, [1, 1]> <Score, [1, 1, 1]> <and, [1, 1, 1, 1, 1, 1, 1]> <seven, [1, 1, 1, 1]> <years, [1]> <ago, [1, 1, 1]> <our, [1, 1, 1, 1, 1]> <fathers, [1]> <brought, [1]>

  32. Word Count Example Reduce <Four, 2> <Score, 3> <and, 7> <seven, 4> <years, 1> <ago, 3> <our, 5> <fathers, 1> <brought, 1>
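The map → shuffle → reduce flow on slides 29–32 can be replayed end to end in plain Python (a local simulation, not Hadoop; the sample sentence below is made up, so its counts differ from the full-address counts on the slides):

```python
from collections import defaultdict

text = "four score and seven years ago four and score and"

# Map: one (word, 1) pair per word
mapped = [(word, 1) for word in text.split()]

# Shuffle: group the 1s by key, producing lists like <four, [1, 1]> (slide 31)
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reduce: sum each list, producing pairs like <four, 2> (slide 32)
reduced = {word: sum(ones) for word, ones in shuffled.items()}
print(reduced)
```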

  33. MapReduce jobs can be tedious to write…

  34. MapReduce jobs can be tedious to write…

  35. But wait… there’s more

  36. ZooKeeper

  37. New Projects • HCatalog • Stores metadata • Oozie • Scheduling system • Sqoop • Transfers data between Hadoop and relational databases • Mahout • Machine learning w/ MapReduce • BigTop • Packages and integrates all of these projects

  38. Getting Data Out • Databases • Datameer • Tableau

  39. Summary • Big Data is the new trend in businesses • Hadoop is a way to store and process that data • Hadoop is made up of many projects • Presentation tools are a great way to visualize the data
