
Hadoop Video Online Training by Expert

Learn about big data and how Hadoop can help process and analyze large amounts of unstructured data. Contact us for expert training.

Presentation Transcript


  1. Hadoop Video/Online Training by Expert • Contact Us: India: 8121660088, USA: 732-419-2619 • Site: http://www.hadooptrainingacademy.com/

  2. Introduction • Big Data: • Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a company creates. • Data that would take too much time and cost too much money to load into a relational database for analysis. • Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data. http://www.hadooptrainingacademy.com

  3. The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of data per year. http://www.hadooptrainingacademy.com

  4. What Caused The Problem? http://www.hadooptrainingacademy.com

  5. So What Is The Problem? http://www.hadooptrainingacademy.com • The transfer speed is around 100 MB/s • A standard disk is 1 terabyte • Time to read an entire disk = 10,000 seconds, or about 3 hours! • Simply using a faster processor may not help, because: • Network bandwidth is now more of a limiting factor • Physical limits of processor chips have been reached

  6. So What do We Do? • The obvious solution is to use multiple processors to solve the same problem by splitting it into pieces. • Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. http://www.hadooptrainingacademy.com
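  To make the arithmetic on the last two slides concrete, here is a minimal sketch that reproduces the single-disk and 100-drive read times. It is not part of the original deck; the class name is an illustrative choice, and it simply uses the 100 MB/s and 1 terabyte figures quoted above.

    public class DiskReadTime {
        public static void main(String[] args) {
            double diskSizeMB = 1_000_000;   // 1 terabyte, expressed in megabytes (decimal units)
            double transferMBps = 100;       // assumed sustained transfer speed from the slide

            // Reading the whole disk sequentially from one drive
            double singleDiskSeconds = diskSizeMB / transferMBps;
            System.out.printf("One disk:   %.0f s (~%.1f hours)%n",
                    singleDiskSeconds, singleDiskSeconds / 3600);   // ~10,000 s, ~2.8 hours

            // Spreading the same data across 100 drives read in parallel
            int drives = 100;
            double parallelSeconds = singleDiskSeconds / drives;
            System.out.printf("100 drives: %.0f s (~%.1f minutes)%n",
                    parallelSeconds, parallelSeconds / 60);         // ~100 s, under 2 minutes
        }
    }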

  7. Distributed Computing Vs Parallelization • Parallelization: multiple processors or CPUs in a single machine • Distributed Computing: multiple computers connected via a network http://www.hadooptrainingacademy.com

  8. Examples • The Cray-2 was a four-processor ECL vector supercomputer made by Cray Research starting in 1985. http://www.hadooptrainingacademy.com

  9. Distributed Computing • The key issues involved in this solution: • Hardware failure • Combining the data after analysis • Network-associated problems http://www.hadooptrainingacademy.com

  10. What Can We Do With A Distributed Computer System? • IBM Deep Blue • Multiplying large matrices • Simulating several hundreds of characters (the Lord of the Rings films) • Indexing the Web (Google) • Simulating an internet-size network for network experiments http://www.hadooptrainingacademy.com

  11. Problems In Distributed Computing • Hardware Failure: As soon as we start using many pieces of hardware, the chance that one will fail is fairly high. • Combine the data after analysis: Most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. http://www.hadooptrainingacademy.com

  12. To The Rescue! Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed Filesystem (HDFS) takes care of this problem. The second problem is solved by a simple programming model: MapReduce. Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. http://www.hadooptrainingacademy.com
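  As a rough illustration of how small the MapReduce programming model is, here is a minimal word-count sketch written against Hadoop's standard Mapper and Reducer classes. The names WordCount, TokenizerMapper, and IntSumReducer are illustrative choices, not something named in the presentation.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: runs next to each block of input data,
        // turning every line into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);   // emit (word, 1)
                    }
                }
            }
        }

        // Reduce phase: the framework groups all counts for the same word,
        // possibly read from many different disks, and hands them to us to combine.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));  // emit (word, total)
            }
        }
    }

  The shuffle between the two phases is what handles the "combine the data after analysis" problem from slide 11: intermediate pairs from every disk are grouped by key before the reducer sees them.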

  13. What Else is Hadoop? http://www.hadooptrainingacademy.com A reliable shared storage and analysis system. There are other subprojects of Hadoop that provide complementary services, or build on the core to add higher-level abstractions. The various subprojects of Hadoop include: Core, Avro, Pig, HBase, ZooKeeper, Hive, and Chukwa.

  14. Hadoop Approach to Distributed Computing • A theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines. • Hadoop ties these smaller, more reasonably priced machines together into a single cost-effective compute cluster. • Hadoop provides a simplified programming model which lets the user quickly write and test distributed systems; its efficient, automatic distribution of data and work across machines in turn exploits the underlying parallelism of the CPU cores. http://www.hadooptrainingacademy.com
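  For a sense of how little driver code is needed to run such a job on a cluster, here is a sketch of a job driver. It assumes the hypothetical WordCount mapper and reducer from the earlier sketch are on the classpath, and the input and output paths are placeholder arguments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // args[0] = input directory in HDFS, args[1] = output directory
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Wire in the mapper and reducer sketched under slide 12
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Hadoop splits the input, schedules map tasks near the data,
            // and re-runs tasks on other machines if one fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  Once packaged into a JAR, the same program runs unchanged on a single test machine or on a large cluster, which is the point of the simplified programming model described above.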
