
Unleashing Big Data Potential: Exploring Hadoop

Dive into the world of big data analytics with our comprehensive PDF submission on Hadoop. Discover how Hadoop revolutionizes data processing, storage, and analysis at scale. Gain insights into Hadoop's ecosystem, including HDFS, MapReduce, and Hive, and explore real-world use cases. Equip yourself with the knowledge needed to harness Hadoop's capabilities and unlock valuable insights from your data. To know more: https://stonefly.com/white-papers/manage-the-hadoop-big-data-with-scale-out-nas-plug-in/


Presentation Transcript


Big Data & Hadoop Insights

© 2023 StoneFly, Inc. | All rights reserved.

Abstract

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Apache Hadoop is a core component of a modern data infrastructure, supporting data pools and allowing organizations to collect, store, analyze and manipulate massive quantities of data on their own terms. StoneFly is a leading innovator in the industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications such as Apache Hadoop. Regardless of the data type, source, location or format, the StoneFly Scale-Out NAS delivers a scale-out platform that provides a highly efficient and highly scalable way to manage all your data.

In this white paper on Big Data and Hadoop you will learn about:

- The impact of big data on businesses and governments
- What Apache Hadoop is and how it works
- The core technologies of Hadoop
- Differences between Hadoop data and conventional databases
- The Apache Hadoop ecosystem
- How to manage Hadoop big data with StoneFly's Scale-Out NAS storage plug-in

Why Big Data?

Governments and businesses are all gathering lots of data these days: videos, images, transactions and more. But why? The answer is that data is incredibly valuable. Analyzing all of it lets us do things like detect fraud going back years. In the early days of the Internet age, GFS, the Google File System, was created to take on a formidable task: indexing the web. The idea behind GFS is that instead of having a giant file storage appliance sitting in the back end, you use industry-standard hardware on a large scale and drive high performance through the sheer number of components. Given that standard hardware is used, you expect failures and you achieve reliability through redundancy and replication. All hardware fails at some point, and if you have thousands of nodes it is likely that nodes will fail frequently. In this model that doesn't matter, because your data doesn't live in any one place: it is replicated three times around the network to ensure availability. Even better, the data is broken into chunks and spread across many nodes, so there is no central storage system to overwhelm. This also provides scale: as you add more nodes you add more computing power and more storage capacity. That is the idea behind Hadoop.

How Hadoop Works

Suppose we wanted to look for an image spread across many hundreds of files. First, Hadoop has to know where that data is. It queries something called the "Name Node" to find out all the places where the data file is located. Once it has figured that out, it sends your job out to each one of those nodes. Each of those processors independently reads its input file, looks for the image and writes the result to a local output file, all in parallel. When they all report "finished," you're done.

There is a lot more to Hadoop than image recognition. For instance, you can do statistical data analysis. You might want to calculate means, averages, correlations and all sorts of other statistics; you might want to look at unemployment versus population versus income versus state. If you have all the data in Hadoop, you can do that.

You can also do machine learning and all sorts of other analysis. Once you have the data in Hadoop, there is almost no limit to what you can do.

In Hadoop, data is always distributed, both the input and the output. The data is also replicated: copies are kept of all the data blocks, so if one node falls over, it doesn't affect the result. That is how we get reliability. But sometimes we need to communicate between nodes; it is not enough for every node to process its local data alone. Examples of that are counting and sorting. For that purpose, the Hadoop storage system is paired with a computing model called MapReduce. The idea is that you take your data-oriented task, chunk it up and distribute it across the network, such that every piece of work is done by the machine that holds the piece of data it needs to work on. Not only does your storage now scale with compute implicitly, you also need a lot less network bandwidth because you are not transferring massive amounts of data around.
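As a rough illustration of the "Name Node" lookup and the three-way replication described above, here is a minimal sketch using Hadoop's Java FileSystem API; the file path is a hypothetical placeholder, and the cluster address is assumed to come from the client's configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsMyData {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and the rest of the cluster settings from core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/images/part-00042");   // hypothetical input file
        FileStatus status = fs.getFileStatus(file);

        // Ask the Name Node which Data Nodes hold each block of the file.
        // With the default replication factor of 3, each block lists three hosts.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " replicas on " + String.join(", ", block.getHosts()));
        }
    }
}
```

For each block, the hosts printed are the Data Nodes holding that block's replicas, which is exactly the information Hadoop uses to ship the job to the data rather than the data to the job.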

The Core Technologies of Hadoop

At its core, Hadoop is a distributed file system and a processing paradigm. The file system is called HDFS (Hadoop Distributed File System), and the processing paradigm is called MapReduce.

Hadoop Distributed File System (HDFS)

HDFS is a file system written in Java which sits on top of the native file system of whatever OS you operate. It is built on x86 standards, which are very cost effective when it comes to processing, particularly when compared to high-performance computing (HPC). HPC certainly has its place and is a great technology, but it is not a prerequisite for a Hadoop ecosystem. With x86 there are reference architectures for whatever brand of server you'd like to use, so there is a lot of flexibility. As discussed earlier, HDFS was built to expect failures from these servers. The idea is that the data comes in, it lives on these servers, and you push workloads to each of these servers to run locally rather than pulling the data into a central location. That is a huge advantage.

MapReduce

MapReduce is the processing paradigm that pairs with HDFS. It is a distributed computational algorithm that pushes the compute down to each of the x86 servers. It is a combination of a Map procedure and a Reduce procedure: the Map procedure performs filtering and sorting of the data, and the Reduce procedure performs summary operations.

How MapReduce Works

MapReduce is a way of putting a summary, a Cliffs Notes version, on each server of what data that server contains. You can look at it as a table of contents.

Each Data Node creates a table of contents of the information it contains. Those tables of contents go to one central server, which is essentially the search function, the "Name Node." The "Name Node," or search function, tells you on which particular "Data Node" the file data is kept. That is done through MapReduce and Hadoop together; these are the technologies driving big data.

Let's take a little application called Count Dates. This application counts the number of times each date occurs across many different files (see the figure above). The first phase is called the map phase. Each processor that has an input file reads the input file in, counts the number of times those dates occurred, and writes the result out as a set of key and value pairs. After that's done comes the shuffle phase.

Hadoop automatically sends all of the 2000 data to one processor, all of the 2001 data to another processor, and all of the 2002 data to another processor. After that shuffle phase is complete we can do what's called a reduce. In the reduce phase, all of the 2000 data is summed up and written to the output file. When every Data Node is done with its summations, they report done and the job is done.
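The Count Dates walkthrough above maps directly onto Hadoop's Java MapReduce API. The following is a minimal sketch of such a job, assuming for illustration that each input record begins with a four-digit year; the input and output paths are command-line placeholders, not anything defined in this paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountDates {

    // Map phase: each node reads its local input split and emits (year, 1) per record.
    public static class DateMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text year = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.length() >= 4) {                  // assumes each record starts with a 4-digit year
                year.set(line.substring(0, 4));
                context.write(year, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle groups all values for one year onto one node, sum them up.
    public static class DateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : values) {
                total += count.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "count dates");
        job.setJarByClass(CountDates.class);
        job.setMapperClass(DateMapper.class);
        job.setCombinerClass(DateReducer.class);        // optional local pre-aggregation before the shuffle
        job.setReducerClass(DateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /data/dates/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper runs on whichever node holds each input split, the framework's shuffle groups all counts for a given year onto one reducer, and the reducer writes the summed total, mirroring the map, shuffle and reduce phases described above.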

Hadoop vs. Conventional Databases

We have seen a couple of examples of how Hadoop works. The next question is how Hadoop compares to conventional relational databases, because they have dominated the market for years. One big difference is that in Hadoop the data is distributed across many nodes and the processing of that data is distributed as well. By contrast, in a conventional relational database, conceptually all the data sits on one server and in one database. But there are more differences than that.

Relational Databases | Hadoop and Allies
Optimized for queries that return small datasets | Optimized for creating and processing large datasets
Multi-writer data – transactions and locks | Archival data – write once, read many
Structured data – rigid database rules/schema | Semi-structured data – variable formats of data
Conceptually a single computer and attached storage | Distributed data on "off the shelf" hardware and clusters
Standard SQL programming | Lightweight SQL dialects – NoSQL
Fault tolerance through high-performance hardware and RAID | Fault tolerance through replication

The biggest difference is that in Hadoop data is "write once, read many." In other words, once you have written data, you are not allowed to modify it. You can delete it, but you cannot modify it. In relational databases, on the other hand, data can be written many times, like the balance on your account. But for the archival data Hadoop is optimized for, once you have written the data you don't want to modify it. If it is archival data about telephone calls or transactions, you don't want to change it once it has been written.

There is another difference too. With relational databases we always use SQL. Core Hadoop, by contrast, does not support standard SQL; the ecosystem instead offers lightweight SQL dialects and NoSQL approaches rather than a conventional relational SQL engine. Also, Hadoop is not just a single product or platform. It is a very rich ecosystem of tools, technologies and platforms, almost all of which are open source and all of which work together.

The Hadoop Ecosystem

At the lowest level, Hadoop just runs on commodity hardware and software. You don't need to buy any special hardware, and it runs on many operating systems. On top of that is the Hadoop layer: MapReduce and the Hadoop Distributed File System (HDFS). On top of that is a set of tools and utilities such as YARN, HIVE, PIG, IMPALA, SOLR and SPARK. The neat thing about those tools is that they support semi-structured and unstructured data.

1. YARN

YARN stands for "Yet Another Resource Negotiator." It is a resource manager for the different workloads that you plug in on top of Hadoop. It manages compute resources and clusters and schedules users' applications.

2. SQOOP

SQOOP was introduced to help us work with different kinds of data and get that data into the system so we can start to process it.

When it comes to relational databases, we can pull data from them into HDFS using SQOOP. The name was arrived at by combining SQL and Hadoop, and you can think of it as scooping data out of relational databases. In addition to pulling data from databases, SQOOP also allows you to push data out to them, and it can compress data to help it make the journey. It is used frequently in the Hadoop distributions, but you can also imagine any ETL (extract, transform, load) vendor doing the same thing with a graphical interface.

3. FLUME

What about other things like web server logs, networking logs and sensor information? These are less straightforward than a relational database, so how do we handle them? That's where FLUME comes into play. FLUME is a massively distributable framework for event-based data. With FLUME we can bring streaming event data into HDFS. If you think of a physical flume for logging, it is a little river that routes logs and timber to processing factories so the wood can be cut up and shipped. The same way they do this for physical logs, we do it for data logs. Now we've got all the data loaded in.

4. HIVE

With the data in HDFS, we can do some work on it. We could write some Java, Ruby or Perl code to do that work. However, why write thousands of lines of code when a few lines of SQL would do? HIVE was created to do just that. HIVE allows you to run SQL that is converted to MapReduce and run against HDFS. So now you get this massively powerful processing expressed in SQL, which is great for usability (a small sketch of querying Hive appears after this list).

5. Apache PIG

PIG was created as a higher-level scripting language that allows you to create MapReduce programs to run against your data.

It is slightly analogous to Oracle's PL/SQL. You can think of PIG as having a massive appetite for data, to make the name easier to remember. Both PIG and HIVE help us process massive amounts of data, but they aren't always the best for low-latency SQL.

6. IMPALA

To solve the low-latency issue, IMPALA provides a low-latency SQL engine that bypasses MapReduce (see the figure above). It runs the same kind of SQL queries. For instance, for a query that returns 10 rows from a 1,000-row table, HIVE may take 20 to 30 seconds, whereas IMPALA takes only milliseconds. This blazing speed is also how IMPALA came to be the name of this component in the Hadoop ecosystem.

7. SOLR

SOLR is the search tool in the Hadoop ecosystem. SOLR allows the indexing of all your Hadoop data, which comes in handy if you would like to search it. With SOLR there is the flexibility to route the data being brought in by SQOOP and FLUME directly into SOLR for indexing on the fly. However, you can also tell SOLR to index that data in batches.

8. SPARK

SPARK has generated a lot of buzz recently in the ecosystem and is delivering on its promise. This is a technology that may ultimately be able to replace MapReduce down the road, in addition to providing real-time streaming capabilities and machine learning. MapReduce uses a specific technology and a specific approach to breaking up calculations. SPARK, on the other hand, lets you use a more traditional API for analyzing data that doesn't make you think in terms of MapReduce. It also leverages in-memory processing more than MapReduce can.

This means that when a calculation involves multiple phases or iterations, SPARK can do the work in memory rather than writing to disk between each phase, which makes for much faster processing. As SPARK matures, it will eventually offer the same ease-of-use tools for batch processing.
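As promised under the HIVE item above, here is a minimal sketch of running a query through HiveServer2's JDBC interface from Java; the endpoint, credentials and the events table are hypothetical placeholders, not anything defined in this paper. Hive plans and runs such a statement as distributed jobs over the data in HDFS, so the few lines of SQL replace a hand-written MapReduce program.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDateCounts {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; requires the Hive JDBC driver on the classpath
        String url = "jdbc:hive2://hiveserver:10000/default";
        String query = "SELECT year(event_date) AS yr, COUNT(*) AS occurrences "
                + "FROM events GROUP BY year(event_date)";   // 'events' is a hypothetical table

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                // A few lines of SQL instead of thousands of lines of custom code
                System.out.println(rs.getInt("yr") + "\t" + rs.getLong("occurrences"));
            }
        }
    }
}
```

Behind the scenes Hive launches the distributed work across the cluster, so the usability benefit described above does not give up the parallelism of Hadoop.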

The Hadoop Authorization Tool: SENTRY

SENTRY fills the authorization gap in the Hadoop ecosystem. For instance, we may not want Bob to see Mary's table, or vice versa. By enforcing fine-grained, role-based authorization to data and metadata stored on the Hadoop cluster, SENTRY provides the role-based access control that is critical for security.

High-performance Scale-Out NAS for Hadoop

StoneFly's Scale-Out NAS Storage offers an enterprise-grade alternative to the underlying Hadoop Distributed File System (HDFS) that lets you keep data in a POSIX-compatible storage environment while performing big data analytics with the Hadoop MapReduce framework. To overcome the traditional limitations of hardware-based storage, StoneFly has created an HDFS plug-in that enables MapReduce to run directly on StoneFly's Scale-Out NAS Storage. This plug-in uses Scale-Out NAS Storage volumes to run Hadoop jobs across multiple namespaces, allowing you to perform in-place analytics without migrating data in or out of HDFS.

Integration of the plug-in into the Hadoop ecosystem goes well beyond MapReduce and HDFS. The Hadoop plug-in is compatible with Hadoop-based applications and supports technologies such as Hive, Pig, HBase, Tez, Sqoop, Flume and more.

In this example we see four Scale-Out NAS Storage servers in a trusted storage pool, split between two zones for high availability. A separate server runs the "Ambari" management console, the "YARN Resource Manager" and the "Job History Server" (see Figure 1). This architecture eliminates the centralized metadata server and supports a fully fault-tolerant system with two- or three-way replication across a cluster that can scale anywhere from 2 to 128 nodes.

To eliminate complex and time-consuming code rewrites, StoneFly's Scale-Out NAS Storage supports data access through several different mechanisms: file access with NFS or SMB, object access with Swift, and access via the Hadoop file system API. You can use standard Linux tools and utilities such as grep, awk and Python, and take advantage of multi-protocol support including native StoneFly Scale-Out NAS Storage, NFS, SMB, HCFS and Swift (see Figure 2).
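To illustrate what "access via the Hadoop file system API" means in practice, here is a minimal sketch that reads and filters a log file through whatever Hadoop-compatible file system (HCFS) the client is configured to use; the path and the grep-style filter are hypothetical, and nothing here depends on the specific back end, whether HDFS or an alternative such as the Scale-Out NAS plug-in.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadThroughHadoopApi {
    public static void main(String[] args) throws Exception {
        // Resolves whatever Hadoop-compatible file system is configured as the default;
        // code written against the FileSystem API does not change when the back end changes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path logFile = new Path("/analytics/logs/web.log");   // hypothetical path
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(logFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("ERROR")) {   // grep-style filter, like the standard Linux tools mentioned above
                    System.out.println(line);
                }
            }
        }
    }
}
```

Because the program only talks to the FileSystem abstraction, the same code runs whether the data lives in HDFS or in another Hadoop-compatible store.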

Figure 1

Figure 2

You also have the ability to grow or shrink a cluster on the fly without impacting application availability, and to perform automatic data rebalancing.

Let's take a closer look at the plug-in in action. From the "Ambari" management console you are able to start all the services with the click of a button (see Figure 3).

Figure 3

We see a number of Hadoop services on the "Ambari" manager node, and four nodes in the StoneFly Scale-Out NAS Storage cluster (see Figure 4). In the terminal window (see Figure 5), we see maps and reduces happening in real time on the Scale-Out NAS Storage nodes, and the management console shows us that all the work is complete (see Figure 6).

The StoneFly Scale-Out NAS Storage plug-in for Apache Hadoop makes it painless and cost effective to run analytics on data in Apache Hadoop, eliminating many of the challenges enterprises face when working with the Hadoop Distributed File System.

Figure 4

Figure 5

Figure 6

www.stonefly.com
2865, 2869 and 2879 Grove Way, Castro Valley, CA 94546, USA
+1 (510) 265-1616
