
Distributed and Parallel Processing Technology Chapter 9. Setting up a Hadoop Cluster



  1. Distributed and Parallel Processing Technology, Chapter 9. Setting up a Hadoop Cluster. Hwan Hee Kim

  2. Table of Contents
  • Cluster Specification
  • Cluster Setup and Installation
  • SSH Configuration
  • Hadoop Configuration
  • Post Install
  • Benchmarking a Hadoop Cluster

  3. 1. Cluster Specification
  • Hadoop is designed to run on commodity hardware.
  • "Commodity" does not mean "low-end". Low-end machines often have cheap components, which have higher failure rates than more expensive machines.
  • On the other hand, large database-class machines are not recommended either, because they do not score well on the cost/performance ratio.

  4. 1. Cluster Specification
  • Why not use RAID (Redundant Array of Independent Disks)?
  • HDFS clusters do not benefit from using RAID for datanode storage.
  • The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
  • RAID striping, which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS. JBOD performed 10% faster than RAID 0 in one test, and 30% better in another.
  • If a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array to become unavailable.

  5. 1. Cluster Specification
  • How large should your cluster be?
  • The beauty of Hadoop is that you can start with a small cluster and grow it as your storage and computational needs grow.
  • For a small cluster, it is usually acceptable to run the namenode and the jobtracker on a single master machine.
  • As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.

  6. 1. Cluster Specification
  • Network Topology
  • To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network.
  • If the cluster runs on a single rack, there is nothing more to do, since this is the default.
  • However, for multirack clusters, you need to map nodes to racks. The jobtracker uses network location to determine where the closest replica is as input for a map task that is scheduled to run on a tasktracker.
  • The Hadoop configuration must specify a map between node addresses and network locations. The default implementation is ScriptBasedMapping, which runs a user-defined script to determine the mapping (see the sketch after this list).
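
A minimal sketch of a topology mapping, assuming Hadoop 1.x property names (topology.script.file.name) and a hypothetical script path and lookup file; adapt both to your own network:

    <!-- core-site.xml: point Hadoop at the topology script -->
    <property>
      <name>topology.script.file.name</name>
      <value>/etc/hadoop/topology.sh</value>
    </property>

    #!/bin/bash
    # /etc/hadoop/topology.sh: Hadoop passes node addresses as arguments and
    # expects one rack name per argument. /etc/hadoop/topology.map is a
    # hypothetical "address rack" lookup file maintained by the administrator.
    while [ $# -gt 0 ]; do
      rack=$(grep "^$1 " /etc/hadoop/topology.map | awk '{print $2}')
      echo "${rack:-/default-rack}"   # fall back to the default rack if unknown
      shift
    done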

  7. 2. Cluster Setup and Installation
  • There are various ways to install and configure Hadoop.
  • To ease the burden of installing and maintaining the same software on each node, it is normal to use an automated installation method like Red Hat Linux's Kickstart or Debian's Fully Automatic Installation.
  • Hadoop installation is done in three steps:
  • Installing Java
  • Creating a Hadoop user
  • Installing Hadoop

  8. 2. Cluster Setup and Installation
  • Installing Java
  • Java 6 or later is required to run Hadoop. The following command confirms that Java was installed correctly (see the example commands after this list).
  • Creating a Hadoop user
  • It's good practice to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine.
  • Installing Hadoop
  • Download Hadoop from the Apache Hadoop releases page, and unpack the contents of the distribution in a sensible location, such as /usr/local.
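
A minimal sketch of the three steps on one node, assuming a tarball named hadoop-x.y.z.tar.gz (placeholder version) and a user account named hadoop; names and paths are illustrative:

    # 1. Confirm Java is installed and is version 6 or later
    % java -version

    # 2. Create a dedicated user account for Hadoop (account name is an assumption)
    % sudo useradd -m hadoop

    # 3. Unpack the distribution and give the hadoop user ownership
    % cd /usr/local
    % sudo tar xzf hadoop-x.y.z.tar.gz
    % sudo chown -R hadoop:hadoop hadoop-x.y.z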

  9. 3. SSH Configuration
  • The Hadoop control scripts rely on SSH to perform cluster-wide operations.
  • To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster.
  • Procedure
  • First, generate an RSA key pair by typing the following in the Hadoop user account (see the example commands after this list).
  • Next, we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to.
  • Test that you can SSH from the master to a worker machine by making sure the SSH agent is running.
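
A minimal sketch of the SSH setup, run as the hadoop user and assuming ssh-agent is already running; the worker hostname worker1 is illustrative:

    # Generate an RSA key pair (choose a passphrase when prompted)
    % ssh-keygen -t rsa -f ~/.ssh/id_rsa

    # Make the public key available in authorized_keys on every cluster machine
    # (a shared home directory or ssh-copy-id are common ways to do this)
    % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    # Cache the passphrase in the agent, then test password-less login
    % ssh-add
    % ssh worker1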

  10. 4. Hadoop Configuration
  • Configuration Management
  • Environment Settings
  • Important Hadoop Daemon Properties
  • Other Hadoop Properties

  11. 4. Hadoop Configuration – Configuration Management
  • Each Hadoop node in the cluster has its own set of configuration files, and it is up to administrators to ensure that they are kept in sync across the system.
  • Hadoop is designed so that it is possible to have a single set of configuration files that are used for all master and worker machines.
  • The great advantage of this is simplicity, both conceptually and operationally.

  12. 4. Hadoop Configuration – Configuration Management
  • If you expand the cluster with new machines that have a different hardware specification from the existing ones, you need a different configuration for the new machines.
  • In this case, you need the concept of a class of machine. There are several excellent tools for managing configuration classes, such as Puppet, cfengine, and bcfg2.
  • Control scripts (see the usage sketch after this list)
  • start-dfs.sh
  • The start-dfs.sh script, which starts all the HDFS daemons in the cluster, runs the namenode on the machine the script is run on.
  • start-mapred.sh
  • There is a similar script called start-mapred.sh, which starts all the MapReduce daemons in the cluster.
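
A minimal usage sketch of the control scripts, assuming the Hadoop 1.x-era layout where worker hostnames (illustrative here) are listed in conf/slaves:

    # conf/slaves lists the machines that run datanodes and tasktrackers
    % cat conf/slaves
    worker1
    worker2

    # Run on the namenode machine: starts the namenode locally, a datanode on
    # each machine in conf/slaves, and the secondary namenode(s) in conf/masters
    % bin/start-dfs.sh

    # Run on the jobtracker machine: starts the jobtracker locally and a
    # tasktracker on each machine in conf/slaves
    % bin/start-mapred.sh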

  13. 4. Hadoop Configuration – Configuration Management
  • Master node scenarios
  • Depending on the size of the cluster, there are various configurations for running the master daemons: the namenode, the secondary namenode, and the jobtracker.
  • The namenode has high memory requirements, as it holds file and block metadata for the entire namespace in memory.
  • The secondary namenode keeps a copy of the latest checkpoint of the filesystem metadata that it creates. Keeping this backup on a different node from the namenode allows recovery in the event of loss of all the namenode's metadata files.
  • The jobtracker uses considerable memory and CPU resources, so it should run on a dedicated node.

  14. 4. Hadoop Configuration – Environment Settings
  • Memory
  • By default, Hadoop allocates 1000 MB of memory to each daemon it runs.
  • The maximum number of map tasks that will be run on a tasktracker at one time is controlled by the mapred.tasktracker.map.tasks.maximum property, which defaults to two tasks.
  • The memory given to each of these child JVMs can be changed by setting the mapred.child.java.opts property.
  • Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization (see the configuration sketch after this list).
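
A minimal mapred-site.xml sketch for these two properties; the task count and heap size are illustrative values, not recommendations:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>          <!-- run up to 4 map tasks at a time on this tasktracker -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx400m</value>   <!-- heap size for each child task JVM -->
    </property>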

  15. 4. Hadoop Configuration – Environment Settings
  • Java
  • The location of the Java implementation to use is determined by the JAVA_HOME setting in hadoop-env.sh.
  • System logfiles
  • System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default. This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh.
  • Old logfiles are never deleted, so you should arrange for them to be periodically deleted or archived, so as not to run out of disk space on the local node (see the hadoop-env.sh sketch after this list).
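
A minimal hadoop-env.sh sketch; the Java path and log directory are illustrative and depend on your installation:

    # hadoop-env.sh (paths are examples only)
    export JAVA_HOME=/usr/lib/jvm/java-6-sun    # Java implementation to use
    export HADOOP_LOG_DIR=/var/log/hadoop       # keep logfiles outside the install directory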

  16. 4. Hadoop Configuration – Important Hadoop Daemon Properties
  • The properties are set in the Hadoop site files: core-site.xml, hdfs-site.xml, and mapred-site.xml.
  • A typical example set of files is sketched below (core-site.xml) and on the next slides (hdfs-site.xml and mapred-site.xml).
  • Notice that most properties are marked as final, in order to prevent them from being overridden by job configurations.
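
A minimal core-site.xml sketch, assuming a namenode host named namenode1 (illustrative) and the default HDFS port:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode1/</value>   <!-- default filesystem; default port 8020 implied -->
        <final>true</final>                <!-- cannot be overridden by job configurations -->
      </property>
    </configuration>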

  17. 4. Hadoop Configuration – Important Hadoop Daemon Properties
  • HDFS
  • There are a few other configuration properties you should set for HDFS: those that set the storage directories for the namenode and for datanodes.
  • The property dfs.name.dir specifies a list of directories where the namenode stores persistent filesystem metadata.
  • You should also set the dfs.data.dir property, which specifies a list of directories for a datanode to store its blocks.
  • Finally, you should configure where the secondary namenode stores its checkpoints of the filesystem (see the hdfs-site.xml sketch after this list).
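
A minimal hdfs-site.xml sketch; the directory paths are illustrative, and fs.checkpoint.dir is the Hadoop 1.x-era property for the secondary namenode's checkpoints:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>/disk1/hdfs/name,/remote/hdfs/name</value>   <!-- namenode metadata, written to both -->
        <final>true</final>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/disk1/hdfs/data,/disk2/hdfs/data</value>    <!-- datanode block storage -->
        <final>true</final>
      </property>
      <property>
        <name>fs.checkpoint.dir</name>
        <value>/disk1/hdfs/namesecondary</value>            <!-- secondary namenode checkpoints -->
        <final>true</final>
      </property>
    </configuration>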

  18. 4. Hadoop Configuration – Important Hadoop Daemon Properties
  • MapReduce
  • To run MapReduce, you need to designate one machine as a jobtracker.
  • Set the mapred.job.tracker property to the hostname or IP address and port that the jobtracker will listen on.
  • Note that this property is not a URI, but a host-port pair, separated by a colon (see the mapred-site.xml sketch after this list).
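
A minimal mapred-site.xml sketch, assuming a jobtracker host named jobtracker1 and the conventional port 8021 (both illustrative):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker1:8021</value>   <!-- host:port pair, not a URI -->
        <final>true</final>
      </property>
    </configuration>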

  19. 4. Hadoop Configuration – Other Hadoop Properties
  • Cluster membership
  • To aid the addition and removal of nodes in the future, you can specify a list of authorized machines that may join the cluster as datanodes or tasktrackers. The list is specified using the dfs.hosts and mapred.hosts properties.
  • Service-level authorization
  • You can define ACLs to control which users and groups have permission to connect to each Hadoop service (hadoop.security.authorization property).
  • Buffer size
  • 64 KB or 128 KB are common choices. Set this using the io.file.buffer.size property in core-site.xml.

  20. 4. Hadoop Configuration – Other Hadoop Properties
  • HDFS block size
  • The HDFS block size is 64 MB by default. Set this using the dfs.block.size property in hdfs-site.xml.
  • Trash
  • Hadoop filesystems have a trash facility, in which deleted files are not actually deleted but moved to a trash folder. Set this using the fs.trash.interval property in core-site.xml.
  • Task memory limits
  • You can set the mapred.child.ulimit property, which sets a maximum limit on the virtual memory of the child process launched by the tasktracker (a few of these properties are sketched after this list).
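
A small sketch covering a few of the properties from the last two slides; the values are illustrative, not recommendations:

    <!-- core-site.xml -->
    <property>
      <name>io.file.buffer.size</name>
      <value>65536</value>        <!-- 64 KB I/O buffer -->
    </property>
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>         <!-- keep deleted files in trash for one day (value in minutes) -->
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>    <!-- 128 MB blocks, in bytes (default is 64 MB) -->
    </property>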

  21. 5. Post Install
  • Once you have a Hadoop cluster up and running, you need to give users access to it.
  • This involves creating a home directory for each user and setting ownership permissions on it (see the example commands below).
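
A minimal sketch for one user; the account name username is a placeholder:

    # Create the user's HDFS home directory and hand ownership to that user
    % hadoop fs -mkdir /user/username
    % hadoop fs -chown username:username /user/username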

  22. 6. Benchmarking a Hadoop Cluster
  • Is the cluster set up correctly?
  • The best way to answer this question is empirically: run some jobs and confirm that you get the expected results.
  • Benchmarks make good tests. Benchmarks are packaged in the test JAR file.
  • Benchmarking HDFS with TestDFSIO
  • TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel.
  • The following command writes 10 files of 1,000 MB each (see the example commands after this list).
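
A sketch of the TestDFSIO runs, assuming the Hadoop 1.x-era test JAR name under $HADOOP_INSTALL; a read run and a cleanup run typically follow the write:

    # Write 10 files of 1,000 MB each
    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

    # Read the same files back, then clean up the benchmark output
    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean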

  23. 6. Benchmarking a Hadoop Cluster
  • Benchmarking MapReduce with Sort
  • The three steps are: generate some random data, perform the sort, then validate the results.
  • First, we generate some random data using RandomWriter.
  • Next, we can run the Sort program.
  • As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted (see the example commands after this list).
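
A sketch of the three steps, again assuming the 1.x-era examples and test JAR names; the random-data and sorted-data directory names are conventional examples:

    # 1. Generate random input data with RandomWriter
    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data

    # 2. Sort the data with the Sort program
    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data

    # 3. Validate that sorted-data is correctly sorted
    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data -sortOutput sorted-data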

  24. 6. Benchmarking a Hadoop Cluster
  • There are many more Hadoop benchmarks, but the following are widely used.
  • MRBench runs a small job a number of times. It acts as a good counterpoint to the sort benchmark, as it checks whether small job runs are responsive (see the example command after this list).
  • NNBench is useful for load testing namenode hardware.
  • Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice.
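
A hedged sketch of invoking MRBench from the same test JAR; the run count is illustrative:

    # Run a small job 50 times to check that small job runs are responsive
    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar mrbench -numRuns 50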
