
Cloud Distributed Computing Environment








  1. Cloud Distributed Computing Environment Content of this lecture is primarily from the book "Hadoop: The Definitive Guide" (2nd edition).

  2. Hadoop is an open-source software system that provides a distributed computing environment on the cloud (data centers). • It is a state-of-the-art environment for Big Data computing. • It provides two key services: - reliable data storage, using the Hadoop Distributed File System (HDFS) - distributed data processing, using a technique called MapReduce.

  3. History of Hadoop • Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. • Hadoop has its origins in Apache Nutch, an open-source web search engine.

  4. In 2004, Google published the paper that introduced MapReduce to the world. • Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS (the Nutch Distributed File System).

  5. In February 2006, NDFS and the MapReduce implementation moved out of Nutch to form an independent subproject of Lucene called Hadoop. • At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. • In February 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

  6. In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. • By this time, Hadoop was being used by many companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. • In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. • In November 2008, Google reported that its MapReduce implementation had sorted one terabyte in 68 seconds.

  7. Introduction • HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. - Very large files: some Hadoop clusters store petabytes of data. - Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. - Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware; it is designed to run on clusters of commodity hardware.

  8. Basic Concepts • Blocks - Files in HDFS are broken into block-sized chunks, and each chunk is stored as an independent unit. - By default, each block is 64 MB.
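The block size can be overridden per cluster. A minimal sketch of an hdfs-site.xml entry (the 128 MB value is illustrative, and the property name follows the Hadoop 1.x releases the book covers; later releases rename it dfs.blocksize):

    <!-- hdfs-site.xml: illustrative override of the 64 MB default -->
    <configuration>
      <property>
        <name>dfs.block.size</name>
        <value>134217728</value> <!-- block size in bytes: 128 MB -->
      </property>
    </configuration>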

  9. - Some benefits of splitting files into blocks: -- A file can be larger than any single disk in the network. -- Blocks fit well with replication, which provides fault tolerance and availability. To insure against corrupted blocks and disk/machine failures, each block is replicated to a small number (typically three) of physically separate machines.
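To see how a file is actually split into blocks and where the replicas live, Hadoop's standard fsck tool can be used (a usage sketch; the path is illustrative):

    # list each file under /user/tom with its blocks and their locations
    hadoop fsck /user/tom -files -blocks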

  10. Namenodes and Datanodes - The namenode manages the filesystem namespace. -- It maintains the filesystem tree and the metadata for all files and directories. -- It also holds the locations of the blocks for a given file. - Datanodes store the blocks of files and report back to the namenode periodically.
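On a live cluster, the namenode's view of its datanodes can be inspected with the standard admin command (a usage sketch; output varies by cluster):

    # print capacity, usage, and status for every datanode known to the namenode
    hadoop dfsadmin -report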

  11. The Command-Line Interface • Copy a file from the local filesystem to HDFS. • Copy the file from HDFS back to the local filesystem. • Compare the two local files. (The commands are sketched below.)
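A sketch of those three steps, following the book's running example (the file quangle.txt and the /user/tom directory are illustrative names):

    # copy a local file into HDFS
    hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
    # copy it back out of HDFS
    hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt
    # compare checksums of the two local files; they should match
    md5sum input/docs/quangle.txt quangle.copy.txt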

  12. Hadoop Filesystems • The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop. There are several concrete implementations of this abstract class; HDFS is one of them.
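As an illustrative sketch (the hdfs://namenode/ address is a placeholder), the URI scheme determines which concrete implementation FileSystem.get() returns:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class WhichFileSystem {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI scheme selects the implementation:
        // hdfs:// yields DistributedFileSystem, the local filesystem yields LocalFileSystem.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
        FileSystem local = FileSystem.getLocal(conf);
        System.out.println(hdfs.getClass().getName());
        System.out.println(local.getClass().getName());
      }
    }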

  13. HDFS Java Interface • How to read data from HDFS in Java programs • How to write data to HDFS in Java programs

  14. Reading data using the FileSystem API
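The code on the original slide is not in this transcript; the following is a minimal sketch along the lines of the book's FileSystemCat example, printing a file whose URI is given on the command line:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://namenode/user/tom/quangle.txt (placeholder)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri)); // returns an FSDataInputStream
          IOUtils.copyBytes(in, System.out, 4096, false); // copy to stdout, keep stdout open
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }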

  15. Writing data
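Likewise, a minimal sketch modeled on the book's FileCopyWithProgress example, copying a local file to a (placeholder) HDFS destination; the Progressable callback prints a dot as data is written:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.util.Progressable;

    public class FileCopyWithProgress {
      public static void main(String[] args) throws Exception {
        String localSrc = args[0]; // local source file
        String dst = args[1];      // e.g. hdfs://namenode/user/tom/copy.txt (placeholder)
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
          public void progress() {
            System.out.print("."); // invoked periodically as the write makes progress
          }
        });
        IOUtils.copyBytes(in, out, 4096, true); // true: close both streams when done
      }
    }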

  16. Data Flow
