项目调研： Hadoop

软件体系结构作业1 项目调研：Hadoop 杨晓亮MG0933047南京大学计算机科学与技术系yangxiaoliang2006@gmail.com2010-3-23

内容 Hadoop介绍 Hadoop的体系结构 Hadoop的应用 2014/11/18 2

引言 • KB 1000 • MB 1000,000 • GB 1000,000,000 • TB 1000,000,000,000 • PB 1000,000,000,000,000 • … … 我们今天所要面对的数据量

Google 处理的数据量 • Google processes over 20 petabytes of data per day（Niall Kennedy，Jan/2008） • 0.34秒…

Google的三个分布式基础设施 • GFS（Google File System） • MapReduce • BigTable • Google先后发表了几篇重要的文章： • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 2003 • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplied Data Processing on Large Clusters. 2004 • Fay Chang, Jeffrey Dean, Sanjay Ghemawat. Bigtable: A Distributed Storage System for Structured Data. 2006 • Mike Burrows. The Chubby lock service for loosely-coupled distributed systems.

Hadoop 介绍 • Hadoop是 Apache 的一个开源软件项目,由Doug Cutting在2004年开始开发。 • Hadoop是一个海量数据存储和计算的分布式系统，它由若干个成员组成，主要包括：HDFS、MapReduce、HBase、Hive、Pig 和 ZooKeeper，其中HDFS是Google的GFS开源版本， HBase 是Google的 BigTable开源版本，ZooKeeper是Google的Chubby开源版本。 • Hadoop在大量的公司中被使用和研究

Hadoop 介绍 A9.com - Amazon We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools. We process millions of sessions daily for analytics, using both the Java and streaming APIs. 2014/11/18 7

Hadoop 介绍 Yahoo! More than 100,000 CPUs in >25,000 computers running Hadoop Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) Used to support research for Ad Systems and Web Search Also used to do scaling tests to support development of Hadoop on larger clusters 2014/11/18 8

Hadoop 介绍 Facebook We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. Currently we have 2 major clusters: A 1100-machine cluster with 8800 cores and about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage. 2014/11/18 9

Hadoop 介绍 Baidu - the leading Chinese language search engine Hadoop used to analyze the log of search and do some mining work on web page database handle about 3000TB per week 2014/11/18 10

Hadoop 介绍 在中国，包括中国移动、网易、淘宝、腾讯、金山和华为等众多公司都在研究和使用它 2014/11/18 11

Hadoop 的体系结构 Hadoop由以下几个部件组成： Hadoop Common: The common utilities that support the other Hadoop subprojects. Avro: A data serialization system that provides dynamic integration with scripting languages. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. HDFS: A distributed file system that provides high throughput access to application data. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. MapReduce: A software framework for distributed processing of large data sets on compute clusters. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications. 2014/11/18 12

Hadoop 的体系结构-HDFS HDFS的结构按照GFS设计 A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients 2014/11/18 13

Hadoop 的体系结构-HDFS 2014/11/18 14

Hadoop 的体系结构-MapReduce Architecture 2014/11/18 15

Execution Overview Hadoop 的体系结构-MapReduce 2014/11/18 16

MapReduce的数据流程 2014/11/18 17

Hadoop的应用 As of October 2009, commercial applications of Hadoopincluded（from Wiki） Log analysis of various kinds Marketing analytics Machine learning and/or sophisticated data mining Image processing Processing of XML messages Web crawling and/or text processing …… 2014/11/18 18

Referecne http://hpc.cs.tsinghua.edu.cn/dpcourse/index.htm http://hadoop.apache.org/ http://en.wikipedia.org/wiki/Hadoop 2014/11/18 19

项目调研： Hadoop