1 / 19

项目调研: Hadoop

软件体系结构作业 1. 项目调研: Hadoop. 杨晓亮 MG0933047 南京大学计算机科学与技术系 yangxiaoliang2006@gmail.com 2010-3-23. 内容. H adoop 介绍 H adoop 的体系结构 H adoop 的应用. 2014/11/18. 2. 引言. KB 1000 MB 1000,000 GB 1000,000,000 TB 1000,000,000,000 PB 1000,000,000,000,000 … …. 我们今天所要面对的数据量. Google 处理的数据量.

Télécharger la présentation

项目调研: Hadoop

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 软件体系结构作业1 项目调研:Hadoop 杨晓亮MG0933047南京大学计算机科学与技术系yangxiaoliang2006@gmail.com2010-3-23

  2. 内容 Hadoop介绍 Hadoop的体系结构 Hadoop的应用 2014/11/18 2

  3. 引言 • KB 1000 • MB 1000,000 • GB 1000,000,000 • TB 1000,000,000,000 • PB 1000,000,000,000,000 • … … 我们今天所要面对的数据量

  4. Google 处理的数据量 • Google processes over 20 petabytes of data per day(Niall Kennedy,Jan/2008) • 0.34秒…

  5. Google的三个分布式基础设施 • GFS(Google File System) • MapReduce • BigTable • Google先后发表了几篇重要的文章: • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 2003 • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplied Data Processing on Large Clusters. 2004 • Fay Chang, Jeffrey Dean, Sanjay Ghemawat. Bigtable: A Distributed Storage System for Structured Data. 2006 • Mike Burrows. The Chubby lock service for loosely-coupled distributed systems.

  6. Hadoop 介绍 • Hadoop是 Apache 的一个开源软件项目,由Doug Cutting在2004年开始开发。 • Hadoop是一个海量数据存储和计算的分布式系统,它由若干个成员组成,主要包括:HDFS、MapReduce、HBase、Hive、Pig 和 ZooKeeper, 其中HDFS是Google的GFS开源版本, HBase 是Google的 BigTable开源版本,ZooKeeper是Google的Chubby开源版本。 • Hadoop在大量的公司中被使用和研究

  7. Hadoop 介绍 A9.com - Amazon We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools. We process millions of sessions daily for analytics, using both the Java and streaming APIs. 2014/11/18 7

  8. Hadoop 介绍 Yahoo! More than 100,000 CPUs in >25,000 computers running Hadoop Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) Used to support research for Ad Systems and Web Search Also used to do scaling tests to support development of Hadoop on larger clusters 2014/11/18 8

  9. Hadoop 介绍 Facebook We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. Currently we have 2 major clusters: A 1100-machine cluster with 8800 cores and about 12 PB raw storage. A 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage. 2014/11/18 9

  10. Hadoop 介绍 Baidu - the leading Chinese language search engine Hadoop used to analyze the log of search and do some mining work on web page database handle about 3000TB per week 2014/11/18 10

  11. Hadoop 介绍 在中国,包括中国移动、网易、淘宝、腾讯、金山和华为等众多公司都在研究和使用它 2014/11/18 11

  12. Hadoop 的体系结构 Hadoop由以下几个部件组成: Hadoop Common: The common utilities that support the other Hadoop subprojects. Avro: A data serialization system that provides dynamic integration with scripting languages. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. HDFS: A distributed file system that provides high throughput access to application data. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. MapReduce: A software framework for distributed processing of large data sets on compute clusters. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications. 2014/11/18 12

  13. Hadoop 的体系结构-HDFS HDFS的结构按照GFS设计 A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients 2014/11/18 13

  14. Hadoop 的体系结构-HDFS 2014/11/18 14

  15. Hadoop 的体系结构-MapReduce Architecture 2014/11/18 15

  16. Execution Overview Hadoop 的体系结构-MapReduce 2014/11/18 16

  17. MapReduce的数据流程 2014/11/18 17

  18. Hadoop的应用 As of October 2009, commercial applications of Hadoopincluded(from Wiki) Log analysis of various kinds Marketing analytics Machine learning and/or sophisticated data mining Image processing Processing of XML messages Web crawling and/or text processing …… 2014/11/18 18

  19. Referecne http://hpc.cs.tsinghua.edu.cn/dpcourse/index.htm http://hadoop.apache.org/ http://en.wikipedia.org/wiki/Hadoop 2014/11/18 19

More Related