Hadoop IT Services

"Learn how Hadoop can help you with parallel processing of large amounts of data, performing analytics on a big scale, and dealing with diverse data structures. Explore the various components and configuration options available in Hadoop, and discover the recent activities and future plans for Hadoop at CERN."

stasia

Presentation Transcript


  1. Hadoop IT Services
     Hadoop Users Forum, CERN, October 7th, 2015
     CERN IT-D*

  2. Hadoop
     A framework for large scale data processing
     • Distributed storage and processing
     • Shared nothing architecture: scales horizontally
     • Optimized for high throughput on sequential data access
     [Diagram: nodes (Node 1 to Node 5, Node X), each with its own CPU, memory and disks, connected by an interconnect network]
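The shared-nothing processing model on this slide can be illustrated with a minimal, single-process sketch of the map, shuffle and reduce phases. This is not Hadoop code: the function names and the sample partitions are hypothetical, and on a real cluster each map call would run on the node that stores that partition.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Emit (word, 1) pairs for one partition; needs no data from other partitions."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # Each partition is mapped independently; on a cluster these calls run in parallel.
    mapped = [map_phase(p) for p in partitions]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two partitions, as if stored on two different nodes.
partitions = [
    ["hadoop stores data", "hadoop processes data"],
    ["data scales out"],
]
counts = word_count(partitions)
print(counts)
```

Because the map step only touches its own partition, adding nodes adds capacity without shared state, which is what lets the architecture scale horizontally.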

  3. How Hadoop Can Help You
     • Parallel processing of large amounts of data
     • Performing analytics on a big scale
     • Dealing with diverse data: structured, semi-structured, unstructured
     • ‘Cold’ storage / archives
     Performance is usually suboptimal for:
     • Random reads and real-time access
     • ‘Small’ datasets

  4. There are already interesting use cases of Hadoop @CERN
     • WLCG grid monitoring
     • Data transfers etc.
     • ATLAS events indexing
     • CASTOR log aggregation
     • Data warehousing
     • Logging/time series data
     • IT monitoring

  5. Hadoop Service in IT
     • Set up and run the infrastructure
     • Provide consultancy
     • Build the community
     • Joint work
       • IT-DB and IT-DSS

  6. Hadoop Clusters in IT (Oct 2015)
     • lxhadoop (22 nodes)
       • general purpose cluster (mainly used by ATLAS)
       • stable software setup
       • recent hardware
     • analytix (56 nodes)
       • for analysis of monitoring data
       • varied hardware specifications
       • the biggest in terms of number of nodes
     • hadalytic (17 nodes)
       • general purpose cluster with additional services
       • recent hardware

  7. Many Configuration Options
     • Hadoop is a platform
     • Many components and key decisions in the implementation
     • Rapidly evolving field
     • Examples
       • Data access: domain specific language or SQL
       • Many components and data formats
       • Data loading and unloading tools
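To make the "domain specific language or SQL" choice concrete, here is a small sketch that computes the same aggregation both ways. It is only an analogy: Python's built-in sqlite3 stands in for a SQL-on-Hadoop engine, a generator pipeline stands in for a dataflow-style script, and the log records are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical log records: (host, status).
rows = [
    ("node1", "ok"),
    ("node2", "error"),
    ("node1", "error"),
    ("node1", "error"),
]

# SQL style: declare the result, let the engine plan the execution.
# sqlite3 is an assumption used here only as a local stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (host TEXT, status TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT host, COUNT(*) FROM logs WHERE status = 'error' GROUP BY host"
    ).fetchall()
)
conn.close()

# Dataflow style: spell out the steps (filter, then count per host).
error_hosts = (host for host, status in rows if status == "error")
dsl_result = dict(Counter(error_hosts))

print(sql_result, dsl_result)
```

Both approaches produce the same counts; the difference is whether the user states the desired result or the sequence of transformations.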

  8. Currently available components
     • Sqoop: data exchange with RDBMS
     • Pig: scripting
     • Hive: SQL
     • Spark: large scale data processing
     • Flume: log data collector
     • Impala: SQL
     • HBase: NoSQL columnar store
     • ZooKeeper: coordination
     • MapReduce
     • YARN: cluster resource manager
     • HDFS: Hadoop Distributed File System
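As a rough illustration of what HDFS, the storage layer at the bottom of this stack, does with a file, the sketch below splits a file into fixed-size blocks and assigns replicas to nodes. The 128 MB block size and replication factor of 3 are the usual HDFS defaults (both configurable). The round-robin placement, the function name `plan_blocks` and the node names are simplifying assumptions; real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS default block size (configurable)
REPLICATION = 3                 # common HDFS default replication factor (configurable)


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica]).

    Placement is simple round-robin, an assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, replicas))
    return plan


# Hypothetical 4-node cluster storing a 300 MB file.
nodes = ["node1", "node2", "node3", "node4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)
for index, replicas in plan:
    print(index, replicas)
```

Since every block lives on several nodes, a processing task can be scheduled on a node that already holds the data it needs.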

  9. Software version policy
     • Align to CDH distributions

  10. Maintenance activities
     • Actions
       • Upgrades to a newer CDH
     • Frequency
       • Typically twice a year
     • Impact
       • Downtime of 1-3 hours

  11. Recent activities (last 3 months)
     • Hadoop tutorials during the summer
     • Deployment of the Cloudera Impala component
     • Monitoring of hanging HBase region servers
     • Self-service Oracle2Hadoop integration (work in progress)
     • Building a database of users’ data sources

  12. Contact points
     • Service is available in SNOW
       • SE: Hadoop Service
       • FE: Hadoop Components
       • FE: Hadoop Core
     • E-group: it-analytics-wg@cern.ch
     • Show up at the Wednesday meeting
       • Analytic Working Group
       • Hadoop User Forum

  13. How to Learn More
     • Hadoop tutorials at CERN, summer 2015
       • Introduction to Hadoop (Architecture, HDFS, MapReduce, Spark): https://indico.cern.ch/event/404527/
       • SQL on Hadoop (Hive, Impala): https://indico.cern.ch/event/434650/
       • NoSQL on Hadoop (HBase): https://indico.cern.ch/event/442004/
     • We plan to do more/repeats in the future

  14. Future plans
     • Infrastructure
       • HDFS backups
       • Rolling upgrades
       • Support from Cloudera?
     • Users community
       • Write a Knowledge Base (SNOW)
     • New features/technology testing
       • Kudu: a new columnar storage engine from Cloudera
       • Tachyon: in-memory file system
