
Presentation Transcript


  1. Outline • Part I: Introduction (Pedro A.) • Part II: Technical Solutions (Massimo P., Benjamin F.) • Transport • Long Term Repository • Analytics • Visualization • Part III: Experience by Services (Stefano Z., Spyros L.) • OpenStack Monitoring • Batch LSF Monitoring

  2. History 2012 ITTF slide • Motivation • Several independent monitoring activities in IT • similar overall approach, different tool-chains, similar limitations • High level services are interdependent • combination of data from different groups necessary, but difficult • Understanding performance became more important • requires more combined data and complex analysis • Move to a virtualized dynamic infrastructure • comes with complex new requirements on monitoring • Challenges • Find a shared architecture and tool-chain components while preserving our investment in monitoring

  3. Architecture • [Architecture diagram] Producers feed a common Transport layer; feeds deliver the data to the Long Term Repository, the Analysis layer, and Notifications, with Visualization and Applications on top.

  4. Strategy • Adopt open source tools • For each architecture block look outside for solutions • Large adoption and strong community support • Fast to adopt, test, and deliver • Easily replaceable by other (better) future solutions • Integrate with new CERN infrastructure • AI project, OpenStack, Puppet, Roger, etc. • Focus on simple adoption (e.g. puppet modules)

  5. Technology Part II

  6. Community • DB (web apps) • DSS (castor) • OIS (openstack) • PES (batch lsf) • SDC (wlcgmon.) • Sec. (netlog, snoopy) • CF (lemon, syslog) • Same technologies being used by different teams • HDFS: lemon, syslog, openstack, batch, security, castor • ES: lemon, syslog, openstack, batch • Kibana: lemon, syslog, openstack, batch

  7. Part II Transport

  8. Motivation • Scalable transport needed • Collect operations data • lemon metrics and syslog • 3rd party applications • Easy integration with providers/consumers • Apache Flume

  9. Flume • Distributed service for collecting large amounts of data • Robust and fault tolerant • Horizontally scalable • Many ready to be used input/output plugins • Java based • Apache license • Cloudera is the main contributor • Using their releases • Less frequent but more stable releases

  10. Data Flow • Flume event • Byte payload + set of string headers • Flume agent • JVM process hosting one or more “source -> channel -> sink” flows (see the event sketch below)
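A minimal sketch of what an event looks like on the wire, assuming a hypothetical agent with an HTTP source listening on port 8081 (host, port and header names are illustrative): Flume's JSON handler accepts an array of events, each made of string headers plus a body.

# Post one event (headers + byte payload) to a hypothetical Flume HTTP source
curl -X POST http://flume-gateway.example.org:8081 \
  -H 'Content-Type: application/json' \
  -d '[{"headers": {"host": "node001.example.org", "type": "lemon"},
        "body": "cpu.loadavg 0.42"}]'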

  11. Sources and Sinks • Many ready-to-be-used plugins • Sources • Avro, Thrift, JMS, Spool dir, Syslog, HTTP, … • Custom sources can be easily implemented • we do have a dirq source for our use case • Interceptors • Decorate events, filter events

  12. Sources and Sinks • Many ready-to-be-used plugins • Channels • Memory, File, JDBC • Custom channels can be easily implemented • Sinks • Avro, Thrift, ElasticSearch, Hadoop HDFS & HBase, Java Logger, IRC, File, Null • Custom sinks can be easily implemented
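To make the pieces concrete, here is a minimal sketch of a single-agent configuration wiring one source, one channel and one sink together; all names, ports and paths are hypothetical and not taken from the actual deployment.

# Hypothetical single-agent Flume configuration: syslog source -> file channel -> HDFS sink
cat > agent.conf <<'EOF'
agent.sources  = syslog-in
agent.channels = ch
agent.sinks    = hdfs-out

# Source: receive syslog messages over TCP
agent.sources.syslog-in.type     = syslogtcp
agent.sources.syslog-in.host     = 0.0.0.0
agent.sources.syslog-in.port     = 5140
agent.sources.syslog-in.channels = ch

# Channel: durable, file-backed buffer between source and sink
agent.channels.ch.type = file

# Sink: write events to HDFS, partitioned by day
agent.sinks.hdfs-out.type                   = hdfs
agent.sinks.hdfs-out.channel                = ch
agent.sinks.hdfs-out.hdfs.path              = hdfs://namenode.example.org/monitoring/syslog/%Y-%m-%d
agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
EOF

# Start the agent with this configuration
flume-ng agent --name agent --conf-file agent.conf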

  13. Other Features • Fan-in and fan-out • Enable load balancing • Contextual routing • Based on logic implemented through selectors (see the routing sketch below) • Multi-hop flows • Enable layered topologies • Increase reliability and failure resistance
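As a sketch of contextual routing (the routed header and channel names are hypothetical), a multiplexing selector can be added to the source of the previous sketch so that events go to different channels, and hence different sinks, depending on a header value.

# Extend the hypothetical agent.conf above: route events by the value of the "type" header
# (definitions of the ch-es and ch-hdfs channels are omitted for brevity)
cat >> agent.conf <<'EOF'
agent.channels                                 = ch-es ch-hdfs
agent.sources.syslog-in.channels               = ch-es ch-hdfs
agent.sources.syslog-in.selector.type          = multiplexing
agent.sources.syslog-in.selector.header        = type
agent.sources.syslog-in.selector.mapping.lemon = ch-es
agent.sources.syslog-in.selector.default       = ch-hdfs
EOF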

  14. Limitations • Routing is static • On-demand subscriptions are not possible • Requires reconfiguration and restart • No authentication/authorization features • Secure transport is available • Java process on the client side • A smaller memory footprint would be nicer

  15. Our Deployment

  16. Our Deployment • Producers • All Puppet nodes • Lemon, Syslog, 3rd party applications • Gateway routing layer • 10 VMs behind a DNS load balancer • ElasticSearch sink • 5 VMs behind a DNS load balancer • Inserting into ElasticSearch • Hadoop HDFS sink • 5 VMs behind a DNS load balancer • Inserting into Hadoop HDFS

  17. Feedback • Needs tuning to correctly size flume layers • Available sources and sinks saved a lot of time

  18. Long Term Repository Part II

  19. Motivation • Store raw operations data • Long term archival required • Allow future data replay to other tools • Feed the real-time engine • Offline processing of collected data • Security data? Syslog data? • Apache Hadoop/HDFS

  20. Apache Hadoop • Framework that allows the distributed processing of large data sets across clusters • HDFS is a distributed filesystem designed to run on commodity hardware • Suitable for applications with large data sets • Designed for batch processing rather than interactive use • High throughput preferred to low latency access
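A minimal sketch of the batch-oriented way data is handled (paths and file names are hypothetical): files are copied into HDFS in bulk and read back later by offline jobs.

# Hypothetical paths: copy one day's collected logs into HDFS and check the space used
hdfs dfs -mkdir -p /monitoring/syslog/2013-09-01
hdfs dfs -put syslog-2013-09-01.log /monitoring/syslog/2013-09-01/
hdfs dfs -du -h /monitoring/syslog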

  21. Limitations • Small files are not welcome • Blocks of 64 MB or 128 MB • Limit of tens of millions of files per cluster • The NameNode keeps the file map in memory • Transparent compression not available • Raw text could take much less space • Real-time data access is not possible

  22. Our Usage • Cluster provided by IT/DSS • ~500 TB, 13 data nodes • Data stored by hostgroup • 1.8 TB in total since mid-July 2013 • Daily jobs aggregate data by month, since HDFS prefers large files over many small ones (a sketch follows below)
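A sketch of what such a consolidation step could look like (the directory layout is hypothetical): the many small files of one month are merged into a single large file and written back.

# Hypothetical layout: merge one month of small files into a single large file
hdfs dfs -getmerge /monitoring/syslog/2013-07 syslog-2013-07.merged
hdfs dfs -put syslog-2013-07.merged /monitoring/syslog/monthly/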

  23. Part II Analytics

  24. Motivation • Real-time queries, clear API • Limited data retention • Multiple scopes and technologies • Horizontally scalable and easy to deploy • ElasticSearch

  25. ElasticSearch • Distributed RESTful search and analytics engine

  26. ElasticSearch Real time • Acquisition: data is indexed in real time • Analytics: explore, understand your data

  27. ElasticSearch • Schema free • No prior data declaration required • but possible, to optimize • Data is injected as-is • Automatic data type discovery • Document oriented (JSON), as in the sketch below
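A minimal indexing sketch (index name, type and fields are hypothetical): a JSON document is sent as-is and the field types are inferred from the values.

# Index one JSON document without declaring a schema first (hypothetical index/type/fields)
curl -XPOST 'http://localhost:9200/flume-lemon-2013-09-01/metric' -d '{
  "host":      "node001.example.org",
  "metric":    "cpu.loadavg",
  "value":     0.42,
  "timestamp": "2013-09-01T12:00:00Z"
}'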

  28. ElasticSearch • Full text search • Apache Lucene is used to provide full text search (see the Apache Lucene documentation) • But not only text • integer/long • float/double • boolean • date • binary • ...
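A sketch of a full-text query through the REST interface (index and field names are hypothetical), using the JSON query DSL:

# Hypothetical index and field names: full-text "match" query via the query DSL
curl -XGET 'http://localhost:9200/flume-syslog-2013-09-01/_search?pretty' -d '{
  "query": { "match": { "message": "kernel panic" } },
  "size": 10
}'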

  29. ElasticSearch • High availability • Shards and replicas auto balanced • RESTful JSON API
[root@es-search-node ~]$ curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "itmon-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2990,
  "active_shards" : 8970,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

  30. ElasticSearch • Used by many large companies • Soundcloud • “To provide immediate and relevant results for their online audio distribution platform reaching 180 million people” • Github • “20TB of data using ElasticSearch, including 1.3 billion files and 130 billion lines of code” • Foursquare, Stackoverflow, Salesforce, ... • Distributed under the Apache license

  31. Limitations • Requires a lot of RAM (Java) • Especially on data nodes • IO intensive • Take this into account when planning the deployment • Shard re-initialisation takes some time (~1h) • Lots of shards and replicas per index, lots of indexes • Not a frequent operation, only after a full cluster reboot • Authentication not built-in (requires a makeshift setup) • Apache+Shibboleth on top of the Jetty plugin

  32. Our Deployment • Fully puppetized • Production cluster • 2 master nodes (no data) • 16GB RAM, 8 cores CPU • 1 search node (no data) • 16GB RAM, 8 cores CPU • 8 data nodes • 48GB RAM, 24 cores CPU • 500GB SSD • Development cluster • Based on medium and large VMs

  33. Our Deployment • Security: Jetty plugin • Access control, SSL (also request logging, Gzip) • Monitoring: many plugins (installed as sketched below) • ElasticHQ, BigDesk, Head, Paramedic, ...
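As an installation sketch (plugin identifier as published on GitHub at the time; paths depend on the install), site plugins are added with the bundled plugin command and then served by ElasticSearch itself:

# Install the Head site plugin, then browse it at http://localhost:9200/_plugin/head/
bin/plugin -install mobz/elasticsearch-head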

  34. Our Deployment • 1 index per day • flume-lemon-YYYY-MM-DD • flume-syslog-YYYY-MM-DD • 90 days TTL • 10 shards per index • 2 replicas per shard (one way to apply such settings is sketched below)
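One way to apply such per-index settings to every new daily index is an index template; this is a sketch assuming the ElasticSearch versions of that period (template name and pattern are hypothetical, and _ttl is the time-to-live mechanism available then):

# Hypothetical template: 10 shards, 2 replicas and a 90-day TTL for every flume-* index
curl -XPUT 'http://localhost:9200/_template/flume-daily' -d '{
  "template": "flume-*",
  "settings": {
    "number_of_shards":   10,
    "number_of_replicas": 2
  },
  "mappings": {
    "_default_": {
      "_ttl": { "enabled": true, "default": "90d" }
    }
  }
}'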

  35. Demo Production Cluster • ElasticHQ • HEAD

  36. Feedback • Easy to deploy and manage • Robust, fast, and rich API • Easy query language (DSL) • More features coming with aggregation framework

  37. Part II Visualisation

  38. Motivation • Dedicated, dynamic and user-friendly dashboards • Horizontally scalable and easy to deploy • Kibana

  39. Kibana Visualize time-stamped data from ElasticSearch

  40. Kibana • “Make sense of a mountain of logs” • Designed to analyze logs • Perfectly fits timestamped data (e.g. metrics) • Profits from ElasticSearch's power • Search/analyze features are exploited

  41. Kibana • No code required • Simply point & click to build your own dashboard

  42. Kibana • Open source, community driven • Now fully integrated and supported by ElasticSearch • Code and feature contributions have been provided

  43. Kibana • Built with AngularJS • JavaScript MVC framework for rich client-side applications • Developed and maintained by Google • No backend: the web server delivers only static files • The JavaScript code queries ElasticSearch directly

  44. Kibana • Easy to install • “git clone” OR “tar -xvzf” OR ElasticSearch plugin • Easy to configure • 1-line config file to point to the ElasticSearch cluster (a sketch follows below) • Saves its own configuration in ElasticSearch itself • Possible to export/import dashboard configurations
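A sketch of that installation and one-line configuration, assuming the Kibana 3 layout of that period (repository URL, host name and the exact location of config.js are illustrative):

# Fetch the static files and point the single elasticsearch line in config.js at the cluster
git clone https://github.com/elasticsearch/kibana.git
sed -i 's|elasticsearch:.*|elasticsearch: "http://es-cluster.example.org:9200",|' kibana/config.js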

  45. Our Deployment • Based on the ElasticSearch plugin • To benefit from Jetty authentication • Deployed together with the search node • Public (read only) and private (read write) endpoints

  46. Demo • Production Dashboards • Syslog • Lemon • PDUs

  47. Feedback • Easy to deploy and use • Cool user interface • Fits many use cases • Text (syslog), metrics (lemon) • Still a limited feature set • Under active development • Very active and growing community

  48. OpenStack Monitoring Part III

  49. Experience with OpenStack 
