
Managing 1M Events/s in Google Cloud

Learn how to handle a high volume of events in Google Cloud using DataFlow, with insights on architecture, challenges, and best practices.


Presentation Transcript


  1. How to handle 1 000 000 events per second in Google Cloud

  2. About me Oleksandr Fedirko, BigData Architect at GlobalLogic. I do BigData enablement on projects, plus training and mentoring on BigData skills. alexander.fedirko@gmail.com https://www.linkedin.com/in/fedirko/

  3. Use Google DataFlow service

  4. Q&A session

  5. Developer vs Data engineer. Developer: OOP, SOLID, GoF, Java, C++, C#, JavaScript, Unit tests, TDD. Data engineer: DWH, Business Intelligence, Data Science, DBA, ETL, Pipeline, Reports, R, Data analysis.

  6. Agenda • Starting point and basic assumptions at the project • Evolution of the Cloud solution • Challenges that push decisions • Research on a BigData project, value of micro PoCs • NFRs on a BigData project • Good things that helped a lot on a project • A place of ML\AI in the system • Conclusions

  7. Starting point and basic assumptions at the project

  8. Starting point and basic assumptions at the project • Cloud agnostic • User-defined CEP rules (complex event processing) • 100 Data Source Types (Cisco ASA, Gigamon Netflow, Windows, Unix etc) • 10 000 Data Sources (Routers, PCs, Servers etc) • Need for ML\AI • Analytics • Quick search • SSO integration

  9. Starting point and basic assumptions at the project Example of the Rules (High Traffic): When the event(s) were detected by one or more of these data source types "NetFlow" And Bytes is greater than 1048576 bytes (1 MB) Then Create Indicator “High Traffic” End
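
Below is a minimal, hedged sketch of how a simple filter rule like "High Traffic" could be evaluated inside a Beam pipeline (the stack the project ended up on). The NetFlowEvent class, its field names and the upstream PCollection are hypothetical; the deck does not show the rules engine internals.

```java
// A hedged Beam sketch of the "High Traffic" rule above; event shape is illustrative.
import java.io.Serializable;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class HighTrafficRule {
  // Hypothetical event shape: data source type plus byte count.
  public static class NetFlowEvent implements Serializable {
    public String dataSourceType;
    public long bytes;
  }

  static final long ONE_MB = 1_048_576L;

  // Emits an indicator string for every qualifying event.
  public static PCollection<String> apply(PCollection<NetFlowEvent> events) {
    return events.apply("HighTraffic", ParDo.of(new DoFn<NetFlowEvent, String>() {
      @ProcessElement
      public void processElement(@Element NetFlowEvent e, OutputReceiver<String> out) {
        if ("NetFlow".equals(e.dataSourceType) && e.bytes > ONE_MB) {
          out.output("Indicator: High Traffic (" + e.bytes + " bytes)");
        }
      }
    }));
  }
}
```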

  10. Starting point and basic assumptions at the project Example of the Rules (Port Scanning) When The event(s) were detected by one or more of these data source types "Cisco ASA" With the same source IP and destination IP more than 5 times, across more than 5 destination ports within 4 min Then Create Incident “Port Scanning” of threat type "External Hacking" End
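
The "Port Scanning" rule needs windowed, keyed aggregation rather than a plain filter. The sketch below shows one hedged way to express it with Beam fixed windows; the AsaEvent shape and the exact windowing strategy are assumptions, not the project's actual implementation.

```java
// A hedged Beam sketch of the windowed "Port Scanning" rule above.
import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class PortScanningRule {
  // Hypothetical Cisco ASA event shape.
  public static class AsaEvent implements Serializable {
    public String sourceIp;
    public String destIp;
    public int destPort;
  }

  public static PCollection<String> apply(PCollection<AsaEvent> events) {
    return events
        // 4-minute tumbling windows, matching the rule's "within 4 min" clause.
        .apply(Window.<AsaEvent>into(FixedWindows.of(Duration.standardMinutes(4))))
        // Key by the (source IP, destination IP) pair; value is the destination port.
        .apply(MapElements.into(
                TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((AsaEvent e) -> KV.of(e.sourceIp + "->" + e.destIp, e.destPort)))
        .apply(GroupByKey.create())
        .apply("DetectScan", ParDo.of(new DoFn<KV<String, Iterable<Integer>>, String>() {
          @ProcessElement
          public void processElement(
              @Element KV<String, Iterable<Integer>> kv, OutputReceiver<String> out) {
            int hits = 0;
            Set<Integer> ports = new HashSet<>();
            for (Integer p : kv.getValue()) {
              hits++;
              ports.add(p);
            }
            // "more than 5 times, across more than 5 destination ports"
            if (hits > 5 && ports.size() > 5) {
              out.output("Incident: Port Scanning (External Hacking) for " + kv.getKey());
            }
          }
        }));
  }
}
```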

  11. Starting point and basic assumptions at the project Requirement example: Data Sources would be part of the Identity Database. Product must integrate with the CMDB for the list of devices to be monitored. Product must be capable of indexing terabytes of normalized log data and provide performance in both indexed and table scans that exceeds search results of 1 million records a second.

  12. Starting point and basic assumptions at the project Problems? • 1000 eps through Drools • No Autoscale on DataProc • Manage custom adapters via OpenShift cluster • Stateful backend

  13. Evolution of the Cloud solution

  14. Evolution of the Cloud solution Limitations via the SoW (Statement of Work) • GCP-bound • Exclude real-time event view • Exclude metrics UI • Postpone AI\ML implementation • Postpone Analytical storage implementation • No sensitive data in the system • Exclude audit logging

  15. Evolution of the Cloud solution Technology transform • From Azkaban to AirFlow • From requirements to SRS (Software Requirements Specification) • From mutable rows to immutable • From Spark to Beam+DataFlow • Agree on NiFi as the primary ingest tech, get rid of custom Java adapters

  16. [Architecture diagram spanning Google Cloud and on-prem & distributed locations: push and pull data sources feed IaaS NiFi and Kafka compute; realtime stream compute runs on Google Dataflow (Apache Beam); primary data storage is Google BigTable, archive data storage a Google Cloud Storage bucket, and secondary data storage IaaS Elastic compute with Google Compute Engine local filesystems; scheduling / workflow orchestration runs on IaaS Airflow with MySQL (primary and slave); the metrics datastore is IaaS OpenTSDB; the SecA application and web application run on OpenShift. *Data Source Inventory, Phase 2 data source types: Cisco ASA, F5 DNS, Cisco Ironport, Windows, NetFlow, Bit9, Unix, Protegrity, BlueCat, Cisco FireSight.]

  17. Challenges that push decisions

  18. Challenges that push decisions How to solve the stateful processing problem? • Share state in a database? • What kind? Key-value? • If not, then keep state on the stream-processing workers • Can they store 250k eps for 5 minutes? 1 hour? 1 day? • What to do with late arrivals?
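
One of the options raised above, keeping state on the stream-processing workers, is what Beam's stateful DoFn (supported by Dataflow) provides. A minimal sketch, with illustrative key and value types:

```java
// A minimal sketch of per-key state held on the workers instead of an external store.
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class PerKeyCounterFn extends DoFn<KV<String, String>, KV<String, Long>> {

  // State is kept per key (and window) on the Dataflow workers.
  @StateId("seen")
  private final StateSpec<ValueState<Long>> seenSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("seen") ValueState<Long> seen,
      OutputReceiver<KV<String, Long>> out) {
    Long previous = seen.read();
    long current = (previous == null ? 0L : previous) + 1;
    seen.write(current);
    out.output(KV.of(element.getKey(), current));
  }
}
```

Dataflow partitions this state by key across workers, so "can they store 250k eps" becomes a sizing question; how long state can be kept and what happens to late arrivals is then governed by the windowing and allowed-lateness configuration.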

  19. Challenges that push decisions How to collect metrics (infra\middleware\application)? • The customer cares less about infra-level metrics • Most of the metrics are throughput of the middleware (NiFi\Kafka\DataFlow) • How to measure DataFlow performance? There is nothing on Google StackDriver. Tip: use out-of-the-box APIs as much as possible
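
As an example of the "out-of-the-box APIs" tip: Beam's Metrics API surfaces custom counters in the Dataflow job UI and Cloud Monitoring without building a separate metrics path. The namespace and counter names below are illustrative.

```java
// A hedged sketch of application-level throughput metrics via Beam's Metrics API.
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class InstrumentedParseFn extends DoFn<String, String> {
  private final Counter parsed = Metrics.counter("ingest", "parsed-events");
  private final Counter failed = Metrics.counter("ingest", "parse-failures");
  private final Distribution eventSize = Metrics.distribution("ingest", "event-size-bytes");

  @ProcessElement
  public void processElement(@Element String raw, OutputReceiver<String> out) {
    eventSize.update(raw.length());
    try {
      out.output(raw.trim()); // stand-in for real parsing
      parsed.inc();
    } catch (Exception e) {
      failed.inc();
    }
  }
}
```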

  20. Challenges that push decisions How to measure delay on a component? • Call the Kafka API for offsets? • What to do with NiFi? • How to measure delay on DataFlow?
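
For the Kafka part of this question, per-partition lag is simply the end offset minus the consumer group's committed offset. A hedged standalone probe; broker address, group id and topic are placeholders, and the Set-based committed() call assumes Kafka clients 2.4+.

```java
// A hedged lag probe: end offsets vs. the group's committed offsets per partition.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaLagProbe {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka:9092");   // placeholder
    props.put("group.id", "dataflow-ingest");       // placeholder consumer group
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      List<TopicPartition> partitions = new ArrayList<>();
      consumer.partitionsFor("events").forEach(     // placeholder topic
          p -> partitions.add(new TopicPartition(p.topic(), p.partition())));

      Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
      Map<TopicPartition, OffsetAndMetadata> committed =
          consumer.committed(new HashSet<>(partitions)); // Kafka clients 2.4+
      for (TopicPartition tp : partitions) {
        long committedOffset =
            committed.get(tp) == null ? 0L : committed.get(tp).offset();
        System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp) - committedOffset);
      }
    }
  }
}
```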

  21. Research on a BigData project, value of micro PoCs

  22. Research on a BigData project, value of micro PoCs More than 20 PoCs (Research Spikes) within 1 year

  23. Research on a BigData project, value of micro PoCs For DataFlow • Can it handle 250k eps? • Does Beam fit well? • Would DataFlow autoscaling work fine?
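
A hedged sketch of the knobs such an autoscaling PoC would exercise: Dataflow's throughput-based autoscaling with a worker cap, set through standard pipeline options. Project, region and the worker limit are placeholders.

```java
// A hedged sketch of Dataflow autoscaling options for a streaming throughput PoC.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingPoc {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-project");    // placeholder
    options.setRegion("europe-west1");   // placeholder
    options.setStreaming(true);
    // Let Dataflow scale workers with backlog/throughput, capped for cost control.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(50);        // placeholder cap for the 250k eps test

    Pipeline pipeline = Pipeline.create(options);
    // ... attach the ingest read and the rule transforms here ...
    pipeline.run();
  }
}
```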

  24. Research on a BigData project, value of micro PoCs For GCP Datastore • Can it handle 250k eps? • Is it easily accessible? • Can it be integrated with DataFlow?

  25. Research on a BigData project, value of micro PoCs For GCP PubSub • Can it handle 250k eps? • Can it deliver every message? • Can it scale up and down?
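
A minimal sketch of how the PubSub PoC could be wired: read from a subscription and count messages per minute so throughput and scaling behaviour are visible. The subscription path is a placeholder.

```java
// A hedged PubSub throughput probe: per-minute message counts from a subscription.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubThroughputPoc {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(PubsubIO.readStrings()
            // placeholder subscription path
            .fromSubscription("projects/my-project/subscriptions/events-sub"))
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // Per-minute counts make throughput and scaling behaviour observable.
        .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
        .apply("LogCounts", ParDo.of(new DoFn<Long, Long>() {
          @ProcessElement
          public void processElement(@Element Long count) {
            System.out.println("messages in last minute: " + count);
          }
        }));
    p.run();
  }
}
```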

  26. Research on a BigData project, value of micro PoCs For AirFlow • Can we start static stream jobs from AirFlow? • Can we manage batch jobs via AirFlow by schedule? • Can we replace Azkaban with AirFlow? • What kind of resources do we need for AirFlow?

  27. Research on a BigData project, value of micro PoCs For NiFi overflow (to comply with zero message loss) • What should NiFi do when the downstream (Kafka) is down? • What should NiFi do when the downstream (Kafka) just starts throttling? • Store files to infinite storage • Process them later • Do not create extra pressure on Kafka

  28. Research on a BigData project, value of micro PoCs For the Replay service • How to recreate throughput on another environment? • Execute in parallel or sequentially? • What kind of UI to provide for the user?

  29. Research on a BigData project, value of micro PoCs For Kafka Manual Commit: to guarantee zero message loss we have to switch away from the default Auto Commit • Can we switch to non-autocommit on DataFlow? • Can we switch to non-autocommit on custom Kafka consuming jobs written in Java (Spring Cloud)? commitOffsetsInFinalize was found. The problem is in its definition: “It helps with minimizing gaps or duplicate processing of records while restarting a pipeline from scratch. But it does not provide hard processing guarantees.”
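
For reference, this is roughly what the manual-commit variant looks like with Beam's KafkaIO: commitOffsetsInFinalize() commits offsets back to Kafka only when output is finalized, which, per the definition quoted above, reduces but does not strictly guarantee exactly-once processing. Broker, topic and group id are placeholders, and the config-update method name differs in older Beam releases.

```java
// A hedged KafkaIO sketch of reading without auto-commit.
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitRead {
  public static void main(String[] args) {
    Map<String, Object> consumerConfig = new HashMap<>();
    consumerConfig.put(ConsumerConfig.GROUP_ID_CONFIG, "dataflow-ingest"); // placeholder
    consumerConfig.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);   // disable auto-commit

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")          // placeholder
        .withTopic("events")                         // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(consumerConfig)   // updateConsumerProperties in older Beam
        // Commit offsets back to Kafka when output is finalized, not on read.
        .commitOffsetsInFinalize());
    p.run();
  }
}
```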

  30. NFRs on a BigData project

  31. NFRs on a BigData project • No message loss • 250k eps, with spikes up to 1M eps • All secrets in Hashicorp • Compliance with OWASP best practices • Static Code Analysis • End-to-End TLS for all connectivity • No-downtime application update

  32. NFRs on a BigData project DevOps NFRs • Service Discovery (via Consul) • Circuit Breaker (via Hystrix\Resilience4j) • Health Check (Spring Cloud) • Start the pod on OpenShift without any dependency (lazy start): return a 200 response and fail later
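
A hedged illustration of the circuit-breaker NFR using Resilience4j, one of the two libraries named above. The downstream call, breaker name and thresholds are illustrative.

```java
// A hedged Resilience4j circuit-breaker sketch around a hypothetical downstream lookup.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

public class DownstreamClient {
  private final CircuitBreaker breaker;

  public DownstreamClient() {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                        // open after 50% failures
        .waitDurationInOpenState(Duration.ofSeconds(30)) // probe again after 30s
        .build();
    this.breaker = CircuitBreakerRegistry.of(config).circuitBreaker("identity-db");
  }

  // Wraps the downstream call; fails fast with a fallback while the breaker is open.
  public String lookup(String host) {
    try {
      return breaker.executeSupplier(() -> callIdentityDatabase(host));
    } catch (Exception e) {
      return "unknown"; // fallback keeps the service responsive (lazy-start style)
    }
  }

  private String callIdentityDatabase(String host) {
    // placeholder for the real HTTP/gRPC call
    throw new UnsupportedOperationException("not wired in this sketch");
  }
}
```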

  33. Good things that helped a lot on a project

  34. Good things that helped a lot on a project Extra team for CEP • 3-5 people • Isolate from other members • Core functionality first, integration later

  35. Good things that helped a lot on a project • Custom data generator • Custom scenarios • Throughput generation • Custom stream manager • Start\stop\restart

  36. Good things that helped a lot on a project • Keep your software design and architecture up-to-date • Only live schemas in your Wiki, no static images • Make code review for everything

  37. A place of ML\AI in the system

  38. [The architecture diagram from slide 16 is shown again here as context for where ML\AI sits in the system.]

  39. ML\AI use cases • Train on a dataset from BigTable • Apply the model in real time within the Rules Engine • Apply the model on batch data from BigTable Typical AI\ML tasks for security analytics: • Anomaly detection • Fuzzy logic to identify hosts in the Identity Database • Malicious use of Rules Engine • Statistical methods to auto-adjust Rules

  40. Conclusions

  41. Conclusions • See something unknown? Do a micro PoC • Avoid mutable objects in Big Data • Limit the scope to the real deliverable product • Requirements too fuzzy? Make your own! • DevOps are your best friends (QA too) • Do not use Gerrit • Sketch everything before you start developing

  42. Q&A session

  43. Thank you!
