
Taming the Big Data Fire Hose

This article explores the challenges and requirements of managing high-velocity, high-volume data in big data applications. It discusses the need to validate, count, aggregate, and analyze data in real time, the importance of scaling on demand, and the role of transactional capabilities. It also surveys big data management infrastructure options, including NewSQL and NoSQL solutions.



Presentation Transcript


  1. Taming the Big Data Fire Hose. John Hugg, Sr. Software Engineer, VoltDB

  2. Big Data Defined
  • Velocity
    • Moves at very high rates (think sensor-driven systems)
    • Valuable in its temporal, high-velocity state
  • Volume
    • Fast-moving data creates massive historical archives
    • Valuable for mining patterns, trends and relationships
  • Variety
    • Structured (logs, business transactions)
    • Semi-structured and unstructured

  3. Example Big Data Use Cases
  Data sources span the range from lower-frequency to higher-frequency operations:
  • Financial trade monitoring
  • Telco call data record management
  • Website analytics, fraud detection
  • Online gaming micro-transactions
  • Digital ad exchange services
  • Wireless location-based services

  4. Big Data and You
  • Incoming data streams are different than traditional business apps
  • You need to write data quickly and reliably, but …
  • It's not just about high-speed writes:
    • You need to validate in real time
    • You need to count and aggregate
    • You need to analyze in real time
    • You need to scale on demand
    • You may need to transact

  5. Big Data Management Infrastructure
  The landscape runs along a spectrum from high velocity to high volume, with workloads such as online gaming, ad serving, sensor data, financial trade, Internet commerce, SaaS/Web 2.0, and mobile platforms spread across it, feeding an analytic datastore and other OLAP data stores on the high-volume end.
  • NewSQL (high-velocity side): structured data, ACID guarantees, relational/SQL, real-time analytics
  • NoSQL (high-volume side): unstructured data, eventual consistency, schemaless, KV/document

  6. Big Data Management Infrastructure (the same high velocity/high volume diagram as slide 5, shown without the NewSQL and NoSQL trait annotations)

  7. High Velocity Data Management

  8. High Velocity DBMS Requirements
  • Ingest at very high speeds and rates
  • Scale easily to meet growth and demand peaks
  • Support integrated fault tolerance
  • Support a wide range of real-time (or "near-time") analytics
  • Integrate easily with high volume analytic datastores

  9. High Speed Data Ingestion
  • Support millions of write operations per second at scale
  • Keep read and write latencies below 50 milliseconds
  • Provide ACID-level consistency guarantees (maybe)
  • Support one or more well-known application interfaces (a SQL sketch follows this slide):
    • SQL
    • Key/Value
    • Document
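
  As a minimal sketch of what the SQL-facing ingest path can look like, the DDL below declares a narrow table sized for high-rate writes. The table name and columns are hypothetical, and the PARTITION TABLE statement follows VoltDB-style DDL; other NewSQL engines declare partitioning differently.

    -- Hypothetical ingest table: narrow rows keep per-write cost low.
    CREATE TABLE events (
      device_id INTEGER   NOT NULL,  -- partitioning column
      event_ts  TIMESTAMP NOT NULL,
      reading   FLOAT     NOT NULL
    );

    -- VoltDB-style declaration: rows are routed to a partition by
    -- device_id, so each insert runs as a single-partition transaction.
    PARTITION TABLE events ON COLUMN device_id;

    -- Each insert carries the partition key, so it is routed to exactly
    -- one partition and many such inserts can execute in parallel.
    INSERT INTO events (device_id, event_ts, reading) VALUES (?, ?, ?);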

  10. Scale to Meet Growth and Demand
  • Scale out on commodity hardware
  • Built-in database partitioning
    • Manual sharding and/or add-on solutions are brittle, require apps to do the "heavy lifting", and can be an operational nightmare
    • The database must automatically implement the defined partitioning strategy
    • The application should "see" a single database instance
  • The database should encourage scalability best practices
    • For example, replication of reference data minimizes the need for multi-partition operations

  11. A Look Inside Partitioning
  Schema: table orders (customer_id [partition key], order_id, product_id) has its rows spread across three partitions; table products (product_id, product_name) is replicated, so every partition holds a full copy of its rows (1 knife, 2 spoon, 3 fork). How queries route:
  • select count(*) from orders where customer_id = 5 → single-partition (the partition key pins the query to one partition)
  • select count(*) from orders where product_id = 3 → multi-partition (no partition key, so every partition is consulted)
  • insert into orders (customer_id, order_id, product_id) values (3,303,2) → single-partition
  • update products set product_name = 'spork' where product_id = 3 → multi-partition (a replicated table must be updated everywhere)
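
  To make the slide's schema concrete, here is a hedged sketch of the corresponding DDL. The table and column names come from the slide; the column types and the VoltDB-style PARTITION TABLE statement are assumptions.

    CREATE TABLE orders (
      customer_id INTEGER NOT NULL,  -- partition key
      order_id    INTEGER NOT NULL,
      product_id  INTEGER NOT NULL
    );

    -- Partition orders by customer_id: statements that supply a
    -- customer_id run entirely inside one partition.
    PARTITION TABLE orders ON COLUMN customer_id;

    -- No PARTITION statement for products: in VoltDB-style DDL a table
    -- without one is replicated to every partition, which is why
    -- single-partition reads can still see the full reference data.
    CREATE TABLE products (
      product_id   INTEGER NOT NULL,
      product_name VARCHAR(64)
    );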

  12. Integrated Fault Tolerance
  • The database should transparently support built-in, "Tandem-style" HA
  • Users should be able to easily increase or decrease fault tolerance levels
  • The database should be easily and quickly recoverable in the event of severe hardware failures
  • The database should automatically detect and manage a variety of partition fault conditions
  • Downed nodes should be "rejoinable" without the need for service windows

  13. Partition Detection & Recovery
  Network fault protection (illustrated with a three-node cluster: Server A, Server B, Server C):
  • Detects the partition event
  • Determines which side of the fault to disable
  • Snapshots and disables the orphaned node(s)
  Live node rejoin:
  • Allows "downed" nodes to rejoin the live cluster
  • Automatically re-syncs all node data
  • Coordinates transactions during re-sync

  14. Real-time Analytics
  • The database should support a wide variety of high performance reads
    • High-frequency single-partition
    • Lower-frequency multi-partition
  • Common analytic queries should be optimized in the database (see the sketch after this slide)
    • Multi-partition aggregations, limits, etc.
  • The database should accommodate a flexible range of relational data operations
    • Particularly relevant to structured data
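
  One common way such aggregations are optimized in-database is a continuously maintained materialized view. The sketch below uses VoltDB-style CREATE VIEW syntax over the orders table from slide 11; the view name is hypothetical.

    -- VoltDB-style materialized view: maintained incrementally as rows
    -- are inserted into orders, so reading the rollup is a cheap lookup
    -- rather than a scan-and-aggregate at query time.
    CREATE VIEW orders_by_product (product_id, order_count) AS
      SELECT product_id, COUNT(*)
      FROM orders
      GROUP BY product_id;

    -- A high-frequency dashboard read against the pre-aggregated view.
    SELECT product_id, order_count
    FROM orders_by_product
    ORDER BY order_count DESC;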

  15. Integration with Analytic Datastores
  • The database should offer high performance, transactional export
  • Export should allow a wide variety of common data enrichment operations:
    • Normalize and de-normalize
    • De-duplicate
    • Aggregate
  • The architecture should support loosely-coupled integrations, absorbing:
    • Impedance mismatches
    • Durability differences

  16. VoltDB Export Data Flow
  Data flows from the high velocity database cluster, through an export queue, to the downstream analytic datastore:
  • Loosely-coupled, asynchronous
  • The queue must be durable
  • Bi-directional durability
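
  As a hedged illustration of how such an export path can be declared, the sketch below uses VoltDB-style stream DDL. The stream and target names are hypothetical, and the EXPORT TO TARGET form reflects later VoltDB releases rather than anything shown in this deck.

    -- Hypothetical export stream: rows written here are committed as
    -- part of the transaction, then drained asynchronously from a
    -- durable queue to the configured 'warehouse' target.
    CREATE STREAM completed_orders
      PARTITION ON COLUMN customer_id
      EXPORT TO TARGET warehouse (
        customer_id INTEGER NOT NULL,
        order_id    INTEGER NOT NULL,
        product_id  INTEGER NOT NULL
    );

    -- Inside a transaction, ordinary insert syntax feeds the export
    -- queue, giving transactional hand-off to the analytic datastore.
    INSERT INTO completed_orders (customer_id, order_id, product_id)
    VALUES (?, ?, ?);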

  17. Summary
  • Big Data infrastructures will usually require more than one engine:
    • A high velocity engine for "fast" data
    • An analytic engine for "deep" data
  • Data characteristics will often determine which high velocity engine to use:
    • NewSQL is often well-suited to structured data
    • NoSQL is often a good fit for unstructured data
  • Choose solutions that suit your needs and are designed for interoperability
