
Real-time Analytics on Big Data Streams

This talk discusses the challenges and solutions of real-time analytics on big data streams, covering data streams, data-flow parallelism, and the limits of current computational power and parallelization. It explores use cases across industries, the cost of synchronization, and its impact on latency and scalability.



Presentation Transcript


  1. Data Streams, Data-flow parallelism, and Real-time Analytics Christoph.Koch@epfl.ch -- EPFL DATA Lab

  2. The Fail Whale

  3. Real-time data processing use-cases • Science: sensors; feedback for experiment control • Monitoring, log analysis, root-cause analysis for failures • Finance: algorithmic trading, risk management, OLTP • Commerce: OLTP • Web: Twitter, Facebook; search frontends (Google), personalized <anything>, clickstream analysis

  4. Real-time Analytics on Big Data • Big Data 3V: Volume, Velocity, Variety • This talk focuses on velocity and volume • Continuous data analysis • Stream monitoring & mining; enforcing policies/security • Timely response required (low latencies!) • Performance: high throughput and low latencies!

  5. Paths to (real-time) performance • Parallelization • Small data (seriously!) • Incrementalization (online/anytime) • Specialization

  6. Comp. Arch. not to the rescue • Current data growth outpaces Moore's law. • Sequential CPU performance no longer grows (already for three Intel processor generations). • Logical states need time to stabilize. • Moore's law to fail by 2020: only a few (2?) die-shrink iterations left. • Limit on the number of cores. • Dennard scaling (the true motor of Moore's law) has ended: energy cost and cooling problems! • More computational power will always be more expensive!

  7. Parallelization is no silver bullet • Computer architecture: the end of Dennard scaling means parallelization is expensive! • Computational complexity theory: there are inherently sequential problems (assuming NC ≠ PTIME). • Fundamental impossibilities in distributed computing: • Distributed computation requires synchronization. • Distributed consensus has a minimum latency dictated by the spatial distance of compute nodes (and other factors): msecs in a LAN, 100s of msecs in a WAN. Speed of light! • Max # of synchronous computation steps per second, no matter how much parallel hardware is available.

  8. Reminder: Two-Phase Commit • Coordinator: send prepare; wait for all responses; force-write commit or abort record; send commit or abort; wait for all ACKs; write end record. • Subordinate: force-write prepare record; send yes or no; force-write commit or abort record; send ACK.
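  A minimal sketch of this message flow (illustration only: Subordinate is a hypothetical in-memory stand-in, and log() stands in for force-writes to stable storage):

      # Sketch of the coordinator and subordinate roles in 2PC.
      def two_phase_commit(subordinates, log=print):
          # Phase 1: send prepare to everyone and wait for all votes.
          votes = [sub.prepare() for sub in subordinates]
          decision = "commit" if all(votes) else "abort"
          log(f"coordinator: force-write {decision} record")
          # Phase 2: broadcast the decision and wait for all ACKs.
          for sub in subordinates:
              sub.decide(decision)
          log("coordinator: write end record")
          return decision

      class Subordinate:                          # toy in-memory participant
          def __init__(self, vote=True):
              self.vote = vote
          def prepare(self):
              print("subordinate: force-write prepare record, send vote")
              return self.vote
          def decide(self, decision):
              print(f"subordinate: force-write {decision} record, send ACK")

      print(two_phase_commit([Subordinate(), Subordinate(vote=False)]))   # abort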

  9. The cost of synchronization • 2PC: minimum latency of two network roundtrips. • Latency limits Xact throughput: throughput (Xacts/second) = 1 / (cost of one run of 2PC). • Lausanne-Shenzhen: 9491 km * 4 at the speed of light is about 126 ms. • Wide-area / cross-data-center consistency: ~8 Xacts/s ?!? • Consensus >= two-phase commit.
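  The slide's numbers are a back-of-envelope calculation; here is that arithmetic as a sketch (it assumes signals travel at the speed of light in vacuum and that 2PC runs are fully serialized, one after another):

      # Back-of-envelope: 2PC latency and throughput between Lausanne and Shenzhen.
      SPEED_OF_LIGHT_KM_S = 299_792
      distance_km = 9_491                    # great-circle distance Lausanne-Shenzhen
      round_trips = 2                        # 2PC needs at least two network roundtrips
      latency_s = distance_km * 2 * round_trips / SPEED_OF_LIGHT_KM_S
      print(f"2PC latency: {latency_s * 1000:.0f} ms")               # ~127 ms
      print(f"serialized throughput: {1 / latency_s:.1f} Xacts/s")   # ~7.9 Xacts/s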

  10. Xact throughput local vs. cross-datacenter (slide by P. Bailis)

  11. The cost of synchronization • For every network latency, there is a maximum scale at which a synchronized system can run. • Jim Gray: in the late 1980s, for 2PC, optimal throughput was reached with ~50 machines (above that, throughput decreases). • Today the number is higher, but not by much.
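  A toy model of why throughput can peak and then fall as machines are added (this is not Gray's analysis; it assumes every transaction runs 2PC across all n machines and that each participant independently aborts it with probability p):

      # Capacity grows with n, but a transaction commits only if all n
      # participants vote yes, which happens with probability (1 - p)**n.
      P_ABORT = 0.02                         # hypothetical per-participant abort probability

      def committed_throughput(n, per_machine_rate=100.0):
          return n * per_machine_rate * (1 - P_ABORT) ** n

      best_n = max(range(1, 201), key=committed_throughput)
      print(best_n)                          # peaks near 1/P_ABORT, i.e. around 50 machines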

  12. Latency and scaling of synchronized parallel systems • SIMD in a CPU: does not need to scale, but is implemented in hardware and on one die: ok • Cache coherence in multi-socket servers: a headache for computer architects! • Linear algebra in HPC: a very special and easy problem, superfast interconnects, locality; but scaling remains a challenge for designers of supercomputers and linear algebra routines. • Scaling of ad-hoc problems on supercomputers: an open problem, ad-hoc solutions. • Consistency inside a data center (Xacts, MPI syncs): <10000 Hz • Cross-data-center consistency: ~10 Hz • Latency of heterogeneous local batch jobs: <2 Hz (HPC, MapReduce, Spark)

  13. Latency and scaling of synchronized parallel systems #2 • Scaling of batch systems: <2 Hz • HPC jobs: <1 Hz (?) • Map/Reduce: ~0.1 Hz (synchronization via disk, slow scheduling) • Spark: <2 Hz; "Spark Streaming" • Note on Hadoop efficiency: it takes 80-100 nodes to match single-core performance.

  14. The cost of synchronization • Machine 1 holds a, Machine 2 holds b: a := 1, b := 1. • With a sync before each step: a := a+b (= 2), then b := a+b (= 3), then a := a+b (= 5). • Without a sync, the next b := a+b reads a stale a and yields 5 instead of 8. • Distributed computation needs synchronization. • Low-latency stream processing? Asynchronous message forwarding: data-flow parallelism.
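  A minimal sketch of this example (it assumes each step replaces the variable with a+b, which matches the values 2, 3, 5 shown on the slide):

      # Machine 1 owns a, Machine 2 owns b; without a sync, Machine 2
      # computes its last step with a stale copy of a.
      def run(sync_last_step):
          a, b = 1, 1
          a_seen_by_m2 = a                   # Machine 2's last received copy of a
          a = a + b                          # Machine 1: a := a+b  -> 2
          a_seen_by_m2 = a                   # sync: ship the new a to Machine 2
          b = a_seen_by_m2 + b               # Machine 2: b := a+b  -> 3
          a = a + b                          # Machine 1: a := a+b  -> 5
          if sync_last_step:
              a_seen_by_m2 = a               # sync again before the last step
          b = a_seen_by_m2 + b               # Machine 2: b := a+b
          return b

      print(run(sync_last_step=True))        # 8
      print(run(sync_last_step=False))       # 5 -- the stale result on the slide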

  15. Does streaming/message passing defeat the 2PC lower bound? Assume we compute each statement once. • Different machines handle the statements. • Don't compute until you have received all the msgs you need. • Works! • But requires synchronized timestamps on the input stream: one stream source, or synchronization of stream sources! • (Diagram: the a/b computation unrolled as a dataflow, with each statement a' := a+b, b'' := a'+b', a'' := a'+b'', b''' := a''+b'' assigned to its own machine.)

  16. Does streaming/message passing defeat the 2PC lower bound? Repeatedly compute values. • Each msg has a (creation) epoch timestamp. • Multiple msgs can share a timestamp. • Works in this case! • Computes only sums of two objects: we know when we have received all the msgs we need to make progress! • (Same diagram as slide 15.)
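  A sketch of that idea (hypothetical operator, not the API of any particular system): because this operator needs exactly one message per input per epoch, it can decide locally when an epoch is complete.

      from collections import defaultdict

      class SumOperator:
          """Computes x + y per epoch once both inputs for that epoch have arrived."""
          def __init__(self):
              self.pending = defaultdict(dict)       # epoch -> {input_name: value}

          def on_message(self, epoch, input_name, value):
              self.pending[epoch][input_name] = value
              if len(self.pending[epoch]) == 2:      # both inputs are in
                  x, y = self.pending.pop(epoch).values()
                  print(f"epoch {epoch}: emit {x + y}")   # forward downstream

      op = SumOperator()
      op.on_message(0, "a", 1)     # waiting for input b at epoch 0
      op.on_message(1, "a", 2)     # epochs may arrive out of order
      op.on_message(0, "b", 1)     # epoch 0 complete -> emits 2
      op.on_message(1, "b", 3)     # epoch 1 complete -> emits 5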

  17. Does streaming/message passing defeat the 2PC lower bound? Repeatedly compute values. • Each msg has a (creation) epoch timestamp. • Multiple msgs can share a timestamp. • Notify when no more messages of a particular ts are to come from a sender. • Requires waiting for notify() from all sources: synch again! • If there is a cyclic dependency (same values read as written), 2PC is back! • (Same diagram as slide 15.)
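  A sketch of the notify() mechanism (hypothetical classes, not a real framework API): when the number of messages per epoch is not known in advance, an operator can only close an epoch once every upstream source has declared that epoch finished.

      from collections import defaultdict

      class CountPerEpoch:
          def __init__(self, sources):
              self.sources = set(sources)
              self.done = defaultdict(set)       # epoch -> sources that sent notify()
              self.counts = defaultdict(int)     # epoch -> messages seen so far

          def on_message(self, epoch, source, value):
              self.counts[epoch] += 1            # value itself is unused in this toy

          def on_notify(self, epoch, source):    # "no more messages for this epoch"
              self.done[epoch].add(source)
              if self.done[epoch] == self.sources:
                  print(f"epoch {epoch}: {self.counts.pop(epoch, 0)} messages")

      op = CountPerEpoch(sources={"s1", "s2"})
      op.on_message(0, "s1", 7); op.on_notify(0, "s1")
      op.on_message(0, "s2", 8); op.on_message(0, "s2", 9)
      op.on_notify(0, "s2")                      # only now can epoch 0 be closed: 3 messages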

  18. Streaming+Iteration: Structured time [Naiad] • (Diagram: a dataflow over nodes A-E with a loop; timestamps are (t) outside the loop, are extended to (t, t') inside it, and the feedback edge increments the loop counter to (t, t'+1).)
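  A sketch of structured timestamps as nested tuples (illustration only, not Naiad's actual API): entering a loop extends the timestamp, the feedback edge increments the innermost counter, and leaving the loop strips it again.

      def enter_loop(ts):              # extend the timestamp when a record enters a loop
          return ts + (0,)

      def next_iteration(ts):          # feedback edge: increment the innermost counter
          return ts[:-1] + (ts[-1] + 1,)

      def leave_loop(ts):              # strip the loop counter on exit
          return ts[:-1]

      ts = (3,)                        # record from input epoch t = 3
      ts = enter_loop(ts)              # (3, 0)   -- the (t, t') on the slide
      ts = next_iteration(ts)          # (3, 1)   -- the (t, t'+1) on the slide
      ts = leave_loop(ts)              # (3,)
      print(ts)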

  19. Data-flow parallel systems • Popular for real-time analytics. • Most popular framework: Apache Storm / Twitter Heron. • Simple programming model ("bolts", analogous to map-reduce mappers). • Requires nonblocking operators: e.g. symmetric hash-join vs. sort-merge join.
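  A sketch of why a symmetric hash-join is nonblocking (toy in-memory equi-join, not Storm's API): every arriving tuple is inserted into its side's hash table and immediately probed against the other side, so matches stream out without waiting for either input to finish.

      from collections import defaultdict

      class SymmetricHashJoin:
          def __init__(self):
              self.left = defaultdict(list)      # key -> left tuples seen so far
              self.right = defaultdict(list)     # key -> right tuples seen so far

          def on_left(self, key, value):
              self.left[key].append(value)
              return [(key, value, other) for other in self.right[key]]

          def on_right(self, key, value):
              self.right[key].append(value)
              return [(key, other, value) for other in self.left[key]]

      join = SymmetricHashJoin()
      print(join.on_left("k1", "a"))    # []  -- no right tuple yet, but we don't block
      print(join.on_right("k1", "x"))   # [('k1', 'a', 'x')]  -- emitted immediately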

  20. Latency and Data Skew • Data skew: uneven parallelization. • Reasons: bad initial parallelization; uneven blowup of intermediate results; bad hashing in (map/reduce) reshuffling. • Occurs both in batch systems such as map-reduce and in streaming/data-flow systems. • Fixes: skew-resilient repartitioning, load shedding, … • Node failures look similar.
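  A tiny illustration of the bad-hashing case (hypothetical workload with one heavy-hitter key): hash partitioning spreads keys, not tuples, so one worker ends up with almost all the data.

      from collections import Counter

      keys = ["hot"] * 9_000 + [f"key{i}" for i in range(1_000)]    # skewed workload
      NUM_WORKERS = 10
      load = Counter(hash(k) % NUM_WORKERS for k in keys)
      print(sorted(load.values(), reverse=True))    # one partition holds ~90% of the tuples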

  21. Paths to (real-time) performance • Parallelization • Small data (seriously!) <cut> • Incrementalization (online/anytime) • Specialization

  22. Paths to (real-time) performance • Parallelization • Small data (seriously!) • Incrementalization (online/anytime) <cut> • Specialization

  23. Paths to (real-time) performance • Parallelization • Small data (seriously!) • Incrementalization (online/anytime) • Specialization • Hardware: GPU, FPGA, … • Software: compilation • Lots of activity on both the hardware and software fronts at EPFL & ETHZ

  24. Summary • The classical batch job is getting a lot of competition: people need low latency for a variety of reasons. • Part of a cloud/OpenStack deployment will be used for low-latency work. • Latency is a problem! Virtualization, Microsoft vs. Amazon. • Distributed computation at low latencies has fundamental limits. • Very hard systems problems, huge design space to explore. • Incrementalization can give asymptotic efficiency improvements, by many orders of magnitude in practice.
