This proposal outlines the need for representative benchmarks in deep analytics pipelines and the importance of designing systems for real-world applications. It discusses typical big data use cases, domains, user modeling, and the pipeline process from data acquisition to model testing. The proposal suggests five classes for benchmarking at different scales and emphasizes the publication of results at every stage.
Deep Analytics Pipeline: A Benchmark Proposal • Milind Bhandarkar • Chief Scientist, Pivotal • (mbhandarkar@gopivotal.com) • (Twitter: @techmilind)
Why “Representative” Benchmarks? • Benchmarks are most relevant when representative of real-world applications • Design, architect, and tune systems for broadly applicable workloads • Rule of system design: make common tasks fast, other tasks possible
Drivers for Big Data • Ubiquitous Connectivity - Mobile • Sensors Everywhere - Industrial Internet • Content Production - Social
Big Data Sources • Events • Direct - Human Initiated • Indirect - Machine Initiated • Software Sensors (clickstreams, locations) • Public Content (blogs, tweets, status updates, images, videos)
Typical Big Data Use Case • Gain better understanding of “business entities” using their past behavior • Use “All Data” to model entities • Lower costs mean even incremental improvements in the model are worthwhile
“User” Modeling • Objective: Determine user interests by mining user activities • High dimensionality of possible user activities • Typical user has a sparse activity vector • Event attributes change over time
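To make the sparsity concrete, here is a minimal Python sketch (not part of the proposal): one user's activity vector stored as a dict keyed by activity id. The event layout and the `activity_id` field are assumptions for illustration only.

```python
# Minimal sketch, assuming events are dicts with an "activity_id" field.
# The activity space is high-dimensional, but each user touches only a
# few activities, so a dict is a natural sparse-vector representation.
from collections import defaultdict

def activity_vector(events):
    """Count occurrences of each activity id in one user's event stream."""
    vec = defaultdict(int)
    for event in events:
        vec[event["activity_id"]] += 1   # only observed activities get a slot
    return dict(vec)

# Hypothetical event stream for a single user.
events = [
    {"activity_id": "click:ad:42"},
    {"activity_id": "visit:page:home"},
    {"activity_id": "click:ad:42"},
]
print(activity_vector(events))   # {'click:ad:42': 2, 'visit:page:home': 1}
```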
User-Modeling Pipeline • Data Acquisition • Sessionization, Normalization • Feature and Target Generation • Model Training • Model Testing • Upload to serving
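One way to read this slide is that each stage maps a dataset to a dataset, so stages compose in order. The sketch below is an assumption about that structure, not the proposal's code; the stage bodies are placeholders, and a real pipeline would run each stage as a distributed job.

```python
# Sketch: each stage is a function from dataset to dataset, so the
# pipeline is ordered composition. Stage bodies are placeholders.
def sessionize(events):
    return events            # group raw events into per-user sessions

def normalize(sessions):
    return sessions          # canonicalize attributes and units

def generate_features_and_targets(sessions):
    return sessions          # emit (feature vector, target) examples

def train_model(examples):
    return examples          # fit a model; see the training slide below

def run_pipeline(data, stages):
    """Apply each stage in order to the dataset."""
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline([], [sessionize, normalize,
                           generate_features_and_targets, train_model])
```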
Targets • User actions of interest • Clicks on Ads & Content • Site & Page visits • Conversion Events • Purchases, Quote requests • Sign-up for membership, etc.
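A hedged sketch of target extraction follows: the set of target event types is illustrative (taken loosely from this slide), and the event field names are assumptions.

```python
# Illustrative target taxonomy; a real deployment would define its own.
TARGET_TYPES = {"ad_click", "content_click", "site_visit",
                "purchase", "quote_request", "signup"}

def extract_targets(events):
    """Yield (user_id, timestamp) for each event that is a target action."""
    for e in events:
        if e["type"] in TARGET_TYPES:
            yield e["user_id"], e["timestamp"]

events = [
    {"user_id": "u1", "type": "page_view", "timestamp": 1},
    {"user_id": "u1", "type": "purchase",  "timestamp": 2},
]
print(list(extract_targets(events)))   # [('u1', 2)]
```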
Features • Summary of user activities over a time-window • Aggregates, moving averages, rates over various time-windows • Incrementally updated
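To make "incrementally updated" concrete, here is one possible scheme, an exponentially decayed rate: yesterday's aggregate plus today's count gives today's aggregate without rescanning history. The decay constant is an assumption, and real pipelines may use fixed windows instead.

```python
def update_rate(prev_rate, new_count, decay=0.9):
    """Blend the previous rate with the latest count (exponential decay)."""
    return decay * prev_rate + (1.0 - decay) * new_count

rate = 0.0
for daily_count in [3, 0, 1, 5]:        # hypothetical daily event counts
    rate = update_rate(rate, daily_count)
print(round(rate, 3))                    # 0.809
```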
Joining Targets & Features • Target rates are very low: 0.01% to 1% • First, find users with a target in their sessions • Filter out activity from users without targets • Join feature vectors with targets
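The sketch below mirrors that order of operations using plain Python dicts as a stand-in for a distributed semi-join: collect the small set of users who have any target, drop everyone else's activity, then join. Names and data shapes are assumptions.

```python
def join_targets_features(features, targets):
    """features: {user_id: feature_vector}; targets: {user_id: label}.

    Because target rates are ~0.01% to 1%, the target-user set is tiny
    and can be broadcast; all other users' activity is filtered out.
    """
    users_with_targets = set(targets)
    return {
        uid: (features[uid], targets[uid])
        for uid in users_with_targets
        if uid in features               # keep only users we have features for
    }

features = {"u1": [0.2, 0.8], "u2": [0.1, 0.1]}
targets = {"u1": 1}
print(join_targets_features(features, targets))   # {'u1': ([0.2, 0.8], 1)}
```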
Model Training • Regressions • Boosted Decision Trees • Naive Bayes • Support Vector Machines • Maximum Entropy modeling
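As a minimal stand-in for one technique on this list, the sketch below fits a logistic regression with scikit-learn on synthetic data; a benchmark run at the proposed scales would use a distributed trainer instead. The ~1% positive rate echoes the join slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 20))               # synthetic feature vectors
y = np.zeros(1000, dtype=int)
y[:10] = 1                               # ~1% positives, per the join slide

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:2]))        # class probabilities for two users
```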
Model Testing • Apply models to features in hold-out data • Pleasantly parallel • Measure model effectiveness
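"Pleasantly parallel" here means hold-out partitions can be scored independently and the per-partition results merged afterward. The multiprocessing sketch below is a single-machine stand-in for whatever cluster framework actually runs this stage, with a trivial placeholder model.

```python
from multiprocessing import Pool

def predict(features):
    return 0                 # placeholder for the trained model's decision

def score_partition(partition):
    """Return (num_correct, num_examples) for one hold-out shard."""
    correct = sum(1 for feats, label in partition if predict(feats) == label)
    return correct, len(partition)

if __name__ == "__main__":
    shards = [[((0.1,), 0)] * 100, [((0.9,), 1)] * 100]   # toy shards
    with Pool(2) as pool:
        results = pool.map(score_partition, shards)       # independent scoring
    correct = sum(c for c, _ in results)
    total = sum(n for _, n in results)
    print("accuracy:", correct / total)                   # 0.5 on this toy data
```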
Upload to Serving • Upload scores to serving systems • Benchmark large-scale serving systems • NoSQL systems with rolling batch upserts and read-mostly access
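A sketch of the upload path as rolling batch upserts; `client.batch_upsert` is a hypothetical call, standing in for the bulk-write API of whichever NoSQL system is under test.

```python
def upload_scores(scores, client, batch_size=1000):
    """scores: iterable of (user_id, score); write in fixed-size batches."""
    batch = []
    for user_id, score in scores:
        batch.append((user_id, score))
        if len(batch) == batch_size:
            client.batch_upsert(batch)   # hypothetical bulk-upsert API
            batch = []
    if batch:                            # flush the final partial batch
        client.batch_upsert(batch)
```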
Proposal: 5 Classes • Tiny (100K entities, 10 events per entity) • Small (1M entities, 10 events per entity) • Medium (10M entities, 100 events per entity) • Large (100M entities, 1000 events per entity) • Huge (1B entities, 1000 events per entity)
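Straightforward arithmetic from this slide: multiplying entities by events per entity gives the raw event volume each class implies, which is useful for sizing generated inputs.

```python
classes = {
    "Tiny":   (100_000,         10),
    "Small":  (1_000_000,       10),
    "Medium": (10_000_000,      100),
    "Large":  (100_000_000,     1_000),
    "Huge":   (1_000_000_000,   1_000),
}
for name, (entities, events_per_entity) in classes.items():
    print(f"{name}: {entities * events_per_entity:,} events")
# Tiny: 1,000,000 events ... Huge: 1,000,000,000,000 events
```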
Proposal: Publish results for every stage • Data pipelines are constructed by mixing and matching stages • Different modeling techniques per class • Performance numbers must be published for every stage