This proposal outlines the need for representative benchmarks in deep analytics pipelines and the importance of designing systems for real-world applications. It discusses typical big data use cases, domains, user modeling, and the pipeline process from data acquisition to model testing. The proposal suggests five classes for benchmarking at different scales and emphasizes the publication of results at every stage.
Deep Analytics Pipeline: A Benchmark Proposal • Milind Bhandarkar • Chief Scientist, Pivotal • (mbhandarkar@gopivotal.com) • (Twitter: @techmilind)
Why “Representative” Benchmarks? • Benchmarks are most relevant when representative of real-world applications • Design, architect, and tune systems for broadly applicable workloads • Rule of system design: make common tasks fast, other tasks possible
Drivers for Big Data • Ubiquitous Connectivity - Mobile • Sensors Everywhere - Industrial Internet • Content Production - Social
Big Data Sources • Events • Direct - Human Initiated • Indirect - Machine Initiated • Software Sensors (clickstreams, locations) • Public Content (blogs, tweets, status updates, images, videos)
Typical Big Data Use Case • Gain better understanding of “business entities” using their past behavior • Use “All Data” to model entities • Lower costs mean even incremental improvements in the model are worthwhile
“User” Modeling • Objective: Determine user interests by mining user activities • High dimensionality of possible user activities • Typical user has a sparse activity vector • Event attributes change over time
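To make the sparsity concrete, here is a minimal Python sketch (not part of the proposal): one user's activity vector stored as a dict keyed by activity id. The event layout and the `activity_id` field are assumptions for illustration only.

```python
# Minimal sketch, assuming events are dicts with an "activity_id" field.
# The activity space is high-dimensional, but each user touches only a
# few activities, so a dict is a natural sparse-vector representation.
from collections import defaultdict

def activity_vector(events):
    """Count occurrences of each activity id in one user's event stream."""
    vec = defaultdict(int)
    for event in events:
        vec[event["activity_id"]] += 1   # only observed activities get a slot
    return dict(vec)

# Hypothetical event stream for a single user.
events = [
    {"activity_id": "click:ad:42"},
    {"activity_id": "visit:page:home"},
    {"activity_id": "click:ad:42"},
]
print(activity_vector(events))   # {'click:ad:42': 2, 'visit:page:home': 1}
```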
User-Modeling Pipeline • Data Acquisition • Sessionization, Normalization • Feature and Target Generation • Model Training • Model Testing • Upload to serving
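One way to read this slide is that each stage maps a dataset to a dataset, so stages compose in order. The sketch below is an assumption about that structure, not the proposal's code; the stage bodies are placeholders, and a real pipeline would run each stage as a distributed job.

```python
# Sketch: each stage is a function from dataset to dataset, so the
# pipeline is ordered composition. Stage bodies are placeholders.
def sessionize(events):
    return events            # group raw events into per-user sessions

def normalize(sessions):
    return sessions          # canonicalize attributes and units

def generate_features_and_targets(sessions):
    return sessions          # emit (feature vector, target) examples

def train_model(examples):
    return examples          # fit a model; see the training slide below

def run_pipeline(data, stages):
    """Apply each stage in order to the dataset."""
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline([], [sessionize, normalize,
                           generate_features_and_targets, train_model])
```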
Targets • User actions of interest • Clicks on Ads & Content • Site & Page visits • Conversion Events • Purchases, Quote requests • Sign-up for membership, etc.
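A hedged sketch of target extraction follows: the set of target event types is illustrative (taken loosely from this slide), and the event field names are assumptions.

```python
# Illustrative target taxonomy; a real deployment would define its own.
TARGET_TYPES = {"ad_click", "content_click", "site_visit",
                "purchase", "quote_request", "signup"}

def extract_targets(events):
    """Yield (user_id, timestamp) for each event that is a target action."""
    for e in events:
        if e["type"] in TARGET_TYPES:
            yield e["user_id"], e["timestamp"]

events = [
    {"user_id": "u1", "type": "page_view", "timestamp": 1},
    {"user_id": "u1", "type": "purchase",  "timestamp": 2},
]
print(list(extract_targets(events)))   # [('u1', 2)]
```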
Features • Summary of user activities over a time-window • Aggregates, moving averages, rates over various time-windows • Incrementally updated
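To make "incrementally updated" concrete, here is one possible scheme, an exponentially decayed rate: yesterday's aggregate plus today's count gives today's aggregate without rescanning history. The decay constant is an assumption, and real pipelines may use fixed windows instead.

```python
def update_rate(prev_rate, new_count, decay=0.9):
    """Blend the previous rate with the latest count (exponential decay)."""
    return decay * prev_rate + (1.0 - decay) * new_count

rate = 0.0
for daily_count in [3, 0, 1, 5]:        # hypothetical daily event counts
    rate = update_rate(rate, daily_count)
print(round(rate, 3))                    # 0.809
```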
Joining Targets & Features • Target rates are very low: 0.01% to 1% • First, find users with a target in their sessions • Filter out activity from users without targets • Join feature vectors with targets
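The sketch below mirrors that order of operations using plain Python dicts as a stand-in for a distributed semi-join: collect the small set of users who have any target, drop everyone else's activity, then join. Names and data shapes are assumptions.

```python
def join_targets_features(features, targets):
    """features: {user_id: feature_vector}; targets: {user_id: label}.

    Because target rates are ~0.01% to 1%, the target-user set is tiny
    and can be broadcast; all other users' activity is filtered out.
    """
    users_with_targets = set(targets)
    return {
        uid: (features[uid], targets[uid])
        for uid in users_with_targets
        if uid in features               # keep only users we have features for
    }

features = {"u1": [0.2, 0.8], "u2": [0.1, 0.1]}
targets = {"u1": 1}
print(join_targets_features(features, targets))   # {'u1': ([0.2, 0.8], 1)}
```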
Model Training • Regressions • Boosted Decision Trees • Naive Bayes • Support Vector Machines • Maximum Entropy modeling
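As a minimal stand-in for one technique on this list, the sketch below fits a logistic regression with scikit-learn on synthetic data; a benchmark run at the proposed scales would use a distributed trainer instead. The ~1% positive rate echoes the join slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 20))               # synthetic feature vectors
y = np.zeros(1000, dtype=int)
y[:10] = 1                               # ~1% positives, per the join slide

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:2]))        # class probabilities for two users
```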
Model Testing • Apply models to features in hold-out data • Pleasantly parallel • Measure model effectiveness
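"Pleasantly parallel" here means hold-out partitions can be scored independently and the per-partition results merged afterward. The multiprocessing sketch below is a single-machine stand-in for whatever cluster framework actually runs this stage, with a trivial placeholder model.

```python
from multiprocessing import Pool

def predict(features):
    return 0                 # placeholder for the trained model's decision

def score_partition(partition):
    """Return (num_correct, num_examples) for one hold-out shard."""
    correct = sum(1 for feats, label in partition if predict(feats) == label)
    return correct, len(partition)

if __name__ == "__main__":
    shards = [[((0.1,), 0)] * 100, [((0.9,), 1)] * 100]   # toy shards
    with Pool(2) as pool:
        results = pool.map(score_partition, shards)       # independent scoring
    correct = sum(c for c, _ in results)
    total = sum(n for _, n in results)
    print("accuracy:", correct / total)                   # 0.5 on this toy data
```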
Upload to Serving • Upload scores to serving systems • Benchmark large-scale serving systems • NoSQL systems with rolling batch upserts and read-mostly access
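A sketch of the upload path as rolling batch upserts; `client.batch_upsert` is a hypothetical call, standing in for the bulk-write API of whichever NoSQL system is under test.

```python
def upload_scores(scores, client, batch_size=1000):
    """scores: iterable of (user_id, score); write in fixed-size batches."""
    batch = []
    for user_id, score in scores:
        batch.append((user_id, score))
        if len(batch) == batch_size:
            client.batch_upsert(batch)   # hypothetical bulk-upsert API
            batch = []
    if batch:                            # flush the final partial batch
        client.batch_upsert(batch)
```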
Proposal: 5 Classes • Tiny (100K entities, 10 events per entity) • Small (1M entities, 10 events per entity) • Medium (10M entities, 100 events per entity) • Large (100M entities, 1000 events per entity) • Huge (1B entities, 1000 events per entity)
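Straightforward arithmetic from this slide: multiplying entities by events per entity gives the raw event volume each class implies, which is useful for sizing generated inputs.

```python
classes = {
    "Tiny":   (100_000,         10),
    "Small":  (1_000_000,       10),
    "Medium": (10_000_000,      100),
    "Large":  (100_000_000,     1_000),
    "Huge":   (1_000_000_000,   1_000),
}
for name, (entities, events_per_entity) in classes.items():
    print(f"{name}: {entities * events_per_entity:,} events")
# Tiny: 1,000,000 events ... Huge: 1,000,000,000,000 events
```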
Proposal: Publish results for every stage • Data pipelines are constructed by mixing and matching stages • Different modeling techniques per class • Performance numbers must be published for every stage