Online Machine Learning with Distributed In-memory Clusters • Arshak Navruzyan, VP Product • Argyle Data • Acknowledgements • Nina Zumel, PhD – Win-Vector LLC
Contents • Business use-cases • DevOps / Continuous Deployment • Quality of Service (QoS) • Quality of Experience (QoE) • Machine learning methods for anomaly detection • Nearest Neighbor (NN) • Isolation Forest (iForest) • Random Forests (RF)
E-commerce operator use case • Large distributed search farm logs usage data into an in-memory grid • Mixed measurements • Continuous: query latency, resource utilization, etc. • Categorical: client IP, server IP, code version, etc. • ~1-3 TB of search logs every 12 hours • Find the anomalies • Data isn't easy to characterize because of its size • Anomalies span multiple variables (combinations of server, code version, latency) • No labeled data is available • At this scale, even a low false-positive rate produces a flood of alerts • Very few ML methods operate at this scale
In-memory grid provides a live view of the data • Distributed SQL store • Read / write optimized • Automatic placement (sharding) • Fast ingest (~1 million inserts per second) • Fast aggregations (~1 billion rows per second) • Holds ~30 days of data online • Each insert also sets a time-to-live (TTL) • Simple monitoring tool • D3.js-based horizon graphs • Socket.io / Node.js
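The grid itself is off-the-shelf infrastructure and no code was shown for it; purely as an illustration of the TTL-based retention idea, here is a toy in-process Python sketch (the constant and function names are mine, and a real distributed SQL store handles this internally):

```python
import time
from collections import deque

TTL_SECONDS = 30 * 24 * 3600   # keep roughly 30 days of data online

events = deque()               # (insert_time, row) pairs, oldest first

def insert(row, now=None):
    """Insert a row; its expiry is implied by the shared TTL."""
    now = time.time() if now is None else now
    events.append((now, row))

def purge_expired(now=None):
    """Drop rows older than the TTL, mimicking the grid's eviction."""
    now = time.time() if now is None else now
    while events and events[0][0] < now - TTL_SECONDS:
        events.popleft()
```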
Isn't there an open source project that does this? • Etsy's continuous deployment problem • 1.5B page views, 1M users, $117M in goods sold • 250+ committers, everyone deploys! • 30+ deploys to production a day • ~8 commits per deploy • How do you NOT break production?
How do you monitor an environment like Etsy's? • Usual suspects: IT monitoring tools like Ganglia, Nagios • But … "Not all things that break throw errors" • Etsy's Kale stack • StatsD – e.g. StatsD::increment("foo.bar") • Skyline – real-time anomaly detection system • Oculus – anomaly correlation system using dynamic time warping
Did they solve the problem? • Skyline's anomaly detection • Basic principle: "A metric is anomalous if its latest datapoint is over three standard deviations above the moving average" • Ensemble of methods: tail average, median absolute deviation, Grubbs' test, standard deviation from the moving average, least squares (3 sigma), Kolmogorov-Smirnov test, etc. • Results improve with the ensemble technique but are still very noisy • Non-normal distributions • Spike influence • Periodicity / seasonality
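To make the basic principle concrete, here is a minimal Python sketch of the three-sigma moving-average rule (my own illustration, not Skyline's code; the window size and threshold are arbitrary choices):

```python
import numpy as np

def is_anomalous(series, window=60, sigmas=3.0):
    """Flag the latest datapoint if it sits more than `sigmas` standard
    deviations above the moving average of the trailing window."""
    series = np.asarray(series, dtype=float)
    if len(series) <= window:
        return False                       # not enough history to judge
    tail = series[-window - 1:-1]          # trailing window, excluding the latest point
    mean, std = tail.mean(), tail.std()
    return std > 0 and (series[-1] - mean) > sigmas * std

# A flat, slightly noisy series with a sudden spike at the end
history = list(np.random.normal(10, 0.5, 200)) + [25.0]
print(is_anomalous(history))               # True: the spike trips the 3-sigma rule
```

Every weakness listed above is visible even in this toy: a single past spike inflates the standard deviation, seasonality shifts the mean, and non-normal data breaks the three-sigma assumption.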
Machine learning based anomaly detection • “Benchmarking algorithms for detecting anomalies in large datasets” -Uriel Carrasquilla (2010)
Nearest-neighbor approach • [Figure: 2-D scatter plot with two dense regions of normal points, N1 and N2, two isolated anomalies, o1 and o2, and a small anomalous cluster, O3] • N1 and N2 are regions of normal behaviour • Points o1 and o2 are anomalies • Points in region O3 are anomalies • Advantages • No need for assumptions about the data distribution • No need for pre-labelled anomalies • Supports categorical as well as continuous variables • Drawbacks • Computationally expensive – quadratic in data volume (every point has to be compared to every other point) • Existing implementations are batch-oriented • "Anomaly Detection: A Tutorial" – A. Banerjee, V. Chandola, V. Kumar, J. Srivastava (SIAM International Conference on Data Mining 2008)
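A brute-force Python sketch of a k-nearest-neighbour distance score (illustrative only; the O(n²) pairwise computation is exactly the drawback listed above):

```python
import numpy as np

def knn_anomaly_scores(X, k=5):
    """Score each point by its mean distance to its k nearest neighbours;
    a high score means the point is far from everything else."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]                             # k smallest per row
    return knn.mean(axis=1)

# A tight cluster plus one far-away point
X = np.vstack([np.random.normal(0, 1, (100, 2)), [[8.0, 8.0]]])
print(knn_anomaly_scores(X, k=5).argmax())  # index 100, the outlier
```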
Avoiding the quadratic computation time of NN • Bay & Schwabacher, SIGKDD 2003: "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule" (the ORCA algorithm) • Only the actual anomalies end up being compared to every point in the data set • If anomalies are rare, most points only get compared to a small constant number of points • Challenges that remain • A batch learner won't work at our scale • How would this work in a sharded environment? • Linear time is still long for "Big Data"
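A rough Python sketch of the pruning rule, simplified so the anomaly score is the distance to the single nearest neighbour (k = 1); parameter names are mine, not ORCA's:

```python
import heapq
import numpy as np

def orca_top_outliers(X, n_out=5, seed=0):
    """Keep the current top-n outliers and a cutoff equal to the weakest
    of them; while scanning neighbours of a candidate, abandon it as soon
    as its running nearest-neighbour distance drops below the cutoff."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    order = rng.permutation(len(X))           # randomised scan order
    top, cutoff = [], 0.0                     # min-heap of (score, index)
    for i in range(len(X)):
        nn_dist = np.inf
        for j in order:
            if j == i:
                continue
            nn_dist = min(nn_dist, np.linalg.norm(X[i] - X[j]))
            if nn_dist < cutoff:              # pruned: cannot beat the current top-n
                break
        else:                                 # scan finished without being pruned
            heapq.heappush(top, (nn_dist, i))
            if len(top) > n_out:
                heapq.heappop(top)
            if len(top) == n_out:
                cutoff = top[0][0]            # weakest score among the kept outliers
    return sorted(top, reverse=True)          # (score, index), most anomalous first
```

Because normal points sit close to many neighbours, they are usually pruned after a handful of comparisons, which is where the near-linear behaviour comes from.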
Nearest-neighbor as an online learner • ORCA concepts can be extended to online learning • Randomly pull a small sample of the data that is representative of a time period • Compare the streaming data (latest observations) to the representative set • The representative set moves with time (sliding window) so that noon is compared to a typical noon (this addresses periodicity) • Other enhancements • Every node needs the full reference set • A distributed cache moves the reference set between nodes • Locality-sensitive hashing
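A compact Python sketch of that idea: a randomly sampled reference set held in a sliding window (a count-based window stands in for the time window here; the class and parameter names are illustrative, not the speakers' implementation):

```python
import random
from collections import deque

import numpy as np

class SlidingReferenceNN:
    """Score each new observation by its distance to a sliding,
    randomly sampled reference set of recent observations."""

    def __init__(self, window_size=10_000, sample_prob=0.01, k=5):
        self.window = deque(maxlen=window_size)  # reference set slides with time
        self.sample_prob = sample_prob           # fraction of traffic kept as reference
        self.k = k

    def score(self, x):
        """Mean distance from x to its k nearest reference points."""
        if len(self.window) < self.k:
            return 0.0
        ref = np.asarray(self.window, dtype=float)
        d = np.linalg.norm(ref - np.asarray(x, dtype=float), axis=1)
        return float(np.sort(d)[: self.k].mean())

    def observe(self, x):
        """Score the new observation, then (maybe) add it to the reference set."""
        s = self.score(x)
        if random.random() < self.sample_prob:
            self.window.append(x)
        return s
```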
Random partitioning approach • "Isolation Forest" – Liu, Ting & Zhou, IEEE ICDM (2008) • When randomly partitioning the data into a tree, anomalies are likely to be isolated closer to the root of the tree, whereas normal points end up deeper in the tree structure • No need to profile normal data points • No distance or density measures • Gaps • No support for categorical attributes • Lacks explanatory power • No obvious way to turn it into an online method
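A minimal usage sketch, assuming scikit-learn is available (the talk did not name a specific library; the data and parameters here are toy values):

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # assumption: scikit-learn is available

# Toy data: a dense 2-D cluster plus three far-away points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), [[6, 6], [-7, 5], [8, -6]]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)              # -1 = anomaly, 1 = normal
scores = forest.decision_function(X)    # lower score = easier to isolate = more anomalous

print("flagged indices:", np.where(labels == -1)[0])
```

The gaps listed above still apply: categorical attributes must be encoded numerically first, and the fitted forest gives little insight into why a point was flagged.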
Ensemble of trees as an online learner • Random Forest (RF) basics • Build many trees, each trained on a random selection (with replacement) of the observations (in-bag / out-of-bag) • At each split, consider only a random subset of the variables (a common rule of thumb is the square root of the number of attributes) • "Online Random Forests" (ORF) algorithm of Saffari et al., ICCV-OLCV 2009 • How do you perform bagging online? (Oza, "Online Bagging and Boosting") • The number of times an individual tree sees an observation in batch bagging is approximately Poisson (λ = 1), so online bagging presents each observation to each tree k ~ Poisson(1) times • How do you grow random trees on the fly? • Continuously measure the information gain (or Gini) of a potential split
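A sketch of the Oza-style online-bagging update (illustrative; `tree.update` is a hypothetical incremental-learning method, not a scikit-learn API):

```python
import numpy as np

def online_bagging_update(trees, x, y, rng=None):
    """Oza-style online bagging: instead of a batch bootstrap, each
    incoming example (x, y) is shown to each base learner k times,
    with k drawn from Poisson(lambda = 1)."""
    rng = rng or np.random.default_rng()
    for tree in trees:
        k = rng.poisson(1.0)      # expected value 1 mimics the bootstrap weight
        for _ in range(k):
            tree.update(x, y)     # hypothetical incremental update method
```

The Poisson(1) trick works because in a batch bootstrap of n samples the count for any one example is Binomial(n, 1/n), which converges to Poisson(1) for large n; in particular roughly 37% of examples are never drawn, matching the usual out-of-bag fraction.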
To summarize • Machine learning can produce more accurate anomaly detection than the stats approach • Nearest Neighbor and Random Forests may be adapted to an online learner • In-memory distributed processing can improve the performance of such algorithms • Supervised methods for classification and regression will work similarly (we think …)
Thank you! www.linkedin.com/in/arshak