
Hazy: Making Statistical Applications Easier to Build and Maintain


Presentation Transcript


  1. Hazy: Making Statistical Applications Easier to Build and Maintain Christopher Ré BigLearn Collaborators listed throughout

  2. Big data is the future. Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?

  3. How big is 'big'? • Is it a GB? • That fits on your phone… • but LeCun & Farabet make a point in the same spirit. • Is it a TB? • A TB is big on Hadoop… • but it fits in main memory in many apps. • Is it a PB? • Do you have that problem? NB: I love Hadoop, Y!, and Cloudera. "Let's end Peta-philia" – John Doyle. Maybe something other than size is the common thread?

  4. Big Shifts Underlying Big Data • More signal (data) beats a more complex model nine times out of ten. -- This is why we acquire big data. • Move computation to the data – not data to the computation. • My bet: The key is the ability to quickly sling signal together – where data live.

  5. Hazy: Making Statistical Applications Easier to Build and Maintain

  6. Two Trends that Drive Hazy 1. Data in an unprecedented number of formats 2. Arms race for deeper understanding of data. Statistical tools attack both 1. and 2. Hazy = statistical + data-management techniques. Hazy's Goal: Understand common patterns in deploying & maintaining statistical tools on data.

  7. Hazy’s Thesis The next breakthrough in data analysis may not be a new data analysis algorithm… …but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.

  8. Outline Three Application Areas for Hazy Victor/Bismarck (in Parallel RDBMS) HogWild! (Shared Memory) Other Hazy Projects

  9. Data constantly generated on the Web, Twitter, blogs, and Facebook. Extract and classify sentiment about products, ad campaigns, and customer-facing entities. Often-deployed statistical tools: extraction (CRFs) & classification (SVMs). DARPA Machine Reading: "List members of the Brazilian Olympic Team in this corpus with years of membership"

  10. Simplified View of MR Data Flow Goals: High quality. Low Effort. Easy to test. Entity Linking Slot Filling Pairs of Entity Features Mention and Entity Features Infrastructure (Felix) ClueWeb Freebase News Corpus 500M Docs or 15TB

  11. Feng Niu, Ce Zhang, Josh Slauson: • www.cs.wisc.edu/hazy/felix Two Key Features of our system, Felix: 1. Felix allows many best-in-breed algorithms to work jointly. - Extraction (CRF), Classification (SVM) 2. Simple SQL-like rule language (Markov Logic) • Felix allows us to leverage Wikipedia, Freebase, and ClueWeb in one high-level language (TBs of data). Key: Marry simple, robust statistical tools with flexible rules – and be able to scale to TBs. TAC-KBP F1=39 vs. state-of-the-art systems F1=20.

  12. Rough Demo!

  13. A physicist interpolates sensor readings and uses regression to more deeply understand their data

  14. Digital Optical Module (DOM)

  15. Workflow of IceCube In Madison: Lots of data analysis. Via satellite: Interesting DOM readings At Pole: Algorithm says “Interesting!” In Ice: Detection occurs.

  16. Mark Wellons, Ben Recht, Francis Halzen, and Gary Hill. A Key Step: Detecting Tracks. Rules + Simple Regression. Mark's code goes to the South Pole in 2012. NB: Many data analysts are ex-physicists. The mathematical structure used to track neutrinos is similar to labeling text. (Experts: both are convex programs.)

  17. Social Scientists @ UW OCR and Speech A social scientist wants to extract the frequency of synonyms of English words in 18th century texts. Getting text is challenging! (statistical model errors) OCR & Speech Output of speech and OCR models similar to text-labeling models (transducers) Want to unify content management systems with RDBMS.

  18. Application Takeaways Statistical processing on large data enables a wide variety of new applications. Goal: Develop the ability to rapidly combine, deploy, and maintain existing algorithms. An algorithm (IGMs) that solves these models close to the data in (1) an RDBMS & (2) shared memory.

  19. Outline Three Application Areas for Hazy Victor/Bismarck (in Parallel RDBMS) HogWild! (Shared Memory) Other Hazy Projects

  20. Classify publications by subject area

  21. Example: Linear Models Label papers as DB Papers or Non-DB Papers. 1. Map each paper to Rd. 2. Classify via a plane x. (Figure: labeled points on either side of the plane x.) Question: how do we pick a good plane (x)?

  22. Example: Linear Models Input: labeled papers, each point labeled as DB (+) or not (–). (Figure: the plane x separates '+' from '–' points; x says DB Papers on one side, Non-DB on the other.) Idea: score each plane x; minimize misclassification, where yi is a paper vector and its label. Instead: change f to a smooth function of the distance to the plane, e.g., the squared distance (least squares), hinge loss (SVM), or log loss (logistic regression). Different f = different model.

  23. Experts: f and P are convex. Framework: Inverse Problems. Examples: • Paper Classification: yi is the paper with its label • Neutrino Tracking: yi is a DOM (sensor) reading • CRFs: yi is (document, labeling) • Netflix: yi is (user, movie, rating). Claim: a general data analysis technique that is no more difficult to compute than a SQL AVG.
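The three smooth surrogate losses named on this slide can be sketched concretely. A minimal illustration (function names are my own, not from the talk), where `score` is the signed margin y·⟨x, v⟩ for a labeled example (v, y) with y in {-1, +1}:

```python
import math

def squared_loss(score):      # least squares
    return (1.0 - score) ** 2

def hinge_loss(score):        # SVM
    return max(0.0, 1.0 - score)

def log_loss(score):          # logistic regression
    return math.log(1.0 + math.exp(-score))

def objective(loss, x, data):
    """Sum of per-example losses over labeled vectors (v, y):
    the P(x) being minimized. Different `loss` = different model."""
    return sum(loss(y * sum(xi * vi for xi, vi in zip(x, v)))
               for v, y in data)
```

Swapping the `loss` argument changes the statistical model while the surrounding machinery stays fixed, which is the point of the framework.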

  24. Background: Gradient Methods Gradient Methods: Iterative. 1. Start at current x, 2. Take gradient at x, 3. Move in opposite direction F(x)

  25. Incremental Gradient Methods Gradient Methods: Iterative. 1. Start at current x, 2. Approximate gradient at x by selecting j in {1…N}, 3. Move in opposite direction Select a single data item to approximate gradient
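The three steps above (pick a random data item j, take its gradient, move in the opposite direction) can be sketched in a few lines. This is a toy sketch, not the talk's actual Bismarck code; the 1-d least-squares gradient is my own example:

```python
import random

def igm(data, dim, grad_one, steps=1000, lr=0.01, seed=0):
    """Incremental gradient method: approximate the full gradient
    with the gradient at one randomly chosen example per step."""
    rng = random.Random(seed)
    x = [0.0] * dim
    for _ in range(steps):
        v, y = rng.choice(data)                     # select j in {1..N}
        g = grad_one(x, v, y)                       # gradient of f_j at x
        x = [xi - lr * gi for xi, gi in zip(x, g)]  # move opposite
    return x

def sq_grad(x, v, y):
    """Gradient of one squared-loss term (⟨x, v⟩ - y)^2."""
    err = sum(xi * vi for xi, vi in zip(x, v)) - y
    return [2.0 * err * vi for vi in v]
```

On data drawn from y = 2v, e.g. `igm([([1.0], 2.0), ([2.0], 4.0)], 1, sq_grad)`, the model converges to x ≈ [2.0] even though each step looks at only one example.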

  26. Incremental Gradient Methods (iGMs) Why use iGMs? iGMs converge to an optimum for many problems, but the real reason is: iGMs are fast. Technical connection: iGM processing is isomorphic to computing a SQL AVG (x is the accumulator, G is an expression on a single tuple). Solve statistical models in an RDBMS for free.

  27. iGMs ≈ User-Defined Aggregates Most RDBMSs: 3 functions in a UDA • Initialize(State) • Transition(State, data) • Terminate(State) AVG: state in R2 is (# terms, running total); Transition((n, T), d) = (n, T) + (1, d); Terminate((n, T)) = T/n. Gradient: state in Rd is the model x; Transition(x, yj) = x – G(x, yj); Terminate(x) = x.
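The slide's correspondence can be sketched by instantiating the same three UDA callbacks twice, once for AVG and once for a gradient pass. A minimal sketch (class and function names are mine, not an actual RDBMS API):

```python
class AvgUDA:
    """SQL AVG: state in R^2 is (# terms, running total)."""
    def initialize(self):
        return (0, 0.0)
    def transition(self, state, d):
        n, t = state
        return (n + 1, t + d)
    def terminate(self, state):
        n, t = state
        return t / n

class GradientUDA:
    """iGM pass: state in R^d is the model x.
    grad_one(x, yj) plays the role of G(x, yj) on one tuple."""
    def __init__(self, dim, grad_one, lr=0.1):
        self.dim, self.grad_one, self.lr = dim, grad_one, lr
    def initialize(self):
        return [0.0] * self.dim
    def transition(self, x, yj):
        g = self.grad_one(x, yj)
        return [xi - self.lr * gi for xi, gi in zip(x, g)]
    def terminate(self, x):
        return x

def run_uda(uda, rows):
    """What the RDBMS does with any UDA: fold transition over a table."""
    state = uda.initialize()
    for row in rows:
        state = uda.transition(state, row)
    return uda.terminate(state)
```

`run_uda` is the same driver for both aggregates; only the three callbacks differ, which is why an iGM runs inside a stock UDA interface.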

  28. MADLib & Oracle versions coming. Terminology from Gray et al. Some subtle differences… UDAs are typically commutative and algebraic; IGMs are formally neither, but are morally both. Morally commutative: different orders give different exact results, but any order converges to the same result. Morally algebraic: one can average models [Zinkevich et al. 10]. Key: IGMs work (in parallel) off the shelf in a UDA. (This is how we do logistic regression on the Web.)
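"Morally algebraic" means the parallel plan an RDBMS uses for algebraic aggregates still works: solve each data partition independently, then combine partial results by averaging the models, in the spirit of [Zinkevich et al. 10]. A toy 1-d least-squares sketch (my own stand-in, not the actual Bismarck implementation):

```python
def local_model(partition, lr=0.1):
    """One incremental-gradient pass over a single data partition
    (toy 1-d least squares on rows (v, y))."""
    x = 0.0
    for v, y in partition:
        x -= lr * 2.0 * (x * v - y) * v
    return x

def averaged_model(partitions):
    """Combine step: average the per-partition models, the way an
    RDBMS merges partial states of an algebraic aggregate."""
    models = [local_model(p) for p in partitions]
    return sum(models) / len(models)
```

Each partition converges toward the same optimum, so the average of the local models is a good model too, even though no exact order of updates is reproduced.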

  29. Hogwild! [Niu, Recht, Ré, & Wright NIPS 11] Code on Web.
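The Hogwild! idea is that threads update one shared model with no locking at all; when updates are sparse, the occasional race is harmless. A toy sketch only (the real Hogwild! is multicore native code; in CPython the GIL serializes bytecode, so this illustrates the lock-free structure, not true parallel speedup):

```python
import random
import threading

def hogwild_sgd(data, dim, n_threads=4, steps_per_thread=5000, lr=0.01):
    """Hogwild!-style SGD sketch: all threads read and write the
    shared model x with no locks. Sparse, mostly non-overlapping
    updates make lost races rare and harmless in expectation."""
    x = [0.0] * dim  # shared model, updated without any locking

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps_per_thread):
            v, y = rng.choice(data)
            err = sum(x[i] * vi for i, vi in enumerate(v)) - y
            for i, vi in enumerate(v):
                if vi:                           # touch only nonzeros
                    x[i] -= lr * 2.0 * err * vi  # racy, unlocked write

    threads = [threading.Thread(target=worker, args=(s,))
               for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

On consistent data such as y = 2v, every thread's update pushes the shared model toward the same optimum, so the unsynchronized result still converges.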

  30. Other Hazy work (student who did the real work in red) • Dealing with data in different formats • Staccato: Arun Kumar. Query OCR documents [PODS 10 & VLDB 12] • Hazy goes to the South Pole: Mark Wellons. Trigger software for IceCube • Machine Learning Algorithms • Hogwild!: run SGD on convex problems with no locking [NIPS 11] • Jellyfish: faster than Hogwild! on matrix completion [Opt Online 11] • Maintain Supervised Techniques on Evolving Corpora • iClassify: M. Levent Koc. Maintain classifiers on evolving examples [VLDB 11] • CRFLex: Aaron Feng. Maintain CRFs on evolving corpora [ICDE 12] • Populate Knowledge Bases from Text • Tuffy: SQL + weights to combine multiple models [VLDB 11] • Felix: running Markov Logic on millions of documents [preprint, arXiv 11] • Feng Niu, Ce Zhang, Josh Slauson, and Adel Ardalan • Infrastructure with Mike Cafarella (Michigan) • Manimal: optimizing MapReduce with relational operations [VLDB 11]

  31. Conclusion Many statistical data analyses are no more complicated to compute than a SQL AVG. The Hogwild! approach to multicore parallelism uses no locking, yet achieves good speedup. Code, data, and papers: www.cs.wisc.edu/hazy
