
Making Sense at Scale with Algorithms, Machines & People


Presentation Transcript


  1. Making Sense at Scale with Algorithms, Machines & People. Michael Franklin, EECS Computer Science, UC Berkeley. Emory University, December 7, 2011

  2. Defining the Big Data Problem: Size + Complexity = Answers that don’t meet quality, time, and cost requirements.

  3. The State of the Art: Algorithms (search), Machines (Watson/IBM), and People, each addressed largely in isolation today.

  4. Needed: A Holistic Approach that integrates Algorithms (search), Machines (Watson/IBM), and People.

  5. AMP Team: 8 (primary) faculty at Berkeley, spanning databases, machine learning, networking, security, systems, and more. 4 partner applications: Participatory Sensing, Mobile Millennium (Alex Bayen, Civil Engineering); Collective Discovery, Opinion Space (Ken Goldberg, IEOR); Urban Planning and Simulation, UrbanSim (Paul Waddell, Environmental Design); Cancer Genomics/Personalized Medicine (Taylor Sittler, UCSF).

  6. Big Data Opportunity (slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11): The Cancer Genome Atlas (TCGA): 20 cancer types x 500 patients each x (1 tumor genome + 1 normal genome) = 5 petabytes. David Haussler (UCSC): datacenter online 12/11? Intel to donate an AMP Lab cluster and place it next to the TCGA data.

  7. Berkeley Systems Lab Model. Industrial collaboration: the “Two Feet In” model.

  8. Berkeley Data Analytics System (BDAS): a top-to-bottom rethinking of the big data analytics stack, integrating Algorithms, Machines, and People. The stack diagram spans Control Center, Visualization, Analytics Libraries, Data Integration, Higher Query Languages / Processing Frameworks, Monitoring/Debugging, Quality Control, Resource Management, Crowd Interface, and Storage, serving roles from Data Collector and Data Source Selector to Algo/Tools builder, Data Analyst, and Infrastructure Builder.

  9. Algorithms: More Data = Better Answers. Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound). The slide’s plot shows the estimate converging to the true answer as the number of data points grows, with error bars on every answer.
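A minimal sketch, not from the talk, of what “error bars on every answer” can look like in practice: compute an estimate from progressively larger samples, attach a bootstrap confidence interval, and watch the interval tighten roughly as 1/sqrt(n). The synthetic data source and sample sizes are illustrative assumptions.

```python
# Toy illustration: bootstrap error bars shrink as data accrue.
import random

random.seed(0)
population = [random.gauss(10.0, 3.0) for _ in range(200_000)]  # stand-in data source


def bootstrap_ci(sample, n_boot=200, alpha=0.05):
    """Percentile bootstrap confidence interval for the sample mean."""
    n = len(sample)
    means = []
    for _ in range(n_boot):
        resample = [random.choice(sample) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


for n in (100, 1_000, 10_000):
    sample = population[:n]
    est = sum(sample) / n
    lo, hi = bootstrap_ci(sample)
    print(f"n={n:>6}  estimate={est:6.3f}  95% CI=({lo:6.3f}, {hi:6.3f})  width={hi - lo:5.3f}")
```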

  10. Towards ML/Systems Co-Design. Some ingredients of a system that can estimate and manage statistical risk: the distributed bootstrap (bag of little bootstraps, BLB), stratified subsampling, active sampling (cf. crowdsourcing), bias estimation (especially with crowd-sourced data), distributed optimization, streaming versions of classical ML algorithms, and a streaming distributed bootstrap. All of these must be scalable and robust.
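The bag of little bootstraps ingredient can be sketched in a few lines. This toy, single-machine version (the subset count, gamma, and resample count are illustrative choices, not the method’s prescribed defaults) shows the core trick: simulate full-size resamples over a small subset using multinomial counts, then average the per-subset quality assessments.

```python
# Toy single-machine sketch of the Bag of Little Bootstraps (BLB) idea.
import random
from statistics import mean, stdev

random.seed(0)
n = 20_000
data = [random.gauss(5.0, 2.0) for _ in range(n)]


def blb_stderr(data, num_subsets=5, gamma=0.7, n_boot=30):
    n = len(data)
    b = int(n ** gamma)                        # small subset size, b = n^gamma
    assessments = []
    for _ in range(num_subsets):
        subset = random.sample(data, b)        # subsample without replacement
        boot_stats = []
        for _ in range(n_boot):
            # Multinomial counts stand in for an n-point resample of the b-point subset.
            counts = [0] * b
            for _ in range(n):
                counts[random.randrange(b)] += 1
            boot_stats.append(sum(c * x for c, x in zip(counts, subset)) / n)
        assessments.append(stdev(boot_stats))  # per-subset estimate of variability
    return mean(assessments)                   # average the assessments across subsets


print("BLB std. error of the mean:", blb_stderr(data))
print("analytic approximation:    ", stdev(data) / n ** 0.5)
```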

  11. Machines Agenda: a new software stack to effectively manage cluster resources and effectively extract value out of big data. Projects: a “Datacenter OS” that extends the Mesos distributed resource manager; a common runtime for structured, unstructured, streaming, and sampled data; new processing frameworks and storage systems, e.g., Spark, a parallel environment for iterative algorithms; and the QuickSilver query processor, which lets users navigate the trade-off space (quality, time, and cost) for complex queries.
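For a flavor of what Spark’s in-memory model buys iterative algorithms, here is a hedged PySpark sketch (assuming a local Spark installation; the tiny least-squares fit, its data, and its step size are invented for illustration): the dataset is cached once and every iteration rescans it in memory rather than re-reading from storage.

```python
# Sketch of an iterative computation over a cached RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# Cache the (x, y) points in memory once; every iteration below reuses them.
points = sc.parallelize([(i / 100_000, 3.0 * i / 100_000) for i in range(100_000)]).cache()

w = 0.0
for _ in range(10):
    # One full pass per iteration: gradient of the squared error for y ~ w * x.
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= 2.0 * grad

print("fitted weight (should approach 3.0):", w)
sc.stop()
```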

  12. QuickSilver: Where Do We Want to Go? Today: “simple” queries on PBs of data take hours. Goal: compute complex queries on PBs of data in < x seconds with < y% error. Ideal: sub-second arbitrary queries on PBs of data.
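A hedged sketch of the quality/time/cost knob (my own toy, not the actual QuickSilver design): answer an aggregate over a large table from a growing random sample and stop as soon as the estimated error bar drops below a user-supplied relative-error target. The table contents and thresholds are invented.

```python
# Toy sampling-based approximate aggregation with a stopping rule.
import math
import random

random.seed(0)
table = [random.expovariate(1 / 40.0) for _ in range(2_000_000)]  # stand-in fact table


def approx_avg(rows, target_rel_error=0.01, batch=5_000, z=1.96):
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        for v in random.sample(rows, batch):          # sample instead of a full scan
            total += v
            total_sq += v * v
        n += batch
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)
        half_width = z * math.sqrt(var / n)           # normal-approximation error bar
        if half_width <= target_rel_error * abs(mean) or n >= len(rows):
            return mean, half_width, n


est, err, n_used = approx_avg(table)
print(f"avg ~ {est:.2f} +/- {err:.2f} (95% CI) using {n_used} of {len(table)} rows")
```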

  13. People: make people an integrated part of the system! People supply data and activity to the machines and algorithms, pose questions, and receive answers. Leverage human activity and human intelligence (crowdsourcing). Use the crowd to: find missing data, integrate data, make subjective comparisons, recognize patterns, and solve problems.

  14. Human-Tolerant Computing: putting people throughout the analytics lifecycle raises challenges of inconsistent answer quality, incentives, latency and variance, and open vs. closed worlds. Hybrid human/machine design approaches: statistical methods for error and bias, quality-conscious interface design, and cost (time, quality)-based optimization.

  15. CROWDSOURCING EXAMPLES

  16. Citizen Science: NASA “Clickworkers,” circa 2000

  17. Citizen Journalism/Participatory Sensing

  18. Expert Advice

  19. Data Collection: Freebase

  20. One View of Crowdsourcing From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.

  21. Industry View

  22. Participatory Culture - Explicit

  23. Participatory Culture – Implicit. John Murrell, GMSV, 9/17/09: “…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.”

  24. Types of Tasks Inspired by the report: “Paid Crowdsourcing”, Smartsheet.com, 9/15/2009

  25. Amazon Mechanical Turk (AMT)

  26. A Programmable Interface: the Amazon Mechanical Turk API. Requestors place Human Intelligence Tasks (HITs) via the “createHit()” call, with parameters including the number of replicas, expiration, and user interface. Requestors approve jobs and payment via “getAssignments()” and “approveAssignments()”. Workers (a.k.a. “turkers”) choose jobs, do them, and get paid.
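For concreteness, here is a hedged sketch of that requester workflow using the modern boto3 MTurk client (which postdates this talk). The question XML, reward, and sandbox endpoint are placeholder assumptions, not values from the slides.

```python
# Sketch: post a HIT, then collect and approve its assignments.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>hq_address</QuestionIdentifier>
    <QuestionContent><Text>What is the headquarters address of IBM?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

# createHit(): post the task with replication, expiration, and UI parameters.
hit = mturk.create_hit(
    Title="Find a company's HQ address",
    Description="Fill in one missing value",
    Reward="0.01",
    MaxAssignments=3,                   # number of replicas
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
hit_id = hit["HIT"]["HITId"]

# getAssignments() / approveAssignments(): collect and pay for completed work.
for a in mturk.list_assignments_for_hit(HITId=hit_id)["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])
```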

  27. Worker’s View

  28. Requestor’s View

  29. CrowdDB: A Radical New Idea? “The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.” (J.C.R. Licklider, “Man-Computer Symbiosis,” 1960)

  30. Problem: DB-hard Queries. SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM" returns 0 rows. Problem: entity resolution (the table may list the company as, e.g., “International Business Machines”).

  31. DB-hard Queries. SELECT Market_Cap FROM Companies WHERE Company_Name = "Apple" returns 0 rows. Problem: the closed-world assumption (the database can only answer from the data it already holds).

  32. DB-hard Queries. SELECT Top_1 Image FROM Pictures WHERE Topic = "Business Success" ORDER BY Relevance returns 0 rows. Problem: subjective comparison.

  33. CrowdDB (M. Franklin et al., “CrowdDB: Answering Queries with Crowdsourcing,” SIGMOD 2011): use the crowd to answer DB-hard queries. Where to use the crowd: to find missing data, make subjective comparisons, and recognize patterns. But not: anything the computer already does well.

  34. CrowdSQL. DDL extensions: crowdsourced columns and crowdsourced tables, e.g.
  CREATE TABLE company (name STRING PRIMARY KEY, hq_address CROWD STRING);
  CREATE CROWDTABLE department (university STRING, department STRING, phone_no STRING) PRIMARY KEY (university, department);
  DML extensions: CROWDEQUAL and CROWDORDER operators (currently UDFs), e.g.
  SELECT * FROM companies WHERE Name ~= "Big Blue";
  SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CROWDORDER(p, "Which pic shows better %subject");

  35. User Interface Generation: a clear UI is key to response time and answer quality. The SQL schema can be leveraged to auto-generate the UI (as in Oracle Forms and similar tools).
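A minimal sketch of the schema-driven idea, with invented table and column names: render the known, machine-supplied values as read-only context and generate an input only for each crowdsourced column.

```python
# Toy form generator driven by which columns are crowdsourced.
def generate_form(table, known_values, crowd_columns):
    parts = [f"<h3>Please complete this {table} record</h3>"]
    for col, val in known_values.items():
        parts.append(f"<p><b>{col}</b>: {val}</p>")                 # read-only context
    for col in crowd_columns:
        parts.append(f'<label>{col}: <input name="{col}" type="text"/></label>')
    return "<form>" + "\n".join(parts) + "</form>"


print(generate_form("company",
                    {"name": "IBM"},      # known (machine-supplied) values
                    ["hq_address"]))      # CROWD column to be filled in by a worker
```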

  36. Subjective Comparisons: MTFunction implements the CROWDEQUAL and CROWDORDER comparisons. It takes a description and a type (equal, order) parameter; quality control is again based on majority vote; ordering can be further optimized (e.g., three-way vs. two-way comparisons).

  37. Does It Work? Picture ordering. Query: SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CROWDORDER(p, "Which pic shows better %subject"). Data size: 30 subject areas with 8 pictures each. Batching: 4 orderings per HIT. Replication: 3 assignments per HIT. Price: 1 cent per HIT. Results compare turker votes, turker rankings, and the expert ranking.

  38. User Interface vs. Quality. To get information about professors and their departments, three form designs were compared: department first (about 20% error rate), professor first (about 80% error rate), and a de-normalized probe (about 10% error rate).

  39. Can we build a “Crowd Optimizer”? Select * From Restaurant Where city = …

  40. Price vs. Response Time (5 assignments, 100 HITs)

  41. Turker Affinity and Errors: Turker Rank. [Franklin, Kossmann, Kraska, Ramesh, Xin: CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011]

  42. Turker Affinity and Errors: Turker Rank. [Franklin, Kossmann, Kraska, Ramesh, Xin: CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011]

  43. Can we build a “Crowd Optimizer”? SELECT * FROM Restaurant WHERE city = … Worker forum comments about the requester: “Hmm... I smell lab rat material.” “Be very wary of doing any work for this requester…” “I would do work for this requester again.” “I advise not clicking on his ‘information about restaurants’ hits. This guy should be shunned.”

  44. Processor Relations? A worker forum review of the Tim Kraska HIT Group: “I recently did 299 HITs for this requester. … Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted.” Posted ratings: fair 2/5, fast 4/5, pay 2/5, comm 0/5.

  45. Open World = No Semantics? SELECT * FROM Crowd_Sourced_Table. What does this query return? In the old, closed world it was just a table scan. In the crowdsourced world: which answers are “right”? When do we stop? Biostatistics to the rescue?

  46. Open World = No Semantics? SELECT * FROM Crowd_Sourced_Table. The slide shows a species acquisition curve for data: as answers arrive, the rate at which new distinct values appear suggests how many remain unseen.
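One way biostatistics can help, sketched here with invented answer data: treat distinct crowd answers like species and use the Chao1 richness estimator to judge how many distinct values likely remain unseen, which in turn informs when to stop asking.

```python
# Toy Chao1 estimate of how many distinct answers exist in an open-world query.
from collections import Counter

answers = ["thai", "sushi", "pizza", "thai", "burgers", "pizza",
           "thai", "tacos", "sushi", "ramen", "thai", "pizza"]   # crowd answers so far

counts = Counter(answers)
observed = len(counts)                              # distinct answers seen
f1 = sum(1 for c in counts.values() if c == 1)      # answers seen exactly once
f2 = sum(1 for c in counts.values() if c == 2)      # answers seen exactly twice

# Chao1: estimated total = observed + f1^2 / (2 * f2), with a fallback when f2 == 0.
if f2 > 0:
    chao1 = observed + (f1 * f1) / (2 * f2)
else:
    chao1 = observed + f1 * (f1 - 1) / 2
print(f"seen {observed} distinct answers; Chao1 estimate of the total: {chao1:.1f}")
```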

  47. Why the Crowd is Different: classical approaches don’t quite work. Incorrect answers (Chicago is not a state; how do you spell Mississippi?). Streakers vs. samplers: individuals sample without replacement, and there is worker/task affinity. List walking (e.g., Googling “ice cream flavors”). These effects can be detected and mitigated to some extent.

  48. How Can You Trust the Crowd? General techniques: approval-rate / demographic restrictions, qualification tests, gold sets/honey pots, redundancy, verification/review, and justification/automatic verification. Plus query-specific techniques and worker relationship management.
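A hedged sketch combining two of these techniques, with all worker data invented: screen workers against a gold set (honey pots), then aggregate the remaining answers by redundancy plus majority vote.

```python
# Toy gold-set screening followed by majority vote over trusted workers.
from collections import Counter, defaultdict

gold = {"q_gold_1": "yes", "q_gold_2": "no"}            # honey-pot questions
responses = [                                           # (worker, question, answer)
    ("w1", "q_gold_1", "yes"), ("w1", "q_gold_2", "no"), ("w1", "q7", "Armonk"),
    ("w2", "q_gold_1", "yes"), ("w2", "q_gold_2", "yes"), ("w2", "q7", "New York"),
    ("w3", "q_gold_1", "yes"), ("w3", "q_gold_2", "no"), ("w3", "q7", "Armonk"),
]

# 1) Gold-set screening: keep workers who get at least 80% of the gold right.
score = defaultdict(lambda: [0, 0])
for w, q, a in responses:
    if q in gold:
        score[w][0] += (a == gold[q])
        score[w][1] += 1
trusted = {w for w, (ok, tot) in score.items() if ok / tot >= 0.8}

# 2) Redundancy + majority vote over trusted workers only.
votes = defaultdict(Counter)
for w, q, a in responses:
    if w in trusted and q not in gold:
        votes[q][a] += 1
for q, counter in votes.items():
    answer, n = counter.most_common(1)[0]
    print(f"{q}: {answer} ({n}/{sum(counter.values())} trusted votes)")
```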

  49. Making Sense at Scale: data size is only part of the challenge; balance quality, cost, and time for a given problem. To address it, we must holistically integrate Algorithms, Machines, and People. amplab.cs.berkeley.edu
