
Making Sense at Scale with Algorithms, Machines & People


Presentation Transcript


  1. Making Sense at Scale with Algorithms, Machines & People. Michael Franklin, EECS Computer Science, UC Berkeley. Emory University, December 7, 2011

  2. Defining the Big Data Problem: Size + Complexity = Answers that don’t meet quality, time, and cost requirements.

  3. The State of the Art: Algorithms (search), Machines (Watson/IBM), and People, each addressed largely in isolation today.

  4. Needed: A Holistic Approach that integrates Algorithms (search), Machines (Watson/IBM), and People.

  5. AMP Team: 8 (primary) faculty at Berkeley, spanning databases, machine learning, networking, security, systems, and more. 4 partner applications: Participatory Sensing, Mobile Millennium (Alex Bayen, Civil Engineering); Collective Discovery, Opinion Space (Ken Goldberg, IEOR); Urban Planning and Simulation, UrbanSim (Paul Waddell, Environmental Design); Cancer Genomics/Personalized Medicine (Taylor Sittler, UCSF).

  6. Big Data Opportunity (slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11): The Cancer Genome Atlas (TCGA): 20 cancer types x 500 patients each x (1 tumor genome + 1 normal genome) = 5 petabytes. David Haussler (UCSC): datacenter online 12/11? Intel to donate an AMP Lab cluster and place it next to the TCGA data.

  7. Berkeley Systems Lab Model. Industrial collaboration: the “Two Feet In” model.

  8. Berkeley Data Analytics System (BDAS): a top-to-bottom rethinking of the big data analytics stack, integrating Algorithms, Machines, and People. The stack diagram spans Control Center, Visualization, Analytics Libraries, Data Integration, Higher Query Languages / Processing Frameworks, Monitoring/Debugging, Quality Control, Resource Management, Crowd Interface, and Storage, serving roles from Data Collector and Data Source Selector to Algo/Tools builder, Data Analyst, and Infrastructure Builder.

  9. Algorithms: More Data = Better Answers. Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound). The slide’s plot shows the estimate converging to the true answer as the number of data points grows, with error bars on every answer.
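A minimal sketch, not from the talk, of what “error bars on every answer” can look like in practice: compute an estimate from progressively larger samples, attach a bootstrap confidence interval, and watch the interval tighten roughly as 1/sqrt(n). The synthetic data source and sample sizes are illustrative assumptions.

```python
# Toy illustration: bootstrap error bars shrink as data accrue.
import random

random.seed(0)
population = [random.gauss(10.0, 3.0) for _ in range(200_000)]  # stand-in data source


def bootstrap_ci(sample, n_boot=200, alpha=0.05):
    """Percentile bootstrap confidence interval for the sample mean."""
    n = len(sample)
    means = []
    for _ in range(n_boot):
        resample = [random.choice(sample) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


for n in (100, 1_000, 10_000):
    sample = population[:n]
    est = sum(sample) / n
    lo, hi = bootstrap_ci(sample)
    print(f"n={n:>6}  estimate={est:6.3f}  95% CI=({lo:6.3f}, {hi:6.3f})  width={hi - lo:5.3f}")
```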

  10. Towards ML/Systems Co-Design. Some ingredients of a system that can estimate and manage statistical risk: the distributed bootstrap (bag of little bootstraps, BLB), stratified subsampling, active sampling (cf. crowdsourcing), bias estimation (especially with crowd-sourced data), distributed optimization, streaming versions of classical ML algorithms, and a streaming distributed bootstrap. All of these must be scalable and robust.
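The bag of little bootstraps ingredient can be sketched in a few lines. This toy, single-machine version (the subset count, gamma, and resample count are illustrative choices, not the method’s prescribed defaults) shows the core trick: simulate full-size resamples over a small subset using multinomial counts, then average the per-subset quality assessments.

```python
# Toy single-machine sketch of the Bag of Little Bootstraps (BLB) idea.
import random
from statistics import mean, stdev

random.seed(0)
n = 20_000
data = [random.gauss(5.0, 2.0) for _ in range(n)]


def blb_stderr(data, num_subsets=5, gamma=0.7, n_boot=30):
    n = len(data)
    b = int(n ** gamma)                        # small subset size, b = n^gamma
    assessments = []
    for _ in range(num_subsets):
        subset = random.sample(data, b)        # subsample without replacement
        boot_stats = []
        for _ in range(n_boot):
            # Multinomial counts stand in for an n-point resample of the b-point subset.
            counts = [0] * b
            for _ in range(n):
                counts[random.randrange(b)] += 1
            boot_stats.append(sum(c * x for c, x in zip(counts, subset)) / n)
        assessments.append(stdev(boot_stats))  # per-subset estimate of variability
    return mean(assessments)                   # average the assessments across subsets


print("BLB std. error of the mean:", blb_stderr(data))
print("analytic approximation:    ", stdev(data) / n ** 0.5)
```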

  11. Machines Agenda: a new software stack to effectively manage cluster resources and effectively extract value out of big data. Projects: a “Datacenter OS” that extends the Mesos distributed resource manager; a common runtime for structured, unstructured, streaming, and sampled data; new processing frameworks and storage systems, e.g., Spark, a parallel environment for iterative algorithms; and the QuickSilver query processor, which lets users navigate the trade-off space (quality, time, and cost) for complex queries.
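For a flavor of what Spark’s in-memory model buys iterative algorithms, here is a hedged PySpark sketch (assuming a local Spark installation; the tiny least-squares fit, its data, and its step size are invented for illustration): the dataset is cached once and every iteration rescans it in memory rather than re-reading from storage.

```python
# Sketch of an iterative computation over a cached RDD.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# Cache the (x, y) points in memory once; every iteration below reuses them.
points = sc.parallelize([(i / 100_000, 3.0 * i / 100_000) for i in range(100_000)]).cache()

w = 0.0
for _ in range(10):
    # One full pass per iteration: gradient of the squared error for y ~ w * x.
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= 2.0 * grad

print("fitted weight (should approach 3.0):", w)
sc.stop()
```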

  12. QuickSilver: Where Do We Want to Go? Today: “simple” queries on PBs of data take hours. Goal: compute complex queries on PBs of data in < x seconds with < y% error. Ideal: sub-second arbitrary queries on PBs of data.
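A hedged sketch of the quality/time/cost knob (my own toy, not the actual QuickSilver design): answer an aggregate over a large table from a growing random sample and stop as soon as the estimated error bar drops below a user-supplied relative-error target. The table contents and thresholds are invented.

```python
# Toy sampling-based approximate aggregation with a stopping rule.
import math
import random

random.seed(0)
table = [random.expovariate(1 / 40.0) for _ in range(2_000_000)]  # stand-in fact table


def approx_avg(rows, target_rel_error=0.01, batch=5_000, z=1.96):
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        for v in random.sample(rows, batch):          # sample instead of a full scan
            total += v
            total_sq += v * v
        n += batch
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)
        half_width = z * math.sqrt(var / n)           # normal-approximation error bar
        if half_width <= target_rel_error * abs(mean) or n >= len(rows):
            return mean, half_width, n


est, err, n_used = approx_avg(table)
print(f"avg ~ {est:.2f} +/- {err:.2f} (95% CI) using {n_used} of {len(table)} rows")
```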

  13. People: make people an integrated part of the system! People supply data and activity to the machines and algorithms, pose questions, and receive answers. Leverage human activity and human intelligence (crowdsourcing). Use the crowd to: find missing data, integrate data, make subjective comparisons, recognize patterns, and solve problems.

  14. Human-Tolerant Computing: putting people throughout the analytics lifecycle raises challenges of inconsistent answer quality, incentives, latency and variance, and open vs. closed worlds. Hybrid human/machine design approaches: statistical methods for error and bias, quality-conscious interface design, and cost (time, quality)-based optimization.

  15. CROWDSOURCING EXAMPLES

  16. Citizen Science: NASA “Clickworkers,” circa 2000

  17. Citizen Journalism/Participatory Sensing

  18. Expert Advice

  19. Data Collection: Freebase

  20. One View of Crowdsourcing From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.

  21. Industry View

  22. Participatory Culture - Explicit

  23. Participatory Culture – Implicit. John Murrell, GMSV, 9/17/09: “…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.”

  24. Types of Tasks Inspired by the report: “Paid Crowdsourcing”, Smartsheet.com, 9/15/2009

  25. Amazon Mechanical Turk (AMT)

  26. A Programmable Interface: the Amazon Mechanical Turk API. Requestors place Human Intelligence Tasks (HITs) via the “createHit()” call, with parameters including the number of replicas, expiration, and user interface. Requestors approve jobs and payment via “getAssignments()” and “approveAssignments()”. Workers (a.k.a. “turkers”) choose jobs, do them, and get paid.
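For concreteness, here is a hedged sketch of that requester workflow using the modern boto3 MTurk client (which postdates this talk). The question XML, reward, and sandbox endpoint are placeholder assumptions, not values from the slides.

```python
# Sketch: post a HIT, then collect and approve its assignments.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>hq_address</QuestionIdentifier>
    <QuestionContent><Text>What is the headquarters address of IBM?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

# createHit(): post the task with replication, expiration, and UI parameters.
hit = mturk.create_hit(
    Title="Find a company's HQ address",
    Description="Fill in one missing value",
    Reward="0.01",
    MaxAssignments=3,                   # number of replicas
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
hit_id = hit["HIT"]["HITId"]

# getAssignments() / approveAssignments(): collect and pay for completed work.
for a in mturk.list_assignments_for_hit(HITId=hit_id)["Assignments"]:
    mturk.approve_assignment(AssignmentId=a["AssignmentId"])
```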

  27. Worker’s View

  28. Requestor’s View

  29. CrowdDB: A Radical New Idea? “The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.” (J.C.R. Licklider, “Man-Computer Symbiosis,” 1960)

  30. Problem: DB-hard Queries. SELECT Market_Cap FROM Companies WHERE Company_Name = "IBM" returns 0 rows. Problem: entity resolution (the table may list the company as, e.g., “International Business Machines”).

  31. DB-hard Queries. SELECT Market_Cap FROM Companies WHERE Company_Name = "Apple" returns 0 rows. Problem: the closed-world assumption (the database can only answer from the data it already holds).

  32. DB-hard Queries. SELECT Top_1 Image FROM Pictures WHERE Topic = "Business Success" ORDER BY Relevance returns 0 rows. Problem: subjective comparison.

  33. CrowdDB (M. Franklin et al., “CrowdDB: Answering Queries with Crowdsourcing,” SIGMOD 2011): use the crowd to answer DB-hard queries. Where to use the crowd: to find missing data, make subjective comparisons, and recognize patterns. But not: anything the computer already does well.

  34. CrowdSQL. DDL extensions: crowdsourced columns and crowdsourced tables, e.g.
  CREATE TABLE company (name STRING PRIMARY KEY, hq_address CROWD STRING);
  CREATE CROWDTABLE department (university STRING, department STRING, phone_no STRING) PRIMARY KEY (university, department);
  DML extensions: CROWDEQUAL and CROWDORDER operators (currently UDFs), e.g.
  SELECT * FROM companies WHERE Name ~= "Big Blue";
  SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CROWDORDER(p, "Which pic shows better %subject");

  35. User Interface Generation: a clear UI is key to response time and answer quality. The SQL schema can be leveraged to auto-generate the UI (as in Oracle Forms and similar tools).
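A minimal sketch of the schema-driven idea, with invented table and column names: render the known, machine-supplied values as read-only context and generate an input only for each crowdsourced column.

```python
# Toy form generator driven by which columns are crowdsourced.
def generate_form(table, known_values, crowd_columns):
    parts = [f"<h3>Please complete this {table} record</h3>"]
    for col, val in known_values.items():
        parts.append(f"<p><b>{col}</b>: {val}</p>")                 # read-only context
    for col in crowd_columns:
        parts.append(f'<label>{col}: <input name="{col}" type="text"/></label>')
    return "<form>" + "\n".join(parts) + "</form>"


print(generate_form("company",
                    {"name": "IBM"},      # known (machine-supplied) values
                    ["hq_address"]))      # CROWD column to be filled in by a worker
```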

  36. Subjective Comparisons: MTFunction implements the CROWDEQUAL and CROWDORDER comparisons. It takes a description and a type (equal, order) parameter; quality control is again based on majority vote; ordering can be further optimized (e.g., three-way vs. two-way comparisons).

  37. Does It Work? Picture ordering. Query: SELECT p FROM picture WHERE subject = "Golden Gate Bridge" ORDER BY CROWDORDER(p, "Which pic shows better %subject"). Data size: 30 subject areas with 8 pictures each. Batching: 4 orderings per HIT. Replication: 3 assignments per HIT. Price: 1 cent per HIT. Results compare turker votes, turker rankings, and the expert ranking.

  38. User Interface vs. Quality. To get information about professors and their departments, three form designs were compared: department first (about 20% error rate), professor first (about 80% error rate), and a de-normalized probe (about 10% error rate).

  39. Can we build a “Crowd Optimizer”? Select * From Restaurant Where city = …

  40. Price vs. Response Time (5 assignments, 100 HITs)

  41. Turker Affinity and Errors: Turker Rank. [Franklin, Kossmann, Kraska, Ramesh, Xin: CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011]

  42. Turker Affinity and Errors: Turker Rank. [Franklin, Kossmann, Kraska, Ramesh, Xin: CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011]

  43. Can we build a “Crowd Optimizer”? SELECT * FROM Restaurant WHERE city = … Worker forum comments about the requester: “Hmm... I smell lab rat material.” “Be very wary of doing any work for this requester…” “I would do work for this requester again.” “I advise not clicking on his ‘information about restaurants’ hits. This guy should be shunned.”

  44. Processor Relations? A worker forum review of the Tim Kraska HIT Group: “I recently did 299 HITs for this requester. … Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted.” Posted ratings: fair 2/5, fast 4/5, pay 2/5, comm 0/5.

  45. Open World = No Semantics? SELECT * FROM Crowd_Sourced_Table. What does this query return? In the old, closed world it was just a table scan. In the crowdsourced world: which answers are “right”? When do we stop? Biostatistics to the rescue?

  46. Open World = No Semantics? SELECT * FROM Crowd_Sourced_Table. The slide shows a species acquisition curve for data: as answers arrive, the rate at which new distinct values appear suggests how many remain unseen.
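One way biostatistics can help, sketched here with invented answer data: treat distinct crowd answers like species and use the Chao1 richness estimator to judge how many distinct values likely remain unseen, which in turn informs when to stop asking.

```python
# Toy Chao1 estimate of how many distinct answers exist in an open-world query.
from collections import Counter

answers = ["thai", "sushi", "pizza", "thai", "burgers", "pizza",
           "thai", "tacos", "sushi", "ramen", "thai", "pizza"]   # crowd answers so far

counts = Counter(answers)
observed = len(counts)                              # distinct answers seen
f1 = sum(1 for c in counts.values() if c == 1)      # answers seen exactly once
f2 = sum(1 for c in counts.values() if c == 2)      # answers seen exactly twice

# Chao1: estimated total = observed + f1^2 / (2 * f2), with a fallback when f2 == 0.
if f2 > 0:
    chao1 = observed + (f1 * f1) / (2 * f2)
else:
    chao1 = observed + f1 * (f1 - 1) / 2
print(f"seen {observed} distinct answers; Chao1 estimate of the total: {chao1:.1f}")
```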

  47. Why the Crowd is Different: classical approaches don’t quite work. Incorrect answers (Chicago is not a state; how do you spell Mississippi?). Streakers vs. samplers: individuals sample without replacement, and there is worker/task affinity. List walking (e.g., Googling “ice cream flavors”). These effects can be detected and mitigated to some extent.

  48. How Can You Trust the Crowd? General techniques: approval-rate / demographic restrictions, qualification tests, gold sets/honey pots, redundancy, verification/review, and justification/automatic verification. Plus query-specific techniques and worker relationship management.
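A hedged sketch combining two of these techniques, with all worker data invented: screen workers against a gold set (honey pots), then aggregate the remaining answers by redundancy plus majority vote.

```python
# Toy gold-set screening followed by majority vote over trusted workers.
from collections import Counter, defaultdict

gold = {"q_gold_1": "yes", "q_gold_2": "no"}            # honey-pot questions
responses = [                                           # (worker, question, answer)
    ("w1", "q_gold_1", "yes"), ("w1", "q_gold_2", "no"), ("w1", "q7", "Armonk"),
    ("w2", "q_gold_1", "yes"), ("w2", "q_gold_2", "yes"), ("w2", "q7", "New York"),
    ("w3", "q_gold_1", "yes"), ("w3", "q_gold_2", "no"), ("w3", "q7", "Armonk"),
]

# 1) Gold-set screening: keep workers who get at least 80% of the gold right.
score = defaultdict(lambda: [0, 0])
for w, q, a in responses:
    if q in gold:
        score[w][0] += (a == gold[q])
        score[w][1] += 1
trusted = {w for w, (ok, tot) in score.items() if ok / tot >= 0.8}

# 2) Redundancy + majority vote over trusted workers only.
votes = defaultdict(Counter)
for w, q, a in responses:
    if w in trusted and q not in gold:
        votes[q][a] += 1
for q, counter in votes.items():
    answer, n = counter.most_common(1)[0]
    print(f"{q}: {answer} ({n}/{sum(counter.values())} trusted votes)")
```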

  49. Making Sense at Scale: data size is only part of the challenge; balance quality, cost, and time for a given problem. To address it, we must holistically integrate Algorithms, Machines, and People. amplab.cs.berkeley.edu
