
big data and the cloud: programming futures

Presentation Transcript


  1. big data and the cloud: programming futures • joe hellerstein

  2. roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

  3. roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

  4. In the Days of Kings and Priests • Computers and Data: Crown Jewels • Executives depend on computers • But cannot work with them directly • The DBA “Priesthood” • And their Acronymia: EDW, BI, OLAP • The “architected” EDW • “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse, 2005)

  5. New Realities • The quest for knowledge used to begin with grand theories. • Now it begins with massive amounts of data. • Welcome to the Petabyte Age. • TB disks < $100 • Everything is data • Rise of data-driven culture • Very publicly espoused by Google, Wired, etc. • Sloan Digital Sky Survey, Terraserver, etc.

  6. The New Practitioners • “Looking for a career where your services will be in high demand? … Provide a scarce, complementary service to something that is getting ubiquitous and cheap.” • “The sexy job in the next ten years will be statisticians.” • “So what’s ubiquitous and cheap? Data. And what is complementary to data? Analysis.” • Hal Varian, UC Berkeley, Chief Economist @ Google

  7. The New Practitioners • Aggressively Datavorous • Statistically savvy • Diverse in training, tools

  8. MAD Skills [Cohen, et al. VLDB 09] • Magnetic • attract data and practitioners • Agile • rapid iteration: ingest, analyze, productionalize • Deep • sophisticated analytics in Big Data

  9. Dev tools 4 analytics: reality • Current focus: engines/languages for scalable analytics • Scalable analytics algorithms are a small % of analyst’s life

  10. Dev tools 4 analytics: reality • Current focus: engines/languages for scalable analytics • Scalable analytics algorithms are a small % of analyst’s life • focus on Deep • not enough on Agile • not enough on Magnetic • dp: visualization, software development, data product management, collaboration/networking

  11. Analytics Coding Landscape • Single-node stat packages (R, Matlab, SAS, etc.) • domain-specific languages for linear algebra and statistics • diverse set of open libraries (e.g. CRAN library for R) • scalability limits: in-core, no parallelism • MapReduce ecosystem • Google, MS Dryad, Hadoop open source • low-level single-node coding (Java), easy data-parallelization • SQL-like convenience languages above (Hive, Pig) • emerging open analytics toolkits (Mahout, Pregel) • SQL + extensions (user-defined functions) • more powerful than many realize • declarative coding, easy data-parallelism • poor support for extension developers (varies by vendor) • emerging open analytics toolkits (MADlib, Hazy)

  12. Analytics Takeaways • little real dev difference between mapreduce and SQL • hadoop has more energetic dev tools development • SQL provides more breadth (of function, install base, HR) • lines are blurring • serious barrier: porting the R/SAS/Matlab ecosystem • will take a decade to develop data-parallel equivalent to CRAN • algorithmic challenge, not just a coding challenge • no shortcut here (at least for MMP) • in sum • analytics will be a “swiss army knife” approach for years to come • think portfolio. foster community, open libraries (MADlib/Mahout)
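
The "little real dev difference between mapreduce and SQL" point can be illustrated with a small sketch. This is an assumption-laden toy, not an example from the talk: the sample records, the `log` table, and the function names are all hypothetical, and the "MapReduce" side is a single-process imitation of the map and reduce phases, not Hadoop. It only shows that the same per-key aggregation reads much the same either way.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, bytes_sent) log records.
records = [("ann", 10), ("bob", 5), ("ann", 7), ("cat", 1), ("bob", 2)]

def map_phase(rows):
    """Emit one (key, value) pair per input record."""
    for name, nbytes in rows:
        yield name, nbytes

def reduce_phase(pairs):
    """Group pairs by key and sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def sql_totals(rows):
    """The same aggregation stated declaratively in SQL (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (name TEXT, nbytes INTEGER)")
    conn.executemany("INSERT INTO log VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT name, SUM(nbytes) FROM log GROUP BY name"))
    conn.close()
    return result

mapreduce_result = reduce_phase(map_phase(records))
sql_result = sql_totals(records)
```

Both results are the same per-name totals. The harder porting problem named on the slide, data-parallel equivalents of the R/SAS/Matlab libraries, is not touched by a sketch like this.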

  13. roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

  14. big systems c. 2011 • features: • data-centric • distributed • highly available • scalable/elastic • lots of new/custom code • programming is becoming hard² • (parallelism + asynchrony + failure) × (software engineering)

  15. root cause of hardness • order is pervasive in the von neumann model • state: an ordered array of cells • logic: an ordered array of instructions • terrible match for distributed systems

  16. typical solution: shared storage • distributed storage replaces RAM • imposes/enforces order • e.g. via transactions or other consistency mechanisms • shift: data-centric development • storage is not persistence — it is a programming model • this has always been true • the cloud makes it pervasive

  17. Dropping ACID? • early exposition: the “transaction concept” [Gray VLDB 1981] • many think distributed ACID transactions are infeasible today • cross-site transactions ⇒ coordination • ⇒ waiting • ⇒ queue buildups • ⇒ unpredictable problems • a major lesson of Internet companies: Brewer’s “CAP theorem” • though implications being revisited • by now, this lesson is kool-aid in the open source community…

  18. NoSQL • “not only SQL” • really not about SQL per se. • focus on two things: • distributed storage with “loose consistency”, not ACID. • data models that are simpler than SQL schemas • key/value stores, documents • i.e. similar to distributed memory! • examples • BigTable (Google), HBase (Yahoo/Hadoop), Cassandra (Facebook/DataStax), Sherpa (Yahoo), Dynamo (Amazon), Voldemort (LinkedIn), … • cloud services (AppEngine, Azure)

  19. Homework puzzle Given: • use storage layer for distributed coordination (order) • use NoSQL’s loose consistency for availability Q: how do programmers reason about order and correctness?

  20. Homework puzzle Given: • use storage layer for distributed coordination (order) • use NoSQL’s loose consistency for availability Q: how do programmers reason about order and correctness? A: very carefully.

  21. correctness? • ACID: general correctness via theoretical foundations • read/write: serializability • coordination/consensus • concerns: latency, availability • loose consistency: app-specific correctness via design maxims, semantic assertions, custom compensation • concerns: hard to trust, test

  22. the shift • [diagram: two stacks of application logic over system infrastructure; one rests on a theoretical foundation, the other on quicksand]

  23. a vacuum here • state of the art: each app reasons about consistency • e.g. by making use of a locking service (a la Apache Zookeeper) • e.g. by reasoning about “eventual consistency” of the storage system • this is, arguably, hard³ • (sweng) * (distribution) * (false abstractions) • don’t take my word for it • Gunawi’s FATE uncovered 16 fault-recovery bugs in Hadoop FS [NSDI ’11] • focus on storage systems • not enough on developers • CALM <~ bloom

  24. roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

  25. calm <~ bloom disorderly programming for distributed systems

  26. BOOM team ras bodik joe hellerstein peter alvaro neil conway bill marczak haryadi gunawi thibaud hottelier

  27. desire: best of both worlds • theoretical foundation for correctness under loose consistency • embodiment of theory in a programming framework • [diagram: application logic, theoretical foundation, system infrastructure, quicksand]

  28. our approach • disorderly programming • state: unordered collections • logic: unordered statements • implications • default: partitioning, concurrency • ordering (data, logic) explicit, special-case • but can this make ordering decisions simpler?
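
One way to picture "state: unordered collections, logic: unordered statements" is the sketch below. It is not Bloom code; it is a plain-Python toy in which state is a set of tuples and the "program" is a bag of rules that only add facts. The `fixpoint` helper and the `link`/`path` rules are hypothetical names chosen for illustration. The rules are applied in a different shuffled order on each run, and every run reaches the same result.

```python
import random

def fixpoint(facts, rules, seed):
    """Apply rules in a shuffled order until no rule adds a new fact."""
    rng = random.Random(seed)
    state = set(facts)
    changed = True
    while changed:
        changed = False
        order = list(rules)
        rng.shuffle(order)  # statement order is deliberately arbitrary
        for rule in order:
            new = rule(state) - state
            if new:
                state |= new
                changed = True
    return state

# State: an unordered set of ("link", a, b) and ("path", a, b) tuples.
def base_paths(state):
    """Every link is a path."""
    return {("path", a, b) for (kind, a, b) in state if kind == "link"}

def extend_paths(state):
    """A link followed by a path is a path."""
    links = [(a, b) for (kind, a, b) in state if kind == "link"]
    paths = [(a, b) for (kind, a, b) in state if kind == "path"]
    return {("path", a, c) for (a, b) in links for (b2, c) in paths if b == b2}

links = {("link", "a", "b"), ("link", "b", "c"), ("link", "c", "d")}
results = [fixpoint(links, [base_paths, extend_paths], seed) for seed in range(5)]
```

Because both rules only accumulate facts, the order in which they fire cannot change the final set. That is the property the next slides name monotonicity.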

  29. progress • CALM consistency (maxims ⇒ theorems) • Bloom language (theorems ⇒ programming)

  30. CALM

  31. monotonicity • monotonic code: info accumulation • the more you know, the more you know • e.g. map, filter, join • non-monotonic code: belief revision • new inputs can change your mind; need to “seal” input • e.g. counts, state update
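
A minimal sketch of the distinction, using the slide's own examples (a filter versus a count). The stream contents and function names are made up for illustration. The filter's output only grows as input arrives, so early answers stay valid; the count's early answers are each replaced by a later one.

```python
def evens(seen):
    """Monotonic: a filter. Its output can only grow as the input grows."""
    return {x for x in seen if x % 2 == 0}

def count(seen):
    """Non-monotonic: each new input replaces the previous answer."""
    return len(seen)

stream = [4, 1, 6, 3]
seen = set()
filter_outputs, count_outputs = [], []
for item in stream:
    seen.add(item)
    filter_outputs.append(evens(seen))
    count_outputs.append(count(seen))

# Each filter output is a subset of the next one: nothing is ever retracted.
filter_only_grows = all(a <= b for a, b in zip(filter_outputs, filter_outputs[1:]))
```

A consumer that acted on an early filter output was never wrong, only incomplete. A consumer that acted on an early count was wrong as soon as more input arrived, which is why the input has to be "sealed" first.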

  32. an aside gamblers?

  33. intuition • counting requires waiting

  34. intuition • counting requires waiting • waiting requires counting
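
Both halves of this intuition can be shown in one toy. The `SealedCounter` class and the producer names are hypothetical, not from the talk. The counter refuses to answer until its input is complete (counting requires waiting), and it decides that the input is complete by tracking which producers have reported that they are done (waiting requires counting).

```python
class SealedCounter:
    """Releases a count only after every expected producer has sealed its input."""

    def __init__(self, expected_producers):
        self.expected = set(expected_producers)
        self.sealed = set()
        self.items = set()

    def receive(self, producer, item):
        self.items.add((producer, item))

    def seal(self, producer):
        # "waiting requires counting": track which producers are finished.
        self.sealed.add(producer)

    def result(self):
        # "counting requires waiting": no answer until all input is sealed.
        if not self.expected <= self.sealed:
            return None
        return len(self.items)

counter = SealedCounter({"p1", "p2"})
counter.receive("p1", "x")
counter.seal("p1")
early = counter.result()   # p2 has not sealed yet, so there is no answer

counter.receive("p2", "y")
counter.receive("p2", "z")
counter.seal("p2")
final = counter.result()   # all producers sealed, the count can be released
```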

  35. CALM Theorem • CALM: consistency as logical monotonicity • monotonic code ⇒ eventually consistent • non-monotonic ⇒ coordinate only at non-monotonic points of order • conjectures at PODS 2010 conference [Hellerstein, SIGMOD Record 2010] • formulations and theorems in 2011 [Ameloot, et al., PODS 2011]
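
The "monotonic code ⇒ eventually consistent" direction can be checked by brute force on a tiny case. This is an illustration only, not a proof and not the formal model of the cited papers: each permutation of the message list stands in for one replica's delivery order, and the two replica functions are hypothetical.

```python
from itertools import permutations

messages = ["a", "b", "c"]

def replica_union(delivery_order):
    """Monotonic: state is a set that only accumulates."""
    state = set()
    for m in delivery_order:
        state.add(m)
    return frozenset(state)

def replica_overwrite(delivery_order):
    """Non-monotonic: each message replaces the previous state."""
    state = None
    for m in delivery_order:
        state = m
    return state

# Final states reached across every possible delivery order.
union_outcomes = {replica_union(order) for order in permutations(messages)}
overwrite_outcomes = {replica_overwrite(order) for order in permutations(messages)}
```

The accumulating replica ends in a single state whatever the delivery order, so replicas agree without coordinating. The overwriting replica ends in a different state depending on which message arrived last, so agreement needs an imposed order at that point.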

  36. practical implications • compiler can identify non-monotonic “points of order” • inject coordination code • or mark uncoordinated results as “tainted” • compiler can help programmer think about coordination costs • easy to do this with the right language…
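
A rough sketch of what such an analysis might look like, under heavy assumptions: the operator classification is taken from the examples on the monotonicity slide, the dataflow is invented, and none of this is Bloom's actual analysis. The function reports the non-monotonic steps as points of order and, assuming no coordination is injected there, marks their outputs and everything downstream as tainted.

```python
# Assumed classification, based on the slide's examples; not Bloom's real analysis.
MONOTONIC = {"map", "filter", "join", "union"}
NON_MONOTONIC = {"count", "update"}

# Hypothetical dataflow: (output collection, operator, inputs), in dependency order.
program = [
    ("clicks_ok", "filter", ["clicks"]),
    ("by_user", "join", ["clicks_ok", "users"]),
    ("totals", "count", ["by_user"]),
    ("report", "map", ["totals"]),
]

def analyze(steps):
    """Return (points_of_order, tainted) for a dataflow listed in dependency order."""
    points_of_order, tainted = [], set()
    for out, op, inputs in steps:
        if op in NON_MONOTONIC:
            # A point of order: coordinate here, or treat the output as tainted.
            points_of_order.append(out)
            tainted.add(out)
        elif op not in MONOTONIC:
            raise ValueError(f"unknown operator: {op}")
        elif any(i in tainted for i in inputs):
            # Monotonic step, but it consumes an uncoordinated result.
            tainted.add(out)
    return points_of_order, tainted

points, tainted = analyze(program)
```

Here the only point of order is `totals`, and `report` inherits its taint. The filter and join upstream need no coordination at all, which is where the cost savings would come from.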

  37. <~ bloom

  38. background: BOOM Analytics • 2005-2010: designed a distributed logic language called Overlog • 2009-2010: rebuilt Hadoop File System and scheduler in Overlog • no kidding – API-compatible with Hadoop, comparable performance • win 1: Orders Of Magnitude smaller, 4 person-months dev time • win 2 (more important): evolvability • fixed HDFS single point of failure via Paxos-in-Overlog (6 person-weeks) • fixed HDFS scaling limits via state partitioning (1 day!) [Alvaro et al., Eurosys 2010]

  39. we became greedy for more • time to build a language for real programmers. approach: • craft a disorderly DSL for distributed systems • embed in popular host languages. (I chose ruby first.) • embody the CALM theorem in programmer tools • identify points of order in code • synthesize coordination logic, or inject “taint” tracking • high-level analysis/debuggers to pinpoint tricky ordering issues <~ bloom

  40. bud (bloom under development) • bloom embedded as a DSL in ruby • domain-specific code analysis tools • alpha released April, 2011 at http://bloom-lang.net • goodies • code analysis tools • library/example sandbox • EC2 deployment utilities • % gem install bud

  41. http://bloom-lang.net

  42. classic example: shopping cart • replicated, a la Amazon Dynamo • challenge: guarantee eventual consistency of replicas • maxim: use commutative operations • easier said than done! • Bloom/CALM paper shows compiler analysis (i.e. proofs) of the design maxims for correctness, efficiency [Alvaro, et al. CIDR 2011]
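
The "use commutative operations" maxim can be sketched as follows. This is a plain-Python toy, not the Bloom program analyzed in the CIDR 2011 paper; the `CartReplica` class, the action-id scheme, and the items are assumptions made for illustration. Each replica keeps a grow-only set of cart actions, replicas merge by set union, and only checkout (a sum over all actions) is non-monotonic.

```python
from collections import Counter

class CartReplica:
    """Cart state is a grow-only set of actions; merging replicas is set union."""

    def __init__(self):
        self.actions = set()

    def apply(self, action_id, item, delta):
        # The action id makes retried or duplicated deliveries harmless.
        self.actions.add((action_id, item, delta))

    def merge(self, other):
        self.actions |= other.actions

    def checkout(self):
        # Non-monotonic: the sum is final only once every action has arrived.
        totals = Counter()
        for _, item, delta in self.actions:
            totals[item] += delta
        return {item: qty for item, qty in totals.items() if qty > 0}

r1, r2 = CartReplica(), CartReplica()
r1.apply(1, "book", +1)
r2.apply(2, "pen", +2)
r2.apply(3, "book", +1)
r1.apply(4, "pen", -2)   # a removal is recorded as one more commutative action
r1.apply(1, "book", +1)  # duplicate delivery, absorbed by the set

r1.merge(r2)
r2.merge(r1)
```

After the exchange both replicas hold the same action set and compute the same cart. The remaining question, when it is safe to run `checkout`, is the point of order where coordination is still needed.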

  43. conclusion • CALM theorem • what is coordination for? non-monotonicity. • pinpoint non-monotonic points of order • coordination or taint tracking • Bloom • declarative, disorderly DSL for distributed programming • bud: organic Ruby embedding • CALM analysis of monotonicity • synthesize coordination/compensation • released to the dev community this spring • “friends-and-family” alpha at http://bloom-lang.net

  44. influence propagation…? • Technology Review TR10 2010: • “The question that we ask is simple: is the technology likely to change the world?” • Fortune Magazine 2010 Top in Tech: • “Some of our choices may surprise you.” • Twittersphere: • “Read this. Read this now.”

  45. more? • http://bloom-lang.net • http://boom.cs.berkeley.edu • thanks to: Microsoft Research, Yahoo! Research, IBM Research, NSF, AFOSR • Consensus in Logic [Alvaro, et al. NetDB 2009] • BOOM Analytics [Alvaro, et al., Eurosys 2010] • Declarative Imperative [Hellerstein, SIGMOD Record 3/2010] • CALM + Bloom [Alvaro, et al. CIDR 2011]

  46. roadmap • status report • analytics • scalable systems • research • calm <~ bloom • dp

  47. dp = datapeople • facilitating interactions between people and data throughout the analytic lifecycle • http://deepresearch.org

  48. dp • Jeff Heer (Stanford) • Joe Hellerstein (Berkeley) • Tapan Parikh (Berkeley) • Maneesh Agrawala (Berkeley) • Sean Kandel • Diana MacLean • Ravi Parikh • Kuang Chen • Nicholas Kong • Wesley Willett

  49. dp • wrangler: intelligent data xformation • commentspace: social data analysis • usher/shreddr: first-mile data entry • socialflows: mining, visualizing & browsing email • madlib: parallel in-database analytics

  50. Remember the missing pieces! • visualization • software development • data product management • collaboration/networking
