
There Is No Big Data* (*Unless You Are Big Brother or Big Tech)

Zachary G. Ives, University of Pennsylvania and Inc. (visiting for 2 more weeks)


Presentation Transcript


  1. There Is No Big Data* (*Unless You Are Big Brother or Big Tech)
     Zachary G. Ives, University of Pennsylvania and Inc. (visiting for 2 more weeks)

  2. We’ve All Heard the Story…
     • Google has multi-PB to EB of data
     • Facebook: 10PB data warehouse (Parikh keynote)
     • Walmart: 500TB data warehouse in 2004
     • “Data is becoming so huge, we in academia need to invent new BIG DATA capabilities!”
     • … Marketers gave us a guide: ~4 “V”s
     • … We latched onto volume + velocity

  3. But Wait: Google Has Jeff Dean etc.! Why Do They Need Us to Handle Scale?
     Big Tech companies are solving scale themselves, and leading the way. They have real data, real workloads, real $$, real machines.
     MapReduce, Pregel, F1, MillWheel, Puma, Presto, …

  4. Worse: The Problem Is Usually Not Too Much Data to Handle…
     Rowstron+ 12: even Big Tech data isn’t always BIG
     • Analytics clusters @ Microsoft: median job < 14GB
     • Median Facebook job < 100GB
     What about academia, science, or “medium tech” data?
     • A genome (3G bases)
     • A giant Twitter crawl (284M edges, 53M entities)
     • Wikipedia with all history and languages (30M pages)
     Each of these is single-server-sized… But they are not what we want to look at: most of the data we want to process needs complementary data we don’t own!

  5. Big Data Is a Product, Not a Source
     Current focus: BIG?
     • Not just “variety” but proprietary “little data”:
       • Specialized science data
       • Individual observations
       • E-commerce data
       • …

  6. “Growing” Small Data into BIG DATA
     Many issues in “big data integration”:
     • How do we exploit existing knowledge?
     • How do we take advantage of scale, of user populations, and of history?
     But: how do we convince “small data” owners they WANT to be BIG DATA?
     • Need to measure the impact of data: the next step beyond provenance, responsibility, …
     • Need to incentivize contributions: credit, badges, h-index, $$$, …
     • Need user DRM or DUAs, not just corporate DRM, EULAs, and SLAs
     “I am getting bigger and bigger”
