On the role of Interactivity and Data Placement in Big Data Analytics

Srini Parthasarathy, OSU. The Data Deluge: Data, Data Everywhere. Data storage is cheap: $600 buys a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11].

Presentation Transcript


  1. On the role of Interactivity and Data Placement in Big Data Analytics Srini Parthasarathy OSU

  2. The Data Deluge: Data Data Everywhere

  3. Data Storage is Cheap: $600 to buy a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11]

  4. Data does not exist in isolation.

  5. Data almost always exists in connection with other data – integral part of the value proposition.

  6. Social networks • Protein interactions • Internet • Neighborhood graphs • Data dependencies • VLSI networks

  7. Big Data Problem: All this data is only useful if we can scalably extract useful knowledge from such complex data

  8. THIS TALK • THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS • THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS

  9. GLOBAL GRAPHS

  10. GLOBAL GRAPHS • What? • System for deploying applications processing complex data • Why? • Seeks balance between high productivity and high performance • How? • Built on top of PNL’s GlobalArrays • Trees (GlobalTrees, GlobalForests) • Relational Arrays (ArrayDB-GA) • Graphs (GlobalGraphs) • Data Placement is key to high performance

  11. Importance of Data Placement • Locality • Placing related items close to each other so they may be processed together • Mitigating Impact of Data Skew • Reducing load imbalance in a parallel setting • Reducing variance in partition samples • Generating Stratified Samples • Improving interactive performance

  12. Key Ideas • Pivotization • Convert data with complex structure into sets • Each element of set captures features of local topology • Hashing into Strata: Hash related sets into similar bins • Can employ a sketch-clustering algorithm • Partitioning: Place Strata into partitions for • Locality • Mitigating Data Skew • Samples
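The pivotization and hashing steps above can be illustrated with a minimal Python sketch. It is not the GlobalGraphs implementation: the names `stable_hash`, `minhash_sketch`, and `group_into_strata`, the string-encoded pivot elements, and the choice of four hash functions are all assumptions made for illustration. Grouping by exact sketch equality is a crude stand-in for the sketch-clustering step, and each pivot set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict


def stable_hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(pivot_set, num_hashes=4):
    """Min-wise hashing: keep the minimum hash value for each seed.

    Sets with high Jaccard similarity agree on many sketch positions.
    Assumes pivot_set is non-empty.
    """
    return tuple(
        min(stable_hash(x, seed) for x in pivot_set)
        for seed in range(num_hashes)
    )


def group_into_strata(pivot_sets, num_hashes=4):
    """Group records whose sketches match exactly (hypothetical simplification)."""
    strata = defaultdict(list)
    for record_id, pivot_set in pivot_sets.items():
        strata[minhash_sketch(pivot_set, num_hashes)].append(record_id)
    return dict(strata)


# Hypothetical pivot sets: each element stands for a local-topology feature.
pivot_sets = {
    "d1": {"A-B", "A-C", "B-L"},
    "d2": {"A-B", "A-C", "B-L"},
    "d3": {"E-F", "E-C", "F-L"},
}
strata = group_into_strata(pivot_sets)
```

Records with identical pivot sets always land in the same stratum. A real system would likely relax the exact-match rule so that near-duplicate sketches are also grouped together.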

  13. [Figure: data placement pipeline. DATA (Δ) → PIVOT TRANSFORMATIONS → PIVOT SETS (PS) → MINWISE HASHING on PIVOT SETS → SKETCHES (SK) → SKETCHSORT or SKETCHCLUSTER → Strata (S) → PARTITIONING & REPLICATION → partitions (P-1 to P-8). Example sketches: Δ1 → SK-1 = {1050, 2020, 3130, 1800}; Δ25 → SK-25 = {1050, 2020, 7225, 2020}. Strata run from S-1 to S-128.]
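The final partitioning stage can serve different goals, and a small Python sketch may help contrast two of them. Both functions below are hypothetical illustrations rather than the talk's actual placement policy: `place_for_locality` keeps each stratum whole and balances load greedily, while `place_for_stratified_samples` deals members round-robin so that every partition receives a share of every stratum. Replication is not modeled.

```python
import heapq


def place_for_locality(strata, num_partitions):
    """Keep each stratum whole; give the largest strata to the lightest partition."""
    heap = [(0, p) for p in range(num_partitions)]  # (current load, partition id)
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]
    by_size = sorted(strata.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _name, members in by_size:
        load, p = heapq.heappop(heap)
        partitions[p].extend(members)
        heapq.heappush(heap, (load + len(members), p))
    return partitions


def place_for_stratified_samples(strata, num_partitions):
    """Deal members round-robin so each partition samples every stratum."""
    partitions = [[] for _ in range(num_partitions)]
    next_slot = 0
    for name in sorted(strata):
        for member in strata[name]:
            partitions[next_slot % num_partitions].append(member)
            next_slot += 1
    return partitions


# Hypothetical strata with skewed sizes.
strata = {
    "S-1": [1, 2, 3, 4, 5, 6],
    "S-2": [7, 8, 9],
    "S-3": [10, 11, 12],
}
local = place_for_locality(strata, 2)
stratified = place_for_stratified_samples(strata, 2)
```

The first layout favors locality, since related items sit together. The second reduces variance between partitions, which matters when each partition is treated as a sample of the whole data set.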

  14. Frequent Tree Mining • Our proposed approach shows 100X gains

  15. WebGraph Compression • Linear Scaleup with no loss in compression ratio

  16. PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data

  17. Visualization and Interactivity are key to discovery

  18. PRISM-HD • What? • A novel mechanism for exploring complex data • Why? • User is often overwhelmed by the characteristics of the data • Befuddled about where to start • How? • Given a similarity measure of interest • Compute the similarity graph at threshold (t) • Key: Graphs are dimensionless • Provide the user with graph visualization cues • User determines the next threshold and repeats
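The threshold loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Jaccard similarity stands in for the user's measure of interest, the all-pairs comparison is brute force, and the "cues" are reduced to edge and connected-component counts. The names `jaccard`, `similarity_graph`, and `count_components` are hypothetical and are not PRISM-HD's API.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure of interest)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(items, threshold, similarity=jaccard):
    """Return edges (u, v) whose similarity is at least the threshold."""
    return [
        (u, v)
        for u, v in combinations(sorted(items), 2)
        if similarity(items[u], items[v]) >= threshold
    ]


def count_components(nodes, edges):
    """Count connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


# Hypothetical data: each item is a set of features.
items = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {1, 2, 6, 7},
    "d": {8, 9},
}

# The user inspects the cues at one threshold, then picks the next one.
for t in (0.9, 0.5, 0.2):
    edges = similarity_graph(items, t)
    print(t, len(edges), count_components(items, edges))
```

Lowering the threshold adds edges and merges components, which is the kind of change a user would watch for when deciding where to probe next. At scale, the brute-force pair loop would need to be replaced by sketching or incremental estimation.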

  19. [Figure panels: HIGH THRESHOLD • MODERATE THRESHOLD • LOW THRESHOLD]

  20. Benefits of Knowledge Caching

  21. Benefits of Incremental Processing on Twitter • Incremental estimates on Twitter, t1 = 0.95

  22. PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response

  23. Concluding Remarks • Data is everywhere • Data is fraught with complexities • Dimensionality, dynamics, structure, massive… • Both data placement and data interactivity have an important role to play in big data analytics • PRISM-HD and GlobalGraphs can help!

  24. Thanks for your attention • Contact: srini@cse.ohio-state.edu • Acknowledgements: Various NSF, NIH, DOE and industry grants • [Figure panels: Mining Simulation Data, Medical Image Analysis, Protein Interaction Network (yeast)]
