On the Role of Interactivity and Data Placement in Big Data Analytics. Srini Parthasarathy, OSU. The Data Deluge: Data, Data Everywhere. Data storage is cheap: $600 buys a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11].
On the role of Interactivity and Data Placement in Big Data Analytics Srini Parthasarathy OSU
Data Storage is Cheap • $600 buys a disk drive that can store all of the world’s music [McKinsey Global Institute Special Report, June ’11]
Data almost always exists in connection with other data, and those connections are an integral part of the value proposition.
Social networks • Protein Interactions • Internet • Neighborhood graphs • Data dependencies • VLSI networks
Big Data Problem: All this data is only useful if we can scalably extract useful knowledge from such complex data
THIS TALK • THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS • THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS
GLOBAL GRAPHS • What? • System for deploying applications processing complex data • Why? • Seeks balance between high productivity and high performance • How? • Built on top of PNL’s GlobalArrays • Trees (GlobalTrees, GlobalForests) • Relational Arrays (ArrayDB-GA) • Graphs (GlobalGraphs) • Data Placement is key to high performance
Importance of Data Placement • Locality • Placing related items close to each other so they may be processed together • Mitigating Impact of Data Skew • Reducing load imbalance in a parallel setting • Reducing variance in partition samples • Generating Stratified Samples • Improving interactive performance
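To make the load-imbalance point concrete, here is a minimal sketch of one common way to quantify skew across partitions. The metric (largest partition divided by the mean partition size) and the function name `load_imbalance` are assumptions for illustration; the slides do not say which measure was used.

```python
def load_imbalance(partition_sizes):
    """Ratio of the largest partition to the mean partition size.

    1.0 means perfectly balanced. In a parallel job the slowest
    (largest) partition tends to set the finish time, so a ratio of
    2.5 suggests roughly 2.5x the ideal runtime.
    NOTE: this metric is an illustrative assumption, not from the talk.
    """
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean


balanced = load_imbalance([4, 4, 4, 4])
skewed = load_imbalance([10, 2, 2, 2])
print(balanced, skewed)
```

A placement scheme that mitigates skew aims to push this ratio toward 1.0 while still keeping related items together.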
Key Ideas • Pivotization • Convert data with complex structure into sets • Each element of set captures features of local topology • Hashing into Strata: Hash related sets into similar bins • Can employ a sketch-clustering algorithm • Partitioning: Place Strata into partitions for • Locality • Mitigating Data Skew • Samples
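The pivotization and hashing steps can be illustrated with a small sketch. It assumes a graph stored as an adjacency dictionary and uses each node's neighbor set as its pivot set; that choice, the helper names, and the use of exact sketch equality to form strata are all simplifications for illustration. The talk mentions a sketch-clustering algorithm for grouping similar sketches, which is not implemented here.

```python
import hashlib
from collections import defaultdict


def pivot_set(adjacency, node):
    """Hypothetical pivot transformation: a node's pivot set is its neighbor set."""
    return frozenset(adjacency[node])


def _seeded_hash(seed, item):
    # Deterministic 64-bit hash; each seed acts as a separate hash function.
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(items, num_hashes=4):
    """Minwise sketch: keep the minimum hash value under each seeded hash."""
    return tuple(
        min((_seeded_hash(seed, x) for x in items), default=0)
        for seed in range(num_hashes)
    )


def build_strata(adjacency, num_hashes=4):
    """Group nodes whose pivot sets produce identical sketches (a simplification)."""
    strata = defaultdict(list)
    for node in sorted(adjacency):
        sketch = minhash_sketch(pivot_set(adjacency, node), num_hashes)
        strata[sketch].append(node)
    return dict(strata)


adjacency = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"z"}}
for sketch, members in build_strata(adjacency).items():
    print(members)
```

Nodes `a` and `b` share a neighbor set, so they receive the same sketch and land in the same stratum. With real data, near-identical pivot sets agree on most sketch positions rather than all of them, which is where a sketch-sorting or sketch-clustering step would take over.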
[Figure: pipeline overview. DATA (Δ) → PIVOT TRANSFORMATIONS → PIVOT SETS (PS) → MINWISE HASHING on PIVOT SETS → SKETCHES (SK) → SKETCHSORT or SKETCHCLUSTER → Strata (S) → PARTITIONING & REPLICATION → partitions (P-1 through P-8). Example values shown: SK-1 = {1050, 2020, 3130, 1800} for Δ1 and SK-25 = {1050, 2020, 7225, 2020}; stratum S-4 holds (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25) and appears in both P-2 and P-8.]
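The final step, placing strata into partitions, can be sketched in two ways that match the stated goals. Both functions are illustrative assumptions rather than the system's actual placement policy: one keeps each stratum whole for locality, the other spreads each stratum across partitions so that every partition resembles a stratified sample. Replication is not modeled.

```python
def partition_for_locality(strata, num_partitions):
    """Keep each stratum whole; greedily give it to the lightest partition."""
    partitions = [[] for _ in range(num_partitions)]
    # Largest strata first, so small ones can fill the remaining gaps.
    for members in sorted(strata.values(), key=len, reverse=True):
        lightest = min(partitions, key=len)
        lightest.extend(members)
    return partitions


def partition_stratified(strata, num_partitions):
    """Deal items round-robin, stratum by stratum, to spread each stratum out."""
    partitions = [[] for _ in range(num_partitions)]
    slot = 0
    for key in sorted(strata):
        for item in strata[key]:
            partitions[slot % num_partitions].append(item)
            slot += 1
    return partitions


# Hypothetical strata: stratum id -> items hashed into it.
strata = {"S-1": [1, 2, 3, 4], "S-2": [5, 6], "S-3": [7, 8]}
print(partition_for_locality(strata, 2))
print(partition_stratified(strata, 2))
```

The locality variant favors processing related items together, while the stratified variant trades locality for lower variance between partitions, which is the tension the slides describe between locality and skew mitigation.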
Frequent Tree Mining • Our proposed approach shows 100X gains
WebGraph Compression • Linear Scaleup with no loss in compression ratio
PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
PRISM-HD • What? • A novel mechanism for exploring complex data • Why? • User is often overwhelmed by the characteristics of the data • Befuddled on where to start • How? • Given a similarity measure of interest • Compute similarity graph at threshold (t) • Key: Graphs are dimensionless • Provide user graph visualization cues • User determines next threshold and repeats
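The threshold loop can be sketched as follows. This is a brute-force illustration, not the PRISM-HD implementation: Jaccard similarity stands in for the user's similarity measure of interest, every pair is compared (quadratic in the number of records), and the "cues" are reduced to edge count and density. All names here are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, for illustration only)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(records, t, sim=jaccard):
    """Edges between all record pairs whose similarity is at least t."""
    keys = sorted(records)
    return [
        (i, j)
        for i, j in combinations(keys, 2)
        if sim(records[i], records[j]) >= t
    ]


def graph_cues(records, t):
    """Simple summary a user might inspect before picking the next threshold."""
    edges = similarity_graph(records, t)
    n = len(records)
    possible = n * (n - 1) / 2
    density = len(edges) / possible if possible else 0.0
    return {"threshold": t, "edges": len(edges), "density": density}


records = {"r1": {1, 2, 3}, "r2": {1, 2, 3, 4}, "r3": {3, 4, 5}, "r4": {9}}
for t in (0.7, 0.3, 0.1):  # high, moderate, low threshold
    print(graph_cues(records, t))
```

Lowering the threshold adds edges, so the user sees the graph go from sparse to dense and can choose the next threshold accordingly. Because the graph depends only on pairwise similarities, the same loop applies regardless of how many dimensions the underlying records have.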
[Figure: similarity graphs at HIGH THRESHOLD, MODERATE THRESHOLD, and LOW THRESHOLD]
Benefits of Incremental Processing on Twitter • Incremental estimates on Twitter (t1 = 0.95)
PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response
Concluding Remarks • Data is everywhere • Data is fraught with complexities • Dimensionality, dynamics, structure, massive… • Both data placement and data interactivity have an important role to play in big data analytics • PRISM-HD and GlobalGraphs can help!
Mining Simulation Data • Medical Image Analysis • Protein Interaction Network (yeast) • Thanks for your attention • Contact: srini@cse.ohio-state.edu • Acknowledgements: Various NSF, NIH, DOE and industry grants