
Software and Hardware Requirements for Next-Generation Data Analytics

John Feo, Center for Adaptive Supercomputing Software, Pacific Northwest National Laboratory. October 2010.





Presentation Transcript


  1. Software and Hardware Requirements for Next-Generation Data Analytics. John Feo, Center for Adaptive Supercomputing Software, Pacific Northwest National Laboratory. October 2010.

  2. Graphs are everywhere in science
     • Astrophysics. Problem: outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching.
     • Bioinformatics. Problem: identifying drug target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
     • Social informatics. Problem: discovering emergent communities, modeling the spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows.

  3. … and in commerce
     1000x growth in 3 years!
     • has more than 300 million active users
     • Sample queries:
       • Allegiance switching: identify entities that switch communities
       • Community structure: identify the genesis and dissipation of communities
       • Phase change: identify significant changes in the network structure
       • Thought leaders: identify influential individuals who drive events
     • Graph features:
       • Topology: the interaction graph is low-diameter and has no good separators
       • Irregularity: communities are not uniform in size
       • Overlap: individuals are members of one or more communities

  4. Small-world and scale-free
     “Six degrees of separation”
     • Scale-free (power-law):
       • difficult to partition/load-balance
       • work concentrates in a few nodes
     • Low diameter (small-world):
       • work explodes
       • difficult to partition/load-balance
       • high % of nodes are visited quickly
     [Figure: RMAT graph with a million vertices]
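A minimal sketch of how an R-MAT generator produces the skewed degree distribution this slide describes. The quadrant probabilities (0.57, 0.19, 0.19, 0.05) are commonly used defaults and are an assumption here, not values taken from the presentation; the names `rmat_edge` and `rmat_graph` are hypothetical.

```python
import random
from collections import Counter

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random):
    """Pick one directed edge in a graph of 2**scale vertices.

    At each of `scale` levels the adjacency matrix is split into four
    quadrants and one is chosen with probabilities a, b, c, d = 1-a-b-c.
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass                # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1            # top-right quadrant
        elif r < a + b + c:
            src |= 1            # bottom-left quadrant
        else:
            src |= 1            # bottom-right quadrant
            dst |= 1
    return src, dst

def rmat_graph(scale, num_edges, seed=0):
    """Generate num_edges edges (duplicates and self-loops are kept)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(num_edges)]

edges = rmat_graph(scale=10, num_edges=8192)
out_degree = Counter(src for src, _ in edges)
top = sum(d for _, d in out_degree.most_common(10))
print(f"10 busiest of {1 << 10} vertices own {top / len(edges):.1%} of the edges")
```

Because the top-left quadrant is favoured at every level, a handful of low-numbered vertices collect a large share of the edges, which is the "work concentrates in a few nodes" effect.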

  5. Grids, Erdős–Rényi, and Scale-Free Graphs
     [Figure: communication traces from execution of ½-approx weighted matching (data distributed using Metis). Panels: USA Roadmap, Erdős–Rényi, Scale-Free]
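The presentation does not say which ½-approximation matching algorithm produced these traces. As an illustration only, the sketch below uses the classic sequential greedy algorithm, which is known to reach at least half the optimal weight; the function name `greedy_matching` and the sample graph are hypothetical.

```python
def greedy_matching(edges):
    """Greedy 1/2-approximation for maximum weight matching.

    edges: iterable of (u, v, weight) tuples.
    Returns (list of chosen edges, total weight).
    """
    matched = set()      # vertices already covered by a chosen edge
    matching = []
    total = 0.0
    # Visit edges from heaviest to lightest; keep an edge only if both
    # endpoints are still free.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v, w))
            total += w
    return matching, total

# Path a-b-c-d with a heavy middle edge: greedy takes b-c (weight 3),
# while the optimum takes a-b and c-d (weight 4).
path = [("a", "b", 2.0), ("b", "c", 3.0), ("c", "d", 2.0)]
matching, total = greedy_matching(path)
print(matching, total)   # [('b', 'c', 3.0)] 3.0
```

The example shows both sides of the guarantee: greedy misses the optimum of 4.0 but stays above half of it.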

  6. Challenges
     • Problem size
       • Ton of bytes, not ton of flops
     • Little data locality
       • Have only parallelism to tolerate latencies
     • Low computation to communication ratio
       • Single word access
       • Threads limited by loads and stores
     • Synchronization points are simple elements
       • Node, edge, record
     • Work tends to be dynamic and imbalanced
       • Let any processor execute any thread

  7. System requirements (Cray XMT)
     • Global shared memory
       • No simple data partitions
       • Local storage for thread private data
     • Network support for single word accesses
       • Transfer multiple words when locality exists
     • Multi-threaded processors
       • Hide latency with parallelism
       • Single cycle context switching
       • Multiple outstanding loads and stores per thread
     • Full-and-empty bits
       • Efficient synchronization
       • Wait in memory
     • Message driven operations
       • Dynamic work queues
       • Hardware support for thread migration
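A minimal software emulation of the full-and-empty bit idea, to show the semantics only. On the hardware the slide describes, the bit lives with each memory word and a blocked thread waits in memory; here a `threading.Condition` stands in for that. The class `FullEmptyWord` and its method names are hypothetical and are not the Cray XMT API.

```python
import threading

class FullEmptyWord:
    """One word of memory tagged with a full/empty bit (emulated)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        """Block until the word is empty, store value, mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_and_empty(self):
        """Block until the word is full, return its value, mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value

word = FullEmptyWord()
received = []

def consumer(n):
    for _ in range(n):
        received.append(word.read_and_empty())

t = threading.Thread(target=consumer, args=(5,))
t.start()
for i in range(5):
    word.write_when_empty(i)   # producer waits until the last value was read
t.join()
print(received)   # [0, 1, 2, 3, 4]
```

The producer and consumer hand off one value at a time through a single word, with no separate lock or queue object visible to the caller; that is the kind of fine-grain synchronization on a node, edge, or record that the slide asks for.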

  8. Center for Adaptive Supercomputing Software
     Driving development of next-generation multithreaded architectures and methods for irregular problems
     Sponsored by DOD
     [Diagram labels: Commerce, Data Analytics, Data, Scientific Simulations, Science, Knowledge Discovery, Internet, Policy, Trend Analysis, Sensor Networks, Databases]

  9. Partners

  10. Analytic methods and applications
      • Semantic Web
      • FaceBook (300 M users)
      • People, Places, & Actions
      • Community Activities
      • Security
      • National Security
      • SmartGrid
      • Blog Analysis
      • Anomaly detection
      • Connect-the-dots
      • N-x contingency analysis
      • Community thought leaders
      [Figure labels: Train, Anthrax, Bus, Money, Endo, Hayashi, Zaire]

  11. Research focus areas
      • Applications: BioInformatics, Computer Security, Sensor Networks, SmartGrid
      • Methods: Bayesian networks, Social networks, Mesh generation, MapReduce, Clustering, Semantic Databases, N-x contingency analysis
      • Languages and Runtime System: Chapel for hybrid systems, Performance analysis and tools, Compiler and runtime system, Communication software for hybrid systems
      • Architecture: Next generation multithreaded architectures

  12. Methods for data analytics
      Influential Factors
      • Degree distribution
        • Normal
        • Scale-free
      • Planar or non-planar
      • Static or dynamic
      • Weighted or unweighted
        • Weight distribution
      • Typed or untyped edges
      Callouts: load imbalance; non-planar, difficult to partition; concurrent inserts and deletions
      Methods
      • Paths
        • Shortest path
        • Betweenness
        • Min/max flow
      • Structures
        • Spanning trees
        • Connected components
        • Graph isomorphism
      • Groups
        • Matching/Coloring
        • Partitioning
        • Equivalence

  13. Systems for large-scale analytics
      • Netezza TwinFin
      • Cray XMT
      • Graph resides in XMT memory
      • RDBS runs on cluster

  14. Dynamic Bayesian Network Model for Atmospheric Sensor Network Validation
      [Figure: Bayesian network over the variables vap, sky ir temp, tbsky 31, wspd_va, precip-tbrg, radar7, radar13, radar19, and percent_opaque, drawn for three time steps]
      • Replicate per time step
      • Add dependencies across time steps (not shown)

  15. DBN to Junction Tree Conversion
      [Figure: Bayesian network over the variables vap, sky ir temp, tbsky 31, wspd_va, precip-tbrg, radar7, radar13, radar19, and percent_opaque, drawn for three time steps]
      • Convert the dynamic Bayesian network to a junction tree for inferencing
      • Each node in the junction tree is a clique, or super node, containing several nodes from the original Bayesian network
      • Junction-tree-based “Evidence Propagation” is an efficient method of propagating the effect of any variable’s state to every other variable in the BN

  16. Evidence Propagation is highly irregular
      • Small systems have 100s of millions of nodes
      • Compute per node is unbalanced
      • Degree per node is irregular
      • Data moves up and down
      • Loop parallelism intra-node
      • Task parallelism inter-node (recursion, futures)
      • Data flow scheduling
      • Data synchronization
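A minimal sketch of the inter-node task parallelism with futures that this slide mentions, showing only the upward ("collect") half of "data moves up and down". It is not real junction-tree arithmetic: each node's message is simply its local value plus its children's messages, a stand-in assumption for clique marginalisation. The function `collect` and the sample tree are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def collect(tree, local, root, workers=4):
    """Upward pass over a tree, one future per node.

    tree:  dict mapping a node to the list of its children
    local: dict mapping a node to its local value
    """
    # List the nodes so that every child comes before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(tree.get(node, []))
    order.reverse()

    futures = {}

    def task(node):
        # Data-flow style: the node's work starts once its children's
        # messages are available. A task only waits on futures that
        # were submitted before it, so the FIFO pool cannot deadlock.
        return local[node] + sum(futures[c].result() for c in tree.get(node, []))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for node in order:
            futures[node] = pool.submit(task, node)
        return {n: f.result() for n, f in futures.items()}

tree = {"root": ["a", "b"], "a": ["c", "d"], "b": []}
local = {"root": 1, "a": 2, "b": 3, "c": 4, "d": 5}
messages = collect(tree, local, "root")
print(messages["root"])   # 15
```

Subtrees "a" and "b" can proceed independently, and the amount of work per node depends on its degree, which is the imbalance the slide points to.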

  17. Atmospheric Sensor Network Validation Framework

  18. Semantic analysis (PNNL, SNL, Cray)
      [Figure: example graph with nodes Mary, John, Alice, Blue bumps, Pink rash, and High Fever joined by "has symptom" edges]
      Mayo Clinic’s patient database has 650K columns
      • Understanding the relationships among data
        • Data intensive science
        • National security
        • Commerce
      • Data and relationships best expressed as triples and graphs
        • <John owns Dog>

  19. XMT’s potential for semantic analysis
      [Figure: job pipeline with stages including JOB 0: Transitive Closure and JOB 3: Delete Duplicates. Original diagram from Urbani et al., "Scalable Distributed Reasoning using MapReduce", ISWC 2009]
      • <John studied under Jim Browne> + <Jim Browne teaches at UT Austin> → <John attended UT Austin>
      • 865 million triples
      • RDFS closure
        • Inferring new relationships and attributes
        • Rule based
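A minimal single-process sketch of rule-based closure over triples, to make the inference step on this slide concrete. It is neither the MapReduce pipeline of Urbani et al. nor an XMT implementation. It applies two rules until nothing new appears: the "studied under / teaches at" rule from the slide, and transitivity of `subClassOf`, which is one of the standard RDFS entailment rules. The function name `closure` and the sample data are hypothetical.

```python
def closure(triples):
    """Apply the two rules to a fixpoint and return the set of triples."""
    facts = set(triples)          # storing a set deletes duplicate triples
    while True:
        # Index (subject, object) pairs by predicate.
        by_pred = {}
        for s, p, o in facts:
            by_pred.setdefault(p, []).append((s, o))

        new = set()
        # Rule from the slide:
        # <X studied under Y> + <Y teaches at Z> -> <X attended Z>
        teaches = by_pred.get("teaches at", [])
        for student, teacher in by_pred.get("studied under", []):
            for t, school in teaches:
                if t == teacher:
                    new.add((student, "attended", school))
        # Transitive closure: <A subClassOf B> + <B subClassOf C> -> <A subClassOf C>
        sub = by_pred.get("subClassOf", [])
        for a, b in sub:
            for b2, c in sub:
                if b == b2:
                    new.add((a, "subClassOf", c))

        new -= facts
        if not new:               # fixpoint reached
            return facts
        facts |= new

data = [
    ("John", "studied under", "Jim Browne"),
    ("Jim Browne", "teaches at", "UT Austin"),
    ("Jim Browne", "teaches at", "UT Austin"),   # duplicate on purpose
    ("A", "subClassOf", "B"),
    ("B", "subClassOf", "C"),
    ("C", "subClassOf", "D"),
]
result = closure(data)
print(("John", "attended", "UT Austin") in result)   # True
print(len(result))                                   # 9
```

The nested loops are joins on a shared node, so the access pattern follows the graph rather than any fixed partition of the triples.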

  20. Summary
      • The new HPC is irregular and sparse
        • Bad news: we need new architectures
        • Good news: there are commercial and consumer applications
      • Shared memory is necessary, but not sufficient
        • Need processors that can fill the memory system with requests
        • Need memory systems that support millions of simultaneous requests
        • Need fine-grain hardware synchronization in memory
