High Performance Data Mining On Multi-core systems

SALSA Team: Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Seung-Hee Bae (Indiana University)
Technology Collaboration: George Chrysanthakopoulos, Henrik Frystyk Nielsen (Microsoft)
Application Collaboration (Indiana University and IUPUI):
• Cheminformatics: Rajarshi Guha, David Wild
• Bioinformatics: Haiku Tang
• Demographics (GIS): Neil Devadasan
SALSA: Service Aggregated Linked Sequential Activities
GOALS:
• Increasing number of cores accompanied by continued data deluge.
• Develop scalable parallel data mining algorithms with good multicore and cluster performance; understand software runtime and parallelization method.
• Use managed code (C#) and package algorithms as services to encourage broad use, assuming experts parallelize core algorithms.
CURRENT RESULTS:
• Microsoft CCR supports MPI, dynamic threading and, via DSS, a service model of computing; detailed performance measurements.
• Speedups of 7.5 or above on 8-core systems for “large problems” with deterministic annealed (avoid local minima) algorithms for clustering, Gaussian Mixtures, GTM (dimensional reduction); extending to new algorithms/applications.

DA Clustering Performance
[Diagram: “Main Thread” with memory M, and subsidiary threads t = 0 to 7, each with its own memory mt.]
[Figure: Fractional Overhead f against 10000/Grain Size for K = 10, 20 and 30 clusters. Annotation: Runtime Fluctuations, 2% to 5% overhead.]
Speedup = Number of cores/(1+f)
f = (Sum of Overheads)/(Computation per core)
Computation ∝ Grain Size n × # Clusters K
Overheads are:
• Synchronization: small with CCR
• Load Balance: good
• Memory Bandwidth Limit: → 0 as K → ∞
• Cache Use/Interference: important
• Runtime Fluctuations: dominant for large n, K
All our “real” problems have f ≤ 0.05 and speedups on 8-core systems greater than 7.6.
Parallel Programming Strategy: use data decomposition as in classic distributed memory, but use shared memory for read variables. Each thread uses a “local” array for written variables to get good cache performance.
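The speedup model on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the three overhead values in the usage example are made-up numbers chosen to sum to f = 0.05, not measurements from the slides.

```python
def fractional_overhead(overheads, computation_per_core):
    """f = (Sum of Overheads) / (Computation per core), in the same time units."""
    return sum(overheads) / computation_per_core


def speedup(n_cores, f):
    """Speedup = Number of cores / (1 + f)."""
    return n_cores / (1.0 + f)


if __name__ == "__main__":
    # Hypothetical overheads (synchronization, cache effects, runtime
    # fluctuations) expressed as fractions of a unit of per-core compute.
    f = fractional_overhead([0.01, 0.02, 0.02], computation_per_core=1.0)
    print(f"f = {f:.3f}, speedup on 8 cores = {speedup(8, f):.2f}")
```

With f = 0.05 the model gives 8/1.05, about 7.62, which matches the slide's statement that f ≤ 0.05 corresponds to speedups above 7.6 on 8 cores.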

Deterministic Annealing Clustering of Indiana Census Data
Decrease temperature (distance scale) to discover more clusters.
[Figure: census clustering maps at two resolutions (Resolution T^0.5). Legend: p: Total, a: Asian, r: Renters, h: Hispanic.]
Stop Press: GTM Projection of PubChem: 10,926,94 compounds in a 166-dimension binary property space takes 4 days on 8 cores. A 64×64 mesh of GTM clusters interpolates PubChem. Could usefully use 1024 cores! David Wild will use this for a GIS-style 2D browsing interface to chemistry.
Bioinformatics: annealed clustering and Euclidean embedding for repetitive sequences, gene/protein families. Use GTM to replace PCA in structure analysis.
[Figure: linear PCA v. nonlinear GTM on 6 Gaussians in 3D.]
[Figure: GTM projection of 2 clusters of 335 compounds in 155 dimensions.]

Parallel Dimensional Scaling and Metric embedding; Generalized Cluster analysis
N data points E(x) in D dim. space and Minimize F by EM
[Diagram: the General Formula specializes to DAC, GM, GTM, DAGTM and DAGM.]
Deterministic Annealing Clustering (DAC)
• $a(x) = 1/N$ or generally $p(x)$ with $\sum_x p(x) = 1$
• $g(k) = 1$ and $s(k) = 0.5$
• $T$ is annealing temperature varied down from $\infty$ with final value of 1
• Vary cluster center $Y(k)$ but can calculate $P_k$ and $\sigma(k)$ (even for matrix $\sigma(k)$) using IDENTICAL formulae for Gaussian mixtures
• $K$ starts at 1 and is incremented by algorithm
Generative Topographic Mapping (GTM)
• $a(x) = 1$ and $g(k) = (1/K)(\beta/2\pi)^{D/2}$
• $s(k) = 1/\beta$ and $T = 1$
• $Y(k) = \sum_{m=1}^{M} W_m \phi_m(X(k))$
• Choose fixed $\phi_m(X) = \exp(-0.5\,(X - \mu_m)^2/\sigma^2)$
• Vary $W_m$ and $\beta$ but fix values of $M$ and $K$ a priori
• $Y(k)$, $E(x)$, $W_m$ are vectors in original high D dimension space
• $X(k)$ and $\mu_m$ are vectors in 2 dim. mapped space
Deterministic Annealing Gaussian mixture models (DAGM)
• $a(x) = 1$
• $g(k) = \{P_k/(2\pi\sigma(k)^2)^{D/2}\}^{1/T}$
• $s(k) = \sigma(k)^2$ (taking case of spherical Gaussian)
• $T$ is annealing temperature varied down from $\infty$ with final value of 1
• Vary $Y(k)$, $P_k$ and $\sigma(k)$
• $K$ starts at 1 and is incremented by algorithm
Traditional Gaussian mixture models (GM)
• As DAGM but set $T = 1$ and fix $K$
DAGTM
• GTM has several natural annealing versions based on either DAC or DAGM: under investigation
We need:
• Large Windows Cluster
• Link of CCR and MPI (or cross cluster CCR)
• Linear Algebra for C# (Multiplication, SVD, Equation Solve)
• High Performance C# Math Libraries
Near Term Future Work: Parallel Algorithms for
• Principal Component Analysis (PCA)
• Random Projection Metric Embedding (Bourgain)
• MDS Dimensional Scaling (EM like SMACOF)
• Marquardt Algorithm for Newton’s Method
Later: HMM and SVM, Other embedding
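The DAC settings on this slide can be illustrated with a small EM loop. The following is a minimal sketch, not the SALSA implementation (which is C# on CCR). It assumes the standard deterministic annealing soft assignment, p(k|x) proportional to exp(-|E(x) - Y(k)|^2 / T), which is what g(k) = 1 and s(k) = 0.5 suggest; the transcript does not show the general formula itself, so that form is an assumption. The sketch also keeps K fixed, whereas the slide says K starts at 1 and is incremented by the algorithm. All function names, the cooling factor and the number of EM steps per temperature are hypothetical.

```python
import math


def da_responsibilities(points, centers, temperature):
    """E-step: soft assignment p(k|x) proportional to exp(-|x - Y(k)|^2 / T)."""
    resp = []
    for x in points:
        d2 = [sum((xi - yi) ** 2 for xi, yi in zip(x, y)) for y in centers]
        dmin = min(d2)  # shift by the minimum distance for numerical stability
        w = [math.exp(-(d - dmin) / temperature) for d in d2]
        total = sum(w)
        resp.append([wk / total for wk in w])
    return resp


def da_update_centers(points, resp, n_clusters):
    """M-step: each center Y(k) becomes the responsibility-weighted mean."""
    dim = len(points[0])
    centers = []
    for k in range(n_clusters):
        weight = sum(r[k] for r in resp)
        centers.append([
            sum(r[k] * x[d] for r, x in zip(resp, points)) / weight
            for d in range(dim)
        ])
    return centers


def da_cluster(points, centers, t_start, t_final, cooling=0.9, em_steps=20):
    """Lower T from t_start to t_final, running a few EM steps at each T."""
    t = t_start
    while True:
        for _ in range(em_steps):
            resp = da_responsibilities(points, centers, t)
            centers = da_update_centers(points, resp, len(centers))
        if t <= t_final:
            return centers
        t = max(t * cooling, t_final)


if __name__ == "__main__":
    # Toy 1-D data with two obvious groups; the final temperature is 1,
    # matching the final value quoted on the slide.
    data = [[0.0], [1.0], [9.0], [10.0]]
    print(da_cluster(data, [[2.0], [8.0]], t_start=10.0, t_final=1.0))
```

At high temperature every point contributes to every center, so the centers sit close together; as T drops the assignments harden and the centers move toward the group means. A fuller version would start with K = 1 and split clusters as T passes each critical value, as the slide describes.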
