
Understanding the SIMD Efficiency of Graph Traversal on GPU

Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng. University of Science and Technology of China.


Presentation Transcript


  1. Understanding the SIMD Efficiency of Graph Traversal on GPU. Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng. University of Science and Technology of China

  2. Breadth-first Search (BFS)
     [Figure: example graph with vertices A to I, each labeled with its BFS level (1 to 4) from the source.]

  3. Breadth-first Search (BFS)
     BFS_Iteration:
       for u ∈ CurrentFrontier do
         for v ∈ u's neighbors do
           if v has not been labeled then
             label v
             put v in NextFrontier
     [Figure: example graph with vertices A to I.]
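The BFS_Iteration pseudocode above can be sketched sequentially as follows. This is a minimal illustration, not the authors' GPU implementation: the function name `bfs_levels`, the dictionary-based adjacency list, and the small example graph are all assumptions made for the example.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: label every reachable vertex with its depth.

    adj maps each vertex to a list of its neighbors (hypothetical layout).
    """
    level = {source: 0}          # the source is labeled first
    frontier = [source]          # CurrentFrontier
    while frontier:
        next_frontier = []       # NextFrontier
        for u in frontier:
            for v in adj[u]:
                if v not in level:            # v has not been labeled
                    level[v] = level[u] + 1   # label v
                    next_frontier.append(v)   # put v in NextFrontier
        frontier = next_frontier
    return level


# Hypothetical 4-vertex graph, not the one drawn on the slide.
example = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_levels(example, "A"))
```

Each pass of the `while` loop corresponds to one BFS iteration in the slide: it consumes the current frontier and produces the next one.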

  4. Application of BFS
     • Many real-world datasets are represented as graphs
       • VLSI circuits
       • Social relationships
       • Road connections
     • Primitive for building complex algorithms
       • Path-finding
       • Belief propagation
       • Points-to analysis (PTA)

  5. The Problem
     • GPUs rely on high SIMD lane occupancy to boost performance
     • 100% efficiency is achieved only if all SIMD lanes follow the same path
       do_something_common();
       if (thread_id > 5) {
         do_something_red();
       } else {
         do_something_blue();
       }
     100% utilization

  6. The Problem
     • GPUs rely on high SIMD lane occupancy to boost performance
     • 100% efficiency is achieved only if all SIMD lanes follow the same path
       do_something_common();
       if (thread_id > 5) {
         do_something_red();
       } else {
         do_something_blue();
       }
     37.5% utilization

  7. The Problem
     • GPUs rely on high SIMD lane occupancy to boost performance
     • 100% efficiency is achieved only if all SIMD lanes follow the same path
       do_something_common();
       if (thread_id > 5) {
         do_something_red();
       } else {
         do_something_blue();
       }
     62.5% utilization
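The utilization figures on slides 5 to 7 can be reproduced with a small sketch. It assumes an 8-lane SIMD unit with thread ids numbered 1 to 8; the slides state neither value, and they are chosen here only because they yield the 37.5% and 62.5% shown. `branch_utilization` is a hypothetical helper name.

```python
def branch_utilization(num_lanes, predicate):
    """Return the fraction of lanes active on each side of a divergent branch.

    Assumes thread ids run from 1 to num_lanes (an assumption, see lead-in).
    """
    thread_ids = range(1, num_lanes + 1)
    taken = sum(1 for tid in thread_ids if predicate(tid))
    return taken / num_lanes, (num_lanes - taken) / num_lanes


# The condition from the slide: if (thread_id > 5) red, else blue.
red, blue = branch_utilization(8, lambda tid: tid > 5)
print(f"red branch:  {red:.1%}")
print(f"blue branch: {blue:.1%}")
```

Because the two branches execute one after the other, the lanes on the inactive side sit idle during each branch, which is why neither side reaches the 100% of the common code.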

  8. Traditional Implementation
     The number of sub-iterations depends on the size of u's adjacency list
       GPU_BFS_Iteration:
         u = C[tid]
         for v ∈ u's neighbors do
         end for
     task 1 = 4 sub-iterations
     task 2 = 2 sub-iterations
     …
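To see why unequal adjacency lists hurt this thread-per-vertex mapping, the following sketch counts lane usage when lanes run in lockstep: all lanes keep stepping until the longest list is exhausted. The function name `thread_per_vertex_efficiency` and the two-lane example are assumptions for illustration; the degrees 4 and 2 are the task sizes from the slide.

```python
def thread_per_vertex_efficiency(degrees):
    """One lane per vertex; lanes loop in lockstep over their adjacency lists.

    degrees: adjacency-list sizes of the vertices assigned to the lanes.
    Returns (sub-iterations executed, fraction of lane slots doing useful work).
    """
    steps = max(degrees)              # the longest list decides the loop count
    if steps == 0:
        return 0, 1.0                 # nothing to do, nothing wasted
    useful = sum(degrees)             # one lane slot per edge visited
    return steps, useful / (steps * len(degrees))


# task 1 has 4 neighbors, task 2 has 2 (values taken from the slide).
steps, efficiency = thread_per_vertex_efficiency([4, 2])
print(steps, efficiency)
```

In this two-lane example the second lane is idle for half of the sub-iterations, so a quarter of all lane slots are wasted.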

  9. Visualizing the Irregularity
     [Figure annotations: highly skewed; outlier exists; distributed over a wide range; irregular but concentrated; vertex range < 8.]

  10. Alternative Way
      • Assign a warp of threads to each task
      • Vectorize the sub-iterations!
      So, what is the relationship between graph topology and SIMD efficiency?

  11. Topology and Utilization
      • Assign a group of threads to each vertex
      [Figure: a warp divided into groups of threads.]
      task 1 = 2 sub-iterations
      task 2 = 1 sub-iteration
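The drop in sub-iteration counts follows from sharing each adjacency list among the threads of a group. A minimal sketch, assuming a group of S threads processes S neighbors per sub-iteration; the degrees 4 and 2 come from the traditional-implementation slide, and the group size of 2 is inferred from the counts shown here rather than stated in the slides. `sub_iterations` is a hypothetical name.

```python
def sub_iterations(degree, group_size):
    """Sub-iterations a group needs to cover an adjacency list (ceiling division)."""
    return (degree + group_size - 1) // group_size


for name, degree in (("task 1", 4), ("task 2", 2)):
    print(name,
          "one thread:", sub_iterations(degree, 1),
          "| group of 2:", sub_iterations(degree, 2))
```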

  12. Topology and Utilization
      Divide the SIMD underutilization into two parts:
      • InteR-group underutilization (UR)
      • IntrA-group underutilization (UA)
      [Figure: SIMD window.]

  13. Conclusions from the Model
      • UR is induced by the heterogeneity of workloads
        • Affected by the graph topology
      • UR is sensitive to the group size (S)
        • A large logical SIMD window can narrow the gap
        • When S = 32, UR = 0
      • UA is determined by the intrinsic irregularity of vertex degree
        • It can be limited by shrinking S
        • When S = 1, UA = 0
      • UR and UA can convert to each other

  14. Comparing Different Mapping Strategies
      [Figure: mapping strategies compared by scalability (poor to good) and expansion rate in ME/s (low to high).]

  15. Evaluating the SIMD Efficiency
      • Metrics derived from the model:
        UR = inter-group underutilization
        UA = intra-group underutilization
        ME = mapping efficiency
        UR + UA + ME = 100%
      • Captures the utilization trend with increasing S
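The slides name the three metrics but do not give their formulas, so the sketch below is one plausible formalization rather than the authors' definition. It assumes a 32-lane warp split into groups of S threads, one vertex per group, and lockstep execution until the slowest group in the warp finishes. Lane slots idle inside a working group count toward UA; slots of groups that are waiting (or have no vertex) count toward UR. The name `simd_breakdown` and the degree list are hypothetical.

```python
def simd_breakdown(degrees, group_size, warp_size=32):
    """Return (UR, UA, ME) as fractions of all lane slots (assumed model)."""
    assert warp_size % group_size == 0
    groups_per_warp = warp_size // group_size
    total = useful = intra = inter = 0
    for start in range(0, len(degrees), groups_per_warp):
        batch = degrees[start:start + groups_per_warp]   # vertices of one warp
        iters = [(d + group_size - 1) // group_size for d in batch]
        steps = max(iters)                 # warp runs until its slowest group ends
        total += steps * warp_size
        useful += sum(batch)               # one lane slot per edge visited
        # Idle lanes inside a group's own sub-iterations (partial last pass).
        intra += sum(it * group_size - d for it, d in zip(iters, batch))
        # Whole groups idle: finished early, or no vertex assigned to them.
        inter += sum((steps - it) * group_size for it in iters)
        inter += (groups_per_warp - len(batch)) * steps * group_size
    if total == 0:
        return 0.0, 0.0, 0.0
    return inter / total, intra / total, useful / total


degrees = [4, 2, 7, 1, 0, 3, 12, 5]        # made-up vertex degrees
for s in (1, 2, 4, 8, 16, 32):
    ur, ua, me = simd_breakdown(degrees, s)
    print(f"S={s:2d}  UR={ur:.3f}  UA={ua:.3f}  ME={me:.3f}")
```

Under these assumptions the three fractions always sum to 1, UR vanishes when S equals the warp size (one group per warp has nobody to wait for), and UA vanishes when S is 1 (a single-thread group cannot be partially filled), which matches the limiting cases stated in the slides.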

  16. Explaining the Result
      [Figure: scalability (poor to good) versus expansion rate in ME/s (low to high).]
      Annotation: alleviates the UR while introducing minor UA

  17. Explaining the Result
      [Figure: scalability (poor to good) versus expansion rate in ME/s (low to high).]
      Annotation: ME at a high level (~80%)

  18. Explaining the Result
      [Figure: scalability (poor to good) versus expansion rate in ME/s (low to high).]
      Annotation: outweighed by the fast-growing UA

  19. Explaining the Result
      [Figure: scalability (poor to good) versus expansion rate in ME/s (low to high).]
      Annotation: does little to help UR but leads to severe UA

  20. Conclusion
      • Study the link between graph topology and hardware utilization
      • Present a model for analyzing the components of SIMD underutilization
      • Discover that SIMD lanes are wasted due to:
        • imbalance of the vertex degree distribution
        • heterogeneity of each vertex degree
      • Develop 3 metrics for quantifying SIMD efficiency
      • Provide a foundation for developing static analysis and runtime optimization techniques

  21. Q&A
