


1. Lecture 10: Patterns for Parallel Programming III
John Cavazos, Dept of Computer & Information Sciences, University of Delaware
www.cis.udel.edu/~cavazos/cisc879

2. Lecture 10: Overview
• Cell B.E. Clarification
• Design Patterns for Parallel Programs
  • Finding Concurrency
  • Algorithmic Structure
    • Organize by Tasks
    • Organize by Data
  • Supporting Structures

3. LS-LS DMA transfer (PPU)

#include <stdio.h>
#include <pthread.h>
#include <libspe2.h>

/* N, struct thread_args, and my_spe_thread are defined on an earlier slide. */

int main() {
    pthread_t pts[N];
    spe_context_ptr_t spe[N];
    struct thread_args t_args[N];
    int i;
    spe_program_handle_t *program;

    /* Load the SPU program and start one context/thread per SPE. */
    program = spe_image_open("../spu/hello");
    for (i = 0; i < N; i++) {
        spe[i] = spe_context_create(0, NULL);
        spe_program_load(spe[i], program);
        t_args[i].spe = spe[i];
        t_args[i].spuid = i;
        pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
    }

    /* Effective address of SPE 1's local store, as mapped for the PPU. */
    void *ls = spe_ls_area_get(spe[1]);
    unsigned int mbox_data = (unsigned int)ls;
    printf("mbox_data %x\n", mbox_data);
    int rc;

    /* Hand SPE 1's LS address to SPE 0, then wait until SPE 0 signals that
       its LS-to-LS DMA has completed. */
    rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);

    /* Release all SPEs so each prints its (possibly overwritten) timestamp. */
    for (i = 0; i < N; i++) {
        rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    }

    for (i = 0; i < N; i++) {
        pthread_join(pts[i], NULL);
    }
    spe_image_close(program);
    for (i = 0; i < N; i++) {
        spe_context_destroy(spe[i]);
    }
    return 0;
}


6. LS-LS DMA transfer (SPU)

#include <stdio.h>
#include <sys/time.h>
#include <spu_mfcio.h>

/* tv (a struct timeval) and spuid are assumed defined globally on an earlier slide;
   spuid is the identifier passed in by the PPU thread. */

int main() {
    gettimeofday(&tv, NULL);
    printf("spu %lld; t.tv_usec %ld\n", spuid, tv.tv_usec);

    if (spuid == 0) {
        unsigned int ea;
        unsigned int tag = 0;
        unsigned int mask = 1;

        /* EA of SPE 1's local store, sent by the PPU. */
        ea = spu_read_in_mbox();
        printf("ea = %p\n", (void *)ea);

        /* DMA this SPE's tv into tv in SPE 1's local store; wait for completion. */
        mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);
        mfc_write_tag_mask(mask);
        mfc_read_tag_status_all();

        /* Signal the PPU that the transfer is done. */
        spu_write_out_intr_mbox(0);
    }

    /* Wait for the PPU's "go", then print the (possibly overwritten) time. */
    spu_read_in_mbox();
    printf("spu %lld; tv.tv_usec = %ld\n", spuid, tv.tv_usec);
    return 0;
}

7. LS-LS Output

-bash-3.2$ ./a.out
spu 0; t.tv_usec = 875360
spu 1; t.tv_usec = 876446
spu 2; t.tv_usec = 877443
spu 3; t.tv_usec = 878459
mbox_data f7764000
ea = 0xf7764000
spu 0; tv.tv_usec = 875360
spu 1; tv.tv_usec = 875360
spu 2; tv.tv_usec = 877443
spu 3; tv.tv_usec = 878459

Note that after the DMA, SPE 1's tv.tv_usec matches SPE 0's value (875360): SPE 0 overwrote SPE 1's copy through the LS-to-LS transfer.

8. Organize by Data
• Operations on a core data structure
• Geometric Decomposition
• Recursive Data

9. Geometric Decomposition
• Arrays and other linear structures
• Divide into contiguous substructures
• Example: Matrix multiply
  • A data-centric algorithm on a linear data structure (array) implies geometric decomposition
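A minimal sketch of this idea in C with pthreads, not code from the lecture: each thread owns a contiguous band of rows of the output matrix, which is the geometric (block) decomposition of the array. SIZE, NTHREADS, and the helper name mm_band are illustrative assumptions.

#include <pthread.h>

#define SIZE 512
#define NTHREADS 4

static float A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

/* Each worker computes a contiguous band of rows of C: a block-row
   (geometric) decomposition of the output array. */
static void *mm_band(void *arg) {
    long id = (long)arg;
    int rows = SIZE / NTHREADS;
    int lo = (int)id * rows;
    int hi = (id == NTHREADS - 1) ? SIZE : lo + rows;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < SIZE; j++) {
            float sum = 0.0f;
            for (int k = 0; k < SIZE; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, mm_band, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    return 0;
}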

10. Recursive Data
• Lists, trees, and graphs
• Structures where you would typically use divide-and-conquer
• It may seem that you can only move sequentially through the data structure
• But there are ways to expose concurrency

11. Recursive Data Example
• Find the Root: given a forest of directed trees, find the root of the tree containing each node
• Parallel approach: for each node, replace its successor with its successor's successor
  • Repeat until nothing changes
• O(log n) steps vs. O(n)
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
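A sequential sketch of this pointer-jumping step in C; the array name succ and the convention that a root satisfies succ[v] == v are illustrative assumptions. Each pass replaces every node's successor with its successor's successor, roughly halving the distance to the root, so O(log n) passes suffice. In the parallel pattern the loop over nodes runs concurrently, typically double-buffered so every read sees the previous pass.

/* succ[v] is v's parent in the forest; roots satisfy succ[v] == v. */
void find_roots(int *succ, int n) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < n; v++) {   /* this loop is the parallel step */
            int s = succ[succ[v]];      /* jump to successor's successor */
            if (s != succ[v]) {
                succ[v] = s;
                changed = 1;
            }
        }
    }
}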

12. Organize by Flow of Data (decision tree)
• Regular flow: Pipeline
• Irregular flow: Event-Based Coordination

13. Organize by Flow of Data
• Computation can be viewed as a flow of data passing through a sequence of stages
• Pipeline: one-way, predictable communication
• Event-Based Coordination: unrestricted, unpredictable communication

14. Pipeline performance
• Concurrency is limited by pipeline depth
• Balance computation and communication (architecture dependent)
• Stages should be equally computationally intensive
  • The slowest stage creates the bottleneck
  • Combine lightly loaded stages or decompose heavily loaded stages
• Time to fill and drain the pipe should be small
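A two-stage pipeline sketch in C with pthreads and a small bounded buffer between the stages, included only to make the pattern concrete; the queue size, item count, and stage bodies are illustrative assumptions, not part of the lecture.

#include <pthread.h>
#include <stdio.h>

#define NITEMS 16
#define QSIZE 4

/* Bounded FIFO connecting stage 1 to stage 2. */
static int queue[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void put(int x) {
    pthread_mutex_lock(&lock);
    while (count == QSIZE) pthread_cond_wait(&not_full, &lock);
    queue[tail] = x; tail = (tail + 1) % QSIZE; count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int get(void) {
    pthread_mutex_lock(&lock);
    while (count == 0) pthread_cond_wait(&not_empty, &lock);
    int x = queue[head]; head = (head + 1) % QSIZE; count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return x;
}

/* Stage 1: produce items (here, just their squares). */
static void *stage1(void *arg) {
    for (int i = 0; i < NITEMS; i++) put(i * i);
    return NULL;
}

/* Stage 2: consume and print items produced by stage 1. */
static void *stage2(void *arg) {
    for (int i = 0; i < NITEMS; i++) printf("%d\n", get());
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}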

15. Supporting Structures
• Single Program Multiple Data (SPMD)
• Loop Parallelism
• Master/Worker
• Fork/Join

16. SPMD Pattern
• Create a single program that runs on each processor
  • Initialize
  • Obtain a unique identifier
  • Run the same program on each processor
    • The identifier and the input data differentiate behavior
  • Distribute data (if any)
  • Finalize
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
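A minimal SPMD sketch using MPI, not code from the lecture; the sum computation and the strided data split are illustrative assumptions. Every rank runs the same program, the rank is the unique identifier that selects each process's share of the work, and a reduction combines the per-rank results.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* Initialize */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* Obtain a unique identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n = 1000, local_sum = 0, total = 0;
    /* Distribute data: each rank takes a strided share of 0..n-1. */
    for (long i = rank; i < n; i += size)
        local_sum += i;

    /* Combine the per-rank results on rank 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %ld\n", total);

    MPI_Finalize();                         /* Finalize */
    return 0;
}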

17. SPMD Challenges
• Split the data correctly
• Correctly combine the results
• Achieve an even distribution of work
• If the program requires dynamic load balancing, another pattern (e.g., Job Queue) may be more suitable
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007

18. Loop Parallelism Pattern
• Many programs are expressed as iterative constructs (loops)
• Programming models like OpenMP provide pragmas to automatically assign loop iterations to processors
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
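A minimal OpenMP example of the pattern; the function scale and its arguments are illustrative, not from the lecture. The pragma tells the runtime to split the independent loop iterations across the available processors.

/* Each iteration is independent, so OpenMP can hand blocks of
   iterations to different processors. */
void scale(float *a, const float *b, float s, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}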

19. Master/Worker Pattern
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007

20. Master/Worker Pattern
• Relevant where tasks have no dependencies on each other
  • Embarrassingly parallel problems
• The main problem is determining when the entire computation is complete
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
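A small master/worker sketch in C with pthreads; NTASKS, NWORKERS, and the trivial task body are illustrative assumptions. Workers pull task indices from a shared counter, and joining all workers is one simple way for the master to detect that the entire problem is complete.

#include <pthread.h>
#include <stdio.h>

#define NTASKS 100
#define NWORKERS 4

static int next_task = 0;               /* master's shared task counter */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double result[NTASKS];

/* Each worker repeatedly claims the next unassigned task; once the
   counter passes NTASKS every task has been handed out, so it exits. */
static void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = next_task++;
        pthread_mutex_unlock(&lock);
        if (t >= NTASKS) return NULL;
        result[t] = t * 0.5;            /* stand-in for real, independent work */
    }
}

int main(void) {
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i], NULL);       /* all tasks done once every worker returns */
    printf("done\n");
    return 0;
}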

21. Fork/Join Pattern
• The parent creates new tasks (fork), then waits until they complete (join)
• Tasks are created dynamically
• Tasks can create more tasks
• Tasks are managed according to their relationships
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
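A fork/join sketch using OpenMP tasks; the Fibonacci workload is only an illustrative example. Each call forks two child tasks, waits for them with taskwait (the join), and the children fork further tasks recursively.

#include <stdio.h>

static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);                      /* forked child task */
    #pragma omp task shared(y)
    y = fib(n - 2);                      /* forked child task */
    #pragma omp taskwait                 /* join: wait for both children */
    return x + y;
}

int main(void) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single                   /* one thread forks the root task tree */
    r = fib(20);
    printf("%ld\n", r);
    return 0;
}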
