


1. Lecture 10: Patterns for Parallel Programming III
John Cavazos, Dept of Computer & Information Sciences, University of Delaware
www.cis.udel.edu/~cavazos/cisc879

2. Lecture 10: Overview
• Cell B.E. Clarification
• Design Patterns for Parallel Programs
  • Finding Concurrency
  • Algorithmic Structure
    • Organize by Tasks
    • Organize by Data
  • Supporting Structures

3. LS-LS DMA transfer (PPU)

#include <stdio.h>
#include <pthread.h>
#include <libspe2.h>

/* N, struct thread_args, and my_spe_thread are defined on an earlier slide. */

int main() {
    pthread_t pts[N];
    spe_context_ptr_t spe[N];
    struct thread_args t_args[N];
    int i;
    spe_program_handle_t *program;

    /* Load the SPU program and start one context/thread per SPE. */
    program = spe_image_open("../spu/hello");
    for (i = 0; i < N; i++) {
        spe[i] = spe_context_create(0, NULL);
        spe_program_load(spe[i], program);
        t_args[i].spe = spe[i];
        t_args[i].spuid = i;
        pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
    }

    /* Effective address of SPE 1's local store, as mapped for the PPU. */
    void *ls = spe_ls_area_get(spe[1]);
    unsigned int mbox_data = (unsigned int)ls;
    printf("mbox_data %x\n", mbox_data);
    int rc;

    /* Hand SPE 1's LS address to SPE 0, then wait until SPE 0 signals that
       its LS-to-LS DMA has completed. */
    rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);

    /* Release all SPEs so each prints its (possibly overwritten) timestamp. */
    for (i = 0; i < N; i++) {
        rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
    }

    for (i = 0; i < N; i++) {
        pthread_join(pts[i], NULL);
    }
    spe_image_close(program);
    for (i = 0; i < N; i++) {
        spe_context_destroy(spe[i]);
    }
    return 0;
}


6. LS-LS DMA transfer (SPU)

#include <stdio.h>
#include <sys/time.h>
#include <spu_mfcio.h>

/* tv (a struct timeval) and spuid are assumed defined globally on an earlier slide;
   spuid is the identifier passed in by the PPU thread. */

int main() {
    gettimeofday(&tv, NULL);
    printf("spu %lld; t.tv_usec %ld\n", spuid, tv.tv_usec);

    if (spuid == 0) {
        unsigned int ea;
        unsigned int tag = 0;
        unsigned int mask = 1;

        /* EA of SPE 1's local store, sent by the PPU. */
        ea = spu_read_in_mbox();
        printf("ea = %p\n", (void *)ea);

        /* DMA this SPE's tv into tv in SPE 1's local store; wait for completion. */
        mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);
        mfc_write_tag_mask(mask);
        mfc_read_tag_status_all();

        /* Signal the PPU that the transfer is done. */
        spu_write_out_intr_mbox(0);
    }

    /* Wait for the PPU's "go", then print the (possibly overwritten) time. */
    spu_read_in_mbox();
    printf("spu %lld; tv.tv_usec = %ld\n", spuid, tv.tv_usec);
    return 0;
}

7. LS-LS Output

-bash-3.2$ ./a.out
spu 0; t.tv_usec = 875360
spu 1; t.tv_usec = 876446
spu 2; t.tv_usec = 877443
spu 3; t.tv_usec = 878459
mbox_data f7764000
ea = 0xf7764000
spu 0; tv.tv_usec = 875360
spu 1; tv.tv_usec = 875360
spu 2; tv.tv_usec = 877443
spu 3; tv.tv_usec = 878459

Note that after the DMA, SPE 1's tv.tv_usec matches SPE 0's value (875360): SPE 0 overwrote SPE 1's copy through the LS-to-LS transfer.

8. Organize by Data
• Operations on a core data structure
• Geometric Decomposition
• Recursive Data

9. Geometric Decomposition
• Arrays and other linear structures
• Divide into contiguous substructures
• Example: Matrix multiply
  • A data-centric algorithm on a linear data structure (array) implies geometric decomposition
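A minimal sketch of this idea in C with pthreads, not code from the lecture: each thread owns a contiguous band of rows of the output matrix, which is the geometric (block) decomposition of the array. SIZE, NTHREADS, and the helper name mm_band are illustrative assumptions.

#include <pthread.h>

#define SIZE 512
#define NTHREADS 4

static float A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

/* Each worker computes a contiguous band of rows of C: a block-row
   (geometric) decomposition of the output array. */
static void *mm_band(void *arg) {
    long id = (long)arg;
    int rows = SIZE / NTHREADS;
    int lo = (int)id * rows;
    int hi = (id == NTHREADS - 1) ? SIZE : lo + rows;
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < SIZE; j++) {
            float sum = 0.0f;
            for (int k = 0; k < SIZE; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, mm_band, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    return 0;
}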

10. Recursive Data
• Lists, trees, and graphs
• Structures where you would typically use divide-and-conquer
• It may seem that you can only move sequentially through the data structure
• But there are ways to expose concurrency

11. Recursive Data Example
• Find the Root: given a forest of directed trees, find the root of the tree containing each node
• Parallel approach: for each node, replace its successor with its successor's successor
  • Repeat until nothing changes
• O(log n) steps vs. O(n)
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
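A sequential sketch of this pointer-jumping step in C; the array name succ and the convention that a root satisfies succ[v] == v are illustrative assumptions. Each pass replaces every node's successor with its successor's successor, roughly halving the distance to the root, so O(log n) passes suffice. In the parallel pattern the loop over nodes runs concurrently, typically double-buffered so every read sees the previous pass.

/* succ[v] is v's parent in the forest; roots satisfy succ[v] == v. */
void find_roots(int *succ, int n) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < n; v++) {   /* this loop is the parallel step */
            int s = succ[succ[v]];      /* jump to successor's successor */
            if (s != succ[v]) {
                succ[v] = s;
                changed = 1;
            }
        }
    }
}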

12. Organize by Flow of Data (decision tree)
• Regular flow: Pipeline
• Irregular flow: Event-Based Coordination

13. Organize by Flow of Data
• Computation can be viewed as a flow of data passing through a sequence of stages
• Pipeline: one-way, predictable communication
• Event-Based Coordination: unrestricted, unpredictable communication

14. Pipeline performance
• Concurrency is limited by pipeline depth
• Balance computation and communication (architecture dependent)
• Stages should be equally computationally intensive
  • The slowest stage creates the bottleneck
  • Combine lightly loaded stages or decompose heavily loaded stages
• Time to fill and drain the pipe should be small
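A two-stage pipeline sketch in C with pthreads and a small bounded buffer between the stages, included only to make the pattern concrete; the queue size, item count, and stage bodies are illustrative assumptions, not part of the lecture.

#include <pthread.h>
#include <stdio.h>

#define NITEMS 16
#define QSIZE 4

/* Bounded FIFO connecting stage 1 to stage 2. */
static int queue[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void put(int x) {
    pthread_mutex_lock(&lock);
    while (count == QSIZE) pthread_cond_wait(&not_full, &lock);
    queue[tail] = x; tail = (tail + 1) % QSIZE; count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int get(void) {
    pthread_mutex_lock(&lock);
    while (count == 0) pthread_cond_wait(&not_empty, &lock);
    int x = queue[head]; head = (head + 1) % QSIZE; count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return x;
}

/* Stage 1: produce items (here, just their squares). */
static void *stage1(void *arg) {
    for (int i = 0; i < NITEMS; i++) put(i * i);
    return NULL;
}

/* Stage 2: consume and print items produced by stage 1. */
static void *stage2(void *arg) {
    for (int i = 0; i < NITEMS; i++) printf("%d\n", get());
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}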

15. Supporting Structures
• Single Program Multiple Data (SPMD)
• Loop Parallelism
• Master/Worker
• Fork/Join

16. SPMD Pattern
• Create a single program that runs on each processor
  • Initialize
  • Obtain a unique identifier
  • Run the same program on each processor
    • The identifier and the input data differentiate behavior
  • Distribute data (if any)
  • Finalize
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
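A minimal SPMD sketch using MPI, not code from the lecture; the sum computation and the strided data split are illustrative assumptions. Every rank runs the same program, the rank is the unique identifier that selects each process's share of the work, and a reduction combines the per-rank results.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* Initialize */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* Obtain a unique identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n = 1000, local_sum = 0, total = 0;
    /* Distribute data: each rank takes a strided share of 0..n-1. */
    for (long i = rank; i < n; i += size)
        local_sum += i;

    /* Combine the per-rank results on rank 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %ld\n", total);

    MPI_Finalize();                         /* Finalize */
    return 0;
}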

17. SPMD Challenges
• Split the data correctly
• Correctly combine the results
• Achieve an even distribution of work
• If the program requires dynamic load balancing, another pattern (e.g., Job Queue) may be more suitable
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007

18. Loop Parallelism Pattern
• Many programs are expressed as iterative constructs (loops)
• Programming models like OpenMP provide pragmas to automatically assign loop iterations to processors
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
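A minimal OpenMP example of the pattern; the function scale and its arguments are illustrative, not from the lecture. The pragma tells the runtime to split the independent loop iterations across the available processors.

/* Each iteration is independent, so OpenMP can hand blocks of
   iterations to different processors. */
void scale(float *a, const float *b, float s, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = s * b[i];
}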

19. Master/Worker Pattern
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007

20. Master/Worker Pattern
• Relevant where tasks have no dependencies on each other
  • Embarrassingly parallel problems
• The main problem is determining when the entire computation is complete
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
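A small master/worker sketch in C with pthreads; NTASKS, NWORKERS, and the trivial task body are illustrative assumptions. Workers pull task indices from a shared counter, and joining all workers is one simple way for the master to detect that the entire problem is complete.

#include <pthread.h>
#include <stdio.h>

#define NTASKS 100
#define NWORKERS 4

static int next_task = 0;               /* master's shared task counter */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double result[NTASKS];

/* Each worker repeatedly claims the next unassigned task; once the
   counter passes NTASKS every task has been handed out, so it exits. */
static void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = next_task++;
        pthread_mutex_unlock(&lock);
        if (t >= NTASKS) return NULL;
        result[t] = t * 0.5;            /* stand-in for real, independent work */
    }
}

int main(void) {
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i], NULL);       /* all tasks done once every worker returns */
    printf("done\n");
    return 0;
}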

21. Fork/Join Pattern
• The parent creates new tasks (fork), then waits until they complete (join)
• Tasks are created dynamically
• Tasks can create more tasks
• Tasks are managed according to their relationships
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
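A fork/join sketch using OpenMP tasks; the Fibonacci workload is only an illustrative example. Each call forks two child tasks, waits for them with taskwait (the join), and the children fork further tasks recursively.

#include <stdio.h>

static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);                      /* forked child task */
    #pragma omp task shared(y)
    y = fib(n - 2);                      /* forked child task */
    #pragma omp taskwait                 /* join: wait for both children */
    return x + y;
}

int main(void) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single                   /* one thread forks the root task tree */
    r = fib(20);
    printf("%ld\n", r);
    return 0;
}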
