Course Goals


Presentation Transcript


  1. Course Goals
  1. Introduction to the theory of parallel algorithms. Parallel algorithmic thinking: obtaining good speed-ups over the best serial algorithm. Class presentations. Study the theory of parallel algorithms: design and asymptotic analysis of parallel algorithms. Programming: hard speedups on real HW; “most” advanced algorithms → understanding; YOU can do it; off-line…
  2. Feasibility of an on-chip general-purpose parallel computer. Focus: single-task completion; throughput is important, but a different matter.
  3. Overview: trends affecting CPUs and/or GPUs. Report:
  • Greater integration of CPU and parallel processors
  • More shared memory over local memory
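To make “parallel algorithmic thinking” concrete, here is a minimal sketch (my own illustration, not from the course materials) of the textbook first PRAM example: balanced-tree summation, O(n) work and O(log n) depth, versus the O(n)-depth serial loop. OpenMP stands in, loosely, for a PRAM-like model:

```c
#include <stdio.h>

/* Balanced-tree summation: in each of the ceil(log2 n) rounds,
 * element i absorbs element i + stride, halving the live range.
 * Work O(n), depth O(log n): the classic PRAM warm-up example.
 * Note: the input array is overwritten with partial sums. */
long tree_sum(long *a, int n) {
    for (int stride = 1; stride < n; stride *= 2) {
        /* all additions within a round are independent */
        #pragma omp parallel for
        for (int i = 0; i < n - stride; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}

int main(void) {
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld\n", tree_sum(a, 8));  /* prints 36 */
    return 0;
}
```

The point is the round structure: the additions within one round are mutually independent, so with enough processors each round takes constant time.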

  2. Is mainstream computing on an XMT trajectory? Argued: it is happening through:
  • Integration
  • Support of fine-grained irregular apps: (beyond deep) machine learning*
  • Business competition
  *SW driving the market place. On machine-learning apps: XMT 3.3X over a top-of-the-line GPU; GPU improved over multi-core.

  3. Tighter integration of CPU & parallel execution
  Multi-cores: integrated CPU and integrated graphics; e.g., 72 graphics processors (GrPUs) on chip in 2017 (Intel).
  Issue: is parallelism executed by contractors or by subcontractors? Image: independent players vs. a conductor/pianist + orchestra.
  Classic multi-core: contractors (all processors equal).
  Recent multi-cores: the CPU is the conductor of the GrPUs; CPU↔GrPUs: tight private↔shared cache transition.
  GPUs: in the past, CPU ↔ GPU exchanged jobs “over a fence”; uniform address space as of the 2010s (Nvidia). But there is more…

  4. Traditional GPUs: optimize pico-Joules vs. optimize irregular parallel algorithms
  Dally, 2009¹. Motivation: data motion is energy-costly.
  • Minimize expensive data movement
  • Optimize use of scarce bandwidth
  • Provide a rich, explicitly managed storage hierarchy to reduce demand and increase utilization
  • “Efficiency = Locality” → far behind on irregular apps. We were not shy pointing this out…
  Dally, US patent, 2015². Local caches mean separate memories and separate functionalities: bad use of capacities (e.g., bandwidth). Instead: combine several logically separate memories into a single unified memory, a single set of shared memory banks. Not local! Dynamic allocation of bandwidth.³ Similar to the UMD PRAM-On-Chip XMT architecture balance of fine-grained access and locality!
  ¹ Source: keynote “End of denial architectures”. (Per the 2015 patent: “Traditional GPUs”.)
  ² The patent “Unified Streaming Multiprocessor Memory” reflects the increasing size of shared caches in GPUs starting ~2012, with much better performance on irregular parallel programs.
  ³ See the streaming multiprocessor of the Nov ’17 Nvidia Volta, including its register file.
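As a hedged illustration of what “fine-grained irregular” means in practice (my own example, not from the slide): in a sparse matrix-vector product, the gather x[col[j]] is determined by data rather than by loop structure, so designs built around “Efficiency = Locality” have little locality to exploit:

```c
#include <stdio.h>

/* Sparse matrix-vector product y = A*x, with A in CSR form.
 * The read x[col[j]] is the fine-grained irregular access:
 * col[] is data-dependent, so consecutive reads of x can land
 * anywhere in memory; locality-only optimization gains little. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int j = rowptr[r]; j < rowptr[r + 1]; j++)
            sum += val[j] * x[col[j]];   /* irregular gather */
        y[r] = sum;
    }
}

int main(void) {
    /* 3x3 example: [[2,0,1],[0,3,0],[4,0,5]] times x = [1,1,1] */
    int rowptr[]  = {0, 2, 3, 5};
    int col[]     = {0, 2, 1, 0, 2};
    double val[]  = {2, 1, 3, 4, 5};
    double x[]    = {1, 1, 1}, y[3];
    spmv_csr(3, rowptr, col, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* 3 3 9 */
    return 0;
}
```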

  5. More integration or more discrete parallel accelerators?
  Integrated parallelism: the CPU takes some silicon, but if the CPU count is low, it need not fall behind on peak; can do better on mixed serial/parallel tasks; less power.
  Discrete accelerators: dedicated hardware towards a specific app can be greatly optimized.

  6. (Approximate) chronology of the business competition: integrated vs. discrete solutions
  • 1990s. Integrated: ISA SIMD extensions (MMX). Discrete: CPU + graphics card. Winner: discrete.
  • 2000s. Integrated: ? Discrete: CPU + GPU. Winner: discrete.
  • 2010s. Integrated: integrated GPU (Iris/Pro/Plus, HD); winner in the low-/mid-end & mobile. OpenCL? Discrete: CPU + (“discrete”) GPU; still the winner in the high-end.
  • 2020s. Integrated: winner, unless… new apps. Accommodates both! Can matrix-mult-type suffice? Is the ball in the machine-learning court? For fun: contrast with Wall Street.

  7. OpenCL™ (Open Computing Language): “A multi-vendor open standard for general-purpose parallel programming of heterogeneous systems that include CPUs, GPUs, and other processors. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems, and handheld devices.” A more level playing field around general-purpose parallel programming?
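For readers who have not seen OpenCL, here is a minimal sketch of the model the quote describes: a single kernel source string, compiled at run time for whichever device (CPU or GPU) is present. It uses only standard OpenCL 1.x calls; error handling is omitted for brevity, so treat it as an outline rather than production code:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Kernel source, compiled at run time for the device found below:
 * this late binding is what makes the code portable across vendors. */
static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 4 };
    float a[N] = {1, 2, 3, 4}, b[N] = {10, 20, 30, 40}, c[N];

    /* Pick the first platform and its default device (CPU or GPU). */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel from source for this particular device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Copy inputs in, run N work-items, read the result back. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    for (int i = 0; i < N; i++) printf("%g ", c[i]);  /* 11 22 33 44 */
    printf("\n");
    return 0;
}
```

On a typical Linux setup with an OpenCL ICD loader installed, this builds with, e.g., `cc vadd.c -lOpenCL`.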

  8. Anticipate mounting pressure towards support of:
  • Ease of programming (math induction?)
  • Fine-grained irregular parallel programs (GPU papers…)
  • Parallel (PRAM) algorithms
  • Teaching every CS major some notion of parallelism
  Time will tell whether, how, and when this will unfold. Wild card: competition in the CPU space. However, I foresee a strong XMT/PRAM trajectory.

  9. Concrete direction: GPU/OpenCL application analysis in Intel VTune Amplifier
  • https://software.intel.com/en-us/vtune-amplifier-help-gpu-opencl-application-analysis-view
