
Keynote: Parallel Programming for High Schools

Keynote: Parallel Programming for High Schools. Uzi Vishkin, University of Maryland Ron Tzur, Purdue University David Ellison, University of Maryland and University of Indiana George Caragea, University of Maryland CS4HS Workshop, Carnegie-Mellon University, July 26, 2009.


Presentation Transcript


  1. Keynote: Parallel Programming for High Schools Uzi Vishkin, University of Maryland Ron Tzur, Purdue University David Ellison, University of Maryland and University of Indiana George Caragea, University of Maryland CS4HS Workshop, Carnegie-Mellon University, July 26, 2009

  2. Why are we here? • What literacy in CS means is being updated to include Parallel Algorithmic Thinking (PAT)

  3. Goals • Nurture your: • Sense of urgency about the shift to parallel within computational thinking • Sense of PAT and of potential student understandings • Confidence, competence, and enthusiasm in your ability to take on the challenge of promoting PAT in your students • At the end we hope you’ll say: • I understand, I want to do it, I can, and I know it will not happen without me (“irreplaceable member of the Jury”)

  4. Outline (RT) • Intro: What’s all the fuss about parallelism? (UV) • Teaching-learning activities in XMT (DE) • A teacher’s voice (1): It’s the future … & teachable (ST) • PAT Module: goals, plan, hands-on pedagogy, learning theory (RT) • A teacher’s voice (2): XMT Approach/content (ST) • Hands-on: The Merge-Sort Problem (UV) • A teacher’s voice (3): To begin PAT, use XMTC (ST) • How to (start)? (GC) • Q & A (All, 12 min)

  5. Intro: Commodity Computer Systems (UV) • Serial general-purpose computing: • 1946–2003: 5KHz → 4GHz • 2004: clock frequency growth turns flat • 2004 onward • Parallelism: ‘only game in town’ • If you want your program to run significantly faster … you’re going to have to parallelize it • General-purpose computing goes parallel • #Transistors/chip, 1980–2011: 29K → 30B! • #”cores”: ~d^(y-2003) [Intel Platform 2015, March ’05]

  6. Intro: Commodity Computer Systems • 40 Years of Parallel Computing • Never a successful general-purpose parallel computer (easy to program & good speedups) • Grade from NSF Blue-Ribbon Panel on Cyberinfrastructure: ‘F’ !!! “Programming existing parallel computers is as intimidating and time consuming as programming in assembly language”

  7. Intro: Second Paradigm Shift - Within Parallel • Existing parallel paradigm: “Decomposition-First” • Too painful to program • Needed paradigm: express only “what can be done in parallel” • Natural (parallel) algorithm: Parallel Random-Access Model (PRAM) • Build both machine (HW) and programming (SW) around this model
[Figure: Serial paradigm vs. natural (parallel) paradigm. Serial: one operation per time step, so Time = Work (total #ops). Parallel: at each step, do “what could I do in parallel at each step assuming unlimited hardware”, so Time << Work.]

  8. Middle School Summer Camp Class Picture, July’09 (20 of 22 students)

  9. Demonstration: Exchange Problem (DE) How to exchange the contents of memory locations A & B? Let’s refer to input row I as our input state, where the values are currently A=2 and B=5. How can we direct the computer one operation at a time and create a serial algorithm? Let’s try! Hint: operations include assigning values.

  10. Let’s Look at our First Step: [ X:=A ]

  11. Let’s Look at our Second Step: [ A:=B ]

  12. Let’s Look at our Third Step: [ B:=X ]
• Our first algorithm* and pseudo programming code:
• X:=A
• A:=B
• B:=X
• *Serial exchange: 3 steps, 3 operations, 1 working memory space
How many steps? 3. How many operations? 3. What’s the connection between the number of steps & operations? Equal. How much working memory space consumed? 1 space.
Hands-on challenge: Can we exchange the contents of A (=2) and B (=5) in fewer steps?
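A minimal C rendering of the three-step serial exchange (XMTC programs are C programs plus parallel constructs, so the serial version is plain C; the values follow the slide):

    #include <stdio.h>

    int main(void) {
        int A = 2, B = 5, X;
        X = A;                          /* step 1: save A's value */
        A = B;                          /* step 2: overwrite A */
        B = X;                          /* step 3: move the saved value into B */
        printf("A=%d B=%d\n", A, B);    /* prints A=5 B=2 */
        return 0;
    }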

  13. What is the hint in this figure?

  14. First Step in a Parallel Algorithm: [ X:=A and simultaneously Y:=B ] Can you anticipate the next step?

  15. Second Step in a Parallel Algorithm
• Step 1: X:=A and Y:=B
• Step 2: A:=Y and simultaneously B:=X
How many steps? 2. How many operations? 4. How much working memory space consumed? 2.
Can you make any generalizations with respect to serial and parallel problem solving? Parallel algorithms tend to involve fewer steps, but may cost more operations and may consume more working memory.
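A sketch of how these two simultaneous assignments could be written in XMTC, using the spawn and $ constructs that appear on slide 17 (the two-way branch on the thread ID $ is our illustration, not from the slides):

    int A = 2, B = 5, X, Y;             /* globals, visible to all threads */

    int main(void) {
        spawn(0, 1) {                   /* step 1: two threads run at once */
            if ($ == 0) X = A; else Y = B;
        }
        spawn(0, 1) {                   /* step 2 */
            if ($ == 0) A = Y; else B = X;
        }
        return 0;
    }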

  16. Array Exchange: A and B as arrays with indices 0–9 and input state as shown. Using a single working memory space X, devise an algorithm to exchange the contents of cells with the same index (e.g., replace A[0]=22 with B[0]=12). Consider number of steps and operations.
Serial algorithm:
For i=0 to 9 Do
X:=A[i]
A[i]:=B[i]
B[i]:=X
end
(Step 1: X:=A[0]; Step 2: A[0]:=B[0]; Step 3: B[0]:=X; Step 4: X:=A[1]; …)
How many steps needed to complete the exchange? 30. How many operations? 30. How much working memory space? 1.
Your homework asks for the general case of arrays A and B of length n.
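The same serial loop as C code (a direct transcription of the pseudocode: 3n steps, 3n operations, one temporary):

    void serial_array_exchange(int *A, int *B, int n) {
        for (int i = 0; i < n; i++) {
            int X = A[i];               /* step 3i+1 */
            A[i] = B[i];                /* step 3i+2 */
            B[i] = X;                   /* step 3i+3 */
        }
    }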

  17. Array Exchange Problem: Can you parallelize it?
• Step 1: X[0-9]:=A[0-9]
• Step 2: A[0-9]:=B[0-9]
• Step 3: B[0-9]:=X[0-9]
Parallel algorithm:
For i=0 to n-1 pardo
X(i):=A(i)
A(i):=B(i)
B(i):=X(i)
end
XMTC program:
spawn(0,n-1) {
var x
x:=A($); A($):=B($); B($):=x
}
How many steps? 3. How many operations? 30. How much working memory space consumed? 10.
For the general case of arrays A and B of length n: 3 steps, 3n operations, n spaces.
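For readers who want to run this, here is our rendering in the C-based syntax of the XMTC tutorial linked on slide 34 ($ is the thread index; arrays A and B are assumed declared globally; each of the n threads gets its own temporary, for n spaces in all). Treat it as a sketch and check the tutorial for exact syntax:

    spawn(0, n - 1) {
        int x;                          /* per-thread temporary */
        x = A[$];
        A[$] = B[$];
        B[$] = x;
    }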

  18. Array Exchange Algorithm: A highly parallel approach
• Step 1: X[0-9]:=A[0-9] and Y[0-9]:=B[0-9]
• Step 2: A[0-9]:=Y[0-9] and B[0-9]:=X[0-9]
For i=1 to n pardo
X(i):=A(i) and Y(i):=B(i)
A(i):=Y(i) and B(i):=X(i)
end
How many steps? 2. How many operations? 40. How much working memory space consumed? 20.
For the general case of arrays A and B of length n: 2 steps, 4n operations, 2n spaces.
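A hedged XMTC-style sketch of this 2-step version: spawn 2n threads per step so that all 2n copies of step 1 (and then of step 2) happen simultaneously (the thread-ID arithmetic is our illustration; arrays assumed global):

    spawn(0, 2*n - 1) {                 /* step 1: 2n concurrent copies */
        if ($ < n) X[$] = A[$];
        else       Y[$ - n] = B[$ - n];
    }
    spawn(0, 2*n - 1) {                 /* step 2 */
        if ($ < n) A[$] = Y[$];
        else       B[$ - n] = X[$ - n];
    }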

  19. Intro: Second Paradigm Shift (cont.) • Late 1970s THEORY • Figure out how to think algorithmically in parallel • Huge success. But • 1997 Onward: PRAM-On-Chip @ UMD • Derive specs for architecture; design and build • Above premises contrasted with: • “Build-first, figure-out-how-to-program-later” approach J. Hennessy: “Many of the early [parallel] ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use”

  20. Pre Many-Core Parallelism: Three Thrusts Improving single-task completion time for general-purpose parallelism was not the main target of parallel machines. 1. Application-specific (e.g., computer graphics): a limiting origin • GPUs: great performance, if you figure out how 2. Parallel machines for high throughput (of serial programs) 3. Only choice for “HPC”: language standards, but many issues (the ‘F’!) • HW designers (who dominate vendors): YOU figure out how to program (their machines) for locality.

  21. Pre Many-Core Parallelism: 3 Thrusts (cont.) • Currently, how the future computer will look is unknown • SW vendor impasse: what can a non-HW entity do without ‘betting on the wrong horse’? • Needed - a successor to the Pentium for the multi-core era that: • Is easy to program (hence, learning – hence, teaching) • Gives good performance with any amount of parallelism • Supports application programming (VHDL/Verilog, OpenGL, MATLAB) AND performance programming • Fits current chip technology and scales with it (particularly – strong speed-ups for single-task completion time) • Hindsight is always 20/20: • Should have used the benchmark of Programmability → TEACHABILITY !!!

  22. Pre Many-Core Parallelism: 3 Thrusts (cont.) • PRAM algorithmic theory • Started with a clean-slate target: • Programmability • Single-task completion time for general-purpose parallel computing • Currently: the theory common to all parallel approaches • Necessary level of understanding parallelism; as simple as it gets; ahead of its time: avant-garde • 1990s common wisdom (LogP): never implementable • UMD built: eXplicit Multi-Threaded (XMT) parallel computer • 100x speedups for 1000 processors on chip • XMTC – programming language • Linux-based simulator – download to any machine • Most importantly: TAUGHT IT • Graduate → seniors → freshmen → high school → middle school • Reality check: the human factor: YOU, Teachers → Students

  23. One Teacher’s Voice (RT) • Mr. Shane Torbert (could not join us - his sister is getting married!) • Thomas Jefferson (TJ) High School • Two years of trial • Interview question: Why did you give Vishkin’s XMT a try? • Observe video segment #1: http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/01 Shane Why XMT.m4v (requires iTunes or another m4v player)

  24. Summary of Shane’s Thesis: It’s the Future … and Teachable !!!

  25. Teaching PAT with XMT-C • Overarching goal: • Nurture a (50-year) generation of CS enthusiasts ready to think/work in parallel (programmers, developers, engineers, theoreticians, etc.) • Module goals for student learning • Understand what parallel algorithms are • Understand the differences, and links, between parallel and serial algorithms (serial as a special case of parallel - single processor) • Understand and master how to: • Analyze a given problem into the shortest sequence of steps within which all possible concurrent operations are performed • Program (code, run, debug, improve, etc.) parallel algorithms • Understand and use measures of algorithm efficiency • Run-time • Work – distinguish number of operations vs. number of steps • Complexity

  26. Teaching PAT with XMT-C (cont.) • Objectives - students will be able to: • Program parallel algorithms (that run) in XMTC • Solve general-purpose, genuine parallel problems • Compare and choose the best parallel (and serial) algorithms • Explain why an algorithm is serial/parallel • Propose and execute reasoned improvements to their own and/or others’ parallel algorithms • Reason about the correctness of algorithms: why does an algorithm provide a solution to a given problem?

  27. Hands-On: The Bill Gates Intro Problem (from Baltimore Polytechnic Institute) • Please form small groups (3-4) • Consider Bill Gates, the richest person on earth • Well, he can hire as many helpers as he wants for any task in his life … • Suggest an algorithm to accomplish the following morning tasks (listed in no particular order) in the least number of steps, then go out to work:
Start = Mr. Gates in pajama gown
Fasten belt; Put on left sock; Put on underpants; Put on shirt; Put on right shoe; Put on right sock; Remove pajama gown; Put on undershirt; Tuck shirt into pants; Put on left shoe; Put on pants

  28. 10-Year Old Solves Bill Gates … • Play tape

  29. A Solution for Bill Gates
Input state I: Gates in pajama gown
Step 1: Remove pajama gown; put on underpants; put on right sock; put on left sock; put on undershirt
Step 2: Put on pants
Step 3: Put on shirt; tuck shirt into pants
Step 4: Put on right shoe; put on left shoe
Step 5: Fasten belt
Moral: Parallelism introduces both constraints and opportunities. Constraints: we can’t just assume we can accomplish everything at once! Opportunities: parallel can be much faster than serial! [5 parallel steps versus 11 serial steps]
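The general rule behind such a schedule: a task’s earliest step is 1 plus the latest step among its prerequisites, so the minimum number of parallel steps equals the longest dependency chain. A hedged C sketch (the task list is from the slide; the dependency pairs are our reading of the problem, not given on the slide):

    #include <stdio.h>

    #define NTASKS 11
    const char *name[NTASKS] = {
        "remove pajama gown", "put on underpants", "put on undershirt",
        "put on left sock", "put on right sock", "put on pants",
        "put on shirt", "tuck shirt into pants", "put on left shoe",
        "put on right shoe", "fasten belt"
    };
    int edges[][2] = {                     /* {before, after} pairs */
        {0,1},{0,2},{0,3},{0,4},           /* gown off first            */
        {1,5},{2,6},{5,7},{6,7},           /* pants & shirt, then tuck  */
        {3,8},{5,8},{4,9},{5,9},           /* sock & pants before shoe  */
        {7,10}                             /* tuck before belt          */
    };

    int main(void) {
        int step[NTASKS], nedges = (int)(sizeof edges / sizeof edges[0]);
        for (int i = 0; i < NTASKS; i++) step[i] = 1;
        /* Relax to a fixed point: step[v] = 1 + max step of v's prereqs.
           Fine for a tiny acyclic task graph like this one. */
        for (int changed = 1; changed;) {
            changed = 0;
            for (int e = 0; e < nedges; e++) {
                int u = edges[e][0], v = edges[e][1];
                if (step[v] < step[u] + 1) { step[v] = step[u] + 1; changed = 1; }
            }
        }
        int depth = 0;
        for (int i = 0; i < NTASKS; i++) {
            if (step[i] > depth) depth = step[i];
            printf("step %d: %s\n", step[i], name[i]);
        }
        printf("%d parallel steps vs. %d serial steps\n", depth, NTASKS);
        return 0;
    }

With this dependency list the critical path is 5 steps, matching the slide’s count (individual groupings may differ from the slide’s).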

  30. Pedagogical Considerations (1) • In your small groups discuss: How might solving the Bill Gates problem help students in learning PAT? Will you use it as an intro to a PAT module? Why? • Be ready to share your ideas with the whole group • Whole group discussion of Bill Gates problem to initiate PAT

  31. A Brain-based Learning Theory • Understanding: anticipation and reasoning about invariant relationship between activity and its effects (AER) • Learning: transformation in such anticipation, commencing with available and proceeding to intended • Mechanism: Reflection (two types) on activity-effect relationship (Ref*AER) • Type-I: comparison between goal and actual effect • Type-II: comparison across records of experiences/situations in which AER has been used consistently • Stages: • Participatory (provisional, oops), Anticipatory (transfer enabling, succeed) • For more, see www.edci.purdue.edu/faculty_profiles/tzur/index.html

  32. A Teacher’s Voice: XMT Approach/Content • Pay attention to his emphasis on student development of anticipation of run-time using “complexity” analysis (deep level of understanding even for serial thinking) • Play video segments #2 and #3 (5:30 min) http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/02 Shane Ease of Use.m4v http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/03 Shane Content Focus.m4v • Shane’s suggested first trial with teaching this material: - Where: your CS AP class (you most likely ask when …) - When: Between the AP exam and the end of the school year.

  33. PAT Module Plan • Intro Tasks: Create informal algorithmic solutions for problems students can relate to; parallelize • Bill Gates; way out of a maze; train a dog to fetch a ball; standing blindfolded in line; the toddler problem; building a sand castle; etc. • Discussion: • What is serial? Parallel? How do they differ? Advantages and disadvantages of both (tradeoffs)? Steps vs. operations? Breadth-first vs. depth-first searches? • Establish XMT environment: • Installation (Linux, simulator) • Programming syntax (Logo? C? XMT-C?) – “Hello World” and beyond … • Algorithms for Meaningful Problems • For each problem: create parallel and serial algorithms that solve it; analyze and compare them (individually, in pairs, small groups, whole class) • Revisit discussion of how serial and parallel differ

  34. Problem Sequence • Exchange problems • Ranking problems • Summation and Prefix-Sums (application – compaction) • Matrix multiplication problems • Sorting problems (including merge-sort , integer-sort and sample-sort) • Selection problems (finding the median) • Minimum problems • Nearest-one problems • See also: • www.umiacs.umd.edu/users/vishkin/XMT/index.shtml • www.umiacs.umd.edu/users/vishkin/XMT/index.shtml#tutorial • www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html • www.umiacs.umd.edu/users/vishkin/XMT/teaching-platform.html

  35. PRAM-On-Chip Silicon: 64-processor, 75MHz prototype FPGA prototype built: n=4, #TCUs=64, m=8, 75MHz. The system consists of 3 FPGA chips: 2 Virtex-4 LX200 & 1 Virtex-4 FX100 (Thanks, Xilinx!) [Figure: block diagram of XMT]

  36. Some experimental results (UV)
Comparison platform: AMD Opteron 2.6 GHz, RedHat Linux Enterprise 3, 64KB+64KB L1 cache, 1MB L2 cache (none in XMT), memory bandwidth 6.4 GB/s (2.67X that of XMT). M-Mult was 2000x2000; QSort was 20M. XMT enhancements: broadcast, prefetch + buffer, non-blocking store, non-blocking caches.
XMT wall-clock time (in seconds):
App.    XMT Basic   XMT     Opteron
M-Mult  179.14      63.7    113.83
QSort   16.71       6.59    2.61
Assume an (arbitrary yet conservative) ASIC XMT: 800MHz and 6.4GB/s; reduced bandwidth to .6GB/s and projected back by the 800/75 clock ratio.
XMT projected time (in seconds):
App.    XMT Basic   XMT     Opteron
M-Mult  23.53       12.46   113.83
QSort   1.97        1.42    2.61
• Simulation of 1024 processors: 100X on standard benchmark suite for VHDL gate-level simulation [Gu-V06]
• Silicon area of 64-processor XMT: same as 1 commodity processor (core)

  37. Hands-On Example: Merging
• Input: Two arrays A[1..n], B[1..n]; elements drawn from a totally ordered domain S; each array is monotonically non-decreasing
• Merging task (output): map each of these elements into a monotonically non-decreasing array C[1..2n]
Example:
A = 4 6 8 9 16 17 18 19 20 21 23 25 27 29 31 32
B = 1 2 3 5 7 10 11 12 13 14 15 22 24 26 28 30
Serial merging algorithm SERIAL-RANK(A[1..n]; B[1..n]): starting from A(1) and B(1), in each round:
• Compare an element from A with an element of B
• Determine the rank of the smaller among them
Complexity: O(n) time (hence, also O(n) work)
Hands-on: How will you parallelize this algorithm?
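A plain-C version of the serial merge, 0-indexed (a direct transcription; the rank of each element is implicit in the output position it receives):

    /* Merge sorted A[0..n-1] and B[0..n-1] into C[0..2n-1]: O(n) time. */
    void serial_merge(const int *A, const int *B, int *C, int n) {
        int i = 0, j = 0, k = 0;
        while (i < n && j < n)             /* each round ranks the smaller */
            C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
        while (i < n) C[k++] = A[i++];     /* copy whatever remains */
        while (j < n) C[k++] = B[j++];
    }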

  38. Partitioning Approach
Input size for a problem = n. Design a 2-stage parallel algorithm:
• Partition the input in each array into a large number, say p, of independent small jobs
• Size of the largest small job is roughly n/p
• Actual work - do the small jobs concurrently, using a separate (possibly serial) algorithm for each
“Surplus-log” parallel algorithm for Merging/Ranking:
for 1 ≤ i ≤ n pardo
• Compute RANK(A(i), B) using standard binary search
• Compute RANK(B(i), A) using binary search
Complexity: W = O(n log n), T = O(log n)
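A hedged XMTC-style sketch of the surplus-log algorithm (0-indexed; distinct elements and global arrays assumed; treat the syntax as approximate). Each of the 2n threads binary-searches the other array, and all searches run concurrently, giving T = O(log n) and W = O(n log n). An element’s final slot in C is its own index plus its rank in the other array:

    spawn(0, n - 1) {                  /* rank each A[$] in B */
        int lo = 0, hi = n;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (B[mid] < A[$]) lo = mid + 1; else hi = mid;
        }
        C[$ + lo] = A[$];              /* lo = # of B-elements below A[$] */
    }
    spawn(0, n - 1) {                  /* rank each B[$] in A */
        int lo = 0, hi = n;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (A[mid] < B[$]) lo = mid + 1; else hi = mid;
        }
        C[$ + lo] = B[$];
    }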

  39. Middle School Students Experiment with Merge/Rank

  40. Linear work parallel merging: using a single spawn
Stage 1 of algorithm: Partitioning [for p ≤ n/log n and p | n]
for 1 ≤ i ≤ p pardo
• b(i) := RANK(A((i-1)n/p + 1), B) using binary search
• a(i) := RANK(B((i-1)n/p + 1), A) using binary search
Stage 2 of algorithm: Actual work
Observe: the overall ranking task is broken into 2p independent “slices”.
Example of a slice: start at A((i-1)n/p + 1) and B(b(i)). Using serial ranking, advance until the termination condition: either the next selected element of A or the next selected element of B loses.
Parallel program: 2p concurrent threads using a single spawn-join for the whole algorithm.
Example - Thread of 20: binary search B; rank of 20 is 11 (the index of 15 in B) + 9 (the index of 20 in A). Then: compare 21 to 22 and rank 21; compare 23 to 22 to rank 22; compare 23 to 24 to rank 23; compare 24 to 25, but terminate since the thread of 24 will rank 24.

  41. Linear work parallel merging (cont’d)
Observation: 2p slices; none has more than 2n/p elements (not too bad, since the average is 2n/2p = n/p elements).
Complexity: Partitioning takes W = O(p log n) and T = O(log n) time - i.e., O(n) work and O(log n) time for p ≤ n/log n. The actual work employs 2p serial algorithms, each taking O(n/p) time. Total: W = O(n) and T = O(n/p), for p ≤ n/log n.
IMPORTANT: Correctness & complexity of parallel programs: same as for the algorithm. This is a big deal. Other parallel programming approaches do not have a simple concurrency model, and need to reason w.r.t. the program.

  42. A Teacher’s Voice: Start PAT with XMT • Observe Shane’s video segment #4 http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/04 Shane Word to Teachers.m4v

  43. How to (start)? (GC) • Contact us! ! ! • Observe online teaching sessions (more to be added soon) • Contact us • Download and install simulator • Read manual • Google XMT or www.umiacs.umd.edu/users/vishkin/XMT/index.shtml • Solve a few problems on your own • Try programming a parallel algorithm in XMTC for prefix-sums … • Contact us • Follow teaching plan (slides #29-30) • Did we already say: CONTACT US ?!?! (entire team waiting for your call …)
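If you try the suggested prefix-sums exercise, here is a hedged sketch of its simplest relative, parallel summation, using the ps prefix-sum primitive as we understand it from the XMTC tutorial (psBaseReg and the ps statement are our recollection of that documentation; check the manual for exact syntax before relying on this):

    #include <stdio.h>
    #define N 1000

    psBaseReg sum;                 /* global prefix-sum base register */
    int A[N];

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1;    /* test data */
        sum = 0;
        spawn(0, N - 1) {
            int e = A[$];
            ps(e, sum);            /* atomically: e gets old sum; sum += A[$] */
        }
        printf("total = %d\n", sum);             /* expect 1000 */
        return 0;
    }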

  44. ???

  45. Additional Intro Problems • See next slides

  46. The Year-Old Toddler Problem
Initial input state: Sleeping toddler in crib. End: Toddler ready to go to daycare …
Analyze for parallelism: steps and operations. Tasks (in no particular order):
• Prepare mush
• Pack toddler’s lunch
• Bring toddler to kitchen
• Put on coat
• Put on right sock
• Wake up toddler and take out of crib
• Put on shirt
• Tuck shirt into pants
• Put on left sock
• Put on pants
• Put on left shoe
• Put on right shoe
• Remove pajamas
• Put toddler in high chair and spoon mush
• Take toddler to car and put into car seat
• Remove diaper, clean bottom, and put on clean diaper (must be done after mushing)

  47. How can we direct the computer to search this maze and help the cat get to the milk? A parallel algorithm (depth-first search?)
We might imagine locations in the maze that force a decision and call these junctions. We might say the computer is forced to make a decision at junction A, and it progresses in a left-handed fashion to B… until it reaches a blockage at C… and must return to B… and proceed to the next junction D, but that returns to itself…
[Figure: maze with junctions A-H; the search path is labeled with moves such as “Back to A”, “Over to E”, “Over to F”, “Over to G”, “Over to H”, and “Back to B”.]

  48. Back-up slide: FPGA 64-processor, 75MHz prototype
Specs and aspirations:
• Multi-GHz clock rate
• FPGA prototype built: n=4, #TCUs=64, m=8, 75MHz
• The system consists of 3 FPGA chips: 2 Virtex-4 LX200 & 1 Virtex-4 FX100 (Thanks, Xilinx!)
[Figure: block diagram of XMT]
• Cache coherence defined away: local cache only at the master thread control unit (MTCU)
• Prefix-sum functional unit (F&A-like) with global register file (GRF)
• Reduced global synchrony
• Overall design idea: no-busy-wait FSMs
