A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications

Hong Wang & Robert A. Walker, Computer Science Department, Kent State University, Kent, OH 44242 USA


Presentation Transcript


  1. A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications — Hong Wang & Robert A. Walker, Computer Science Department, Kent State University, Kent, OH 44242 USA

  2. Outline of Talk • SIMD Associative Computing • Associative Search & Associative Processing • PE Interconnection Network • Multiple Instruction Streams • ASC Processor (Work Mostly Complete) • Pipelined Architecture • Reconfigurable PE Interconnection Network • Processor and Network Performance • MASC Architecture (Work in Progress) • Implementation of Task Manager and Instruction Stream • Sample Code • Architecture and Sample Execution • Sample Application • String Matching • Conclusion

  3. Associative Computing • Tabular data, with cells referenced by content = Associative Search • On a successful search, the cell is flagged + Associative Processing • Flagged cells are processed further • SIMD Associative Computing: each memory cell uses an associative processing element (APE) to search concurrently (Figure: an Associative SIMD Array, with an Assoc. Control Unit (CU) driving a column of Memory/APE cells)

  4. Associative Search • Find all “Ford” cars for sale • Associative PEs (APEs) search for a key, and those that find it are flagged as responders (R)

   APE   Ford       Taurus    $22,000  Kent    R
   APE   Chevrolet  Malibu    $20,000  Akron   R
   APE   Ford       Taurus    $18,000  Akron   R
   APE   Ford       Focus     $14,000  Kent    R
   APE   Jeep       Wrangler  $25,000  Akron   R

  5. Associative Processing • The responders can be processed further: one at a time, all sequentially, or all in parallel, as needed

   APE   Ford       Taurus    $22,000  Kent    R
   APE   Chevrolet  Malibu    $20,000  Akron
   APE   Ford       Taurus    $18,000  Akron   R
   APE   Ford       Focus     $14,000  Kent    R
   APE   Jeep       Wrangler  $25,000  Akron
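The two slides above can be sketched in plain Python (a minimal model of the programming style, not the FPGA hardware): every APE compares its record against the broadcast key in one conceptual step, and only the flagged responders take part in further processing.

```python
# Each tuple is one APE's memory cell in the car-sale example above.
cars = [
    ("Ford",      "Taurus",   22000, "Kent"),
    ("Chevrolet", "Malibu",   20000, "Akron"),
    ("Ford",      "Taurus",   18000, "Akron"),
    ("Ford",      "Focus",    14000, "Kent"),
    ("Jeep",      "Wrangler", 25000, "Akron"),
]

# Associative search: every cell sets its responder flag at once.
responders = [make == "Ford" for (make, model, price, city) in cars]

# Associative processing: responders are processed further
# (here, all in parallel, to find the cheapest "Ford" for sale).
cheapest = min(rec[2] for rec, flag in zip(cars, responders) if flag)
print(cheapest)   # -> 14000
```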

  6. PE Interconnection Network • ASC: Associative SIMD Array with PE Network (Figure: the Assoc. Control Unit (CU) and a column of Memory/APE cells, now linked by a PE interconnection network)

  7. PE Interconnection Network • MASC: ASC + Multiple Control Units / Instruction Streams (Figure: several Assoc. Control Units (CUs), joined by a CU Interconnection Network, each driving its own group of Memory/APE cells)

  8. Outline of Talk • SIMD Associative Computing • Associative Search & Associative Processing • PE Interconnection Network • Multiple Instruction Streams • ASC Processor (Work Mostly Complete) • Pipelined Architecture • Reconfigurable PE Interconnection Network • Processor and Network Performance • MASC Architecture (Work in Progress) • Implementation of Task Manager and Instruction Stream • Sample Code • Architecture and Sample Execution • Sample Application • String Matching • Conclusion

  9. ASC Processor’s Pipelined Architecture • We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs • Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs • In the Control Unit • Instruction Fetch (IF) • Part of Instruction Decode (ID) • In the Scalar PE (SPE) and in each Parallel PE (PPE) • Rest of Instruction Decode (ID) • Execute (EX) • Memory Access (MEM) • Data Write Back (WB)

  10. Pipelined ASC Processor with Reconfigurable Interconnection Network (Figure: the Control Unit (CU) holds the Instruction Memory, IF/ID Latch, and Decoder; the Sequential PE (SPE) and the Parallel PE (PPE) Array each hold a Register File, the ID/EX, EX/MEM, and MEM/WB latches, and Data Memory, fed by Broadcast Register Data and Immediate Data from the CU)

  11. Processing Element (PE) • Comparator implements the associative search, pushing a ‘1’ onto the top of the mask stack for responders and a ‘0’ otherwise • A ‘0’ on top of the mask stack disables the ID/EX Latch (Figure: PE datapath with Mask stack, Comparator, Data Memory, Data Switch, Register File, MUX, and the ID/EX, EX/MEM, and MEM/WB latches)
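The mask-stack behavior above can be sketched in plain Python (an assumed simplification of the PE datapath, not the RTL): the comparator pushes a responder flag onto each PE's mask stack, and a 0 on top of the stack keeps that PE from executing.

```python
class PE:
    """Toy processing element with a value and a mask stack."""
    def __init__(self, value):
        self.value = value
        self.mask = [1]            # mask stack; top entry gates execution

    def compare(self, key):
        # Comparator: push '1' for a responder, '0' otherwise
        self.mask.append(1 if self.value == key else 0)

    def execute(self, op):
        # A '0' on top of the mask stack disables the execute stage
        if self.mask[-1]:
            self.value = op(self.value)

pes = [PE(v) for v in (3, 7, 3)]
for pe in pes:
    pe.compare(3)                  # associative search for the key 3
    pe.execute(lambda v: v + 1)    # only responders execute the add
print([pe.value for pe in pes])    # -> [4, 7, 4]
```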

  12. Pipelined ASC Processor’s Performance • Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs • Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz • Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors • With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz

  13. Reconfigurable PE Interconnection Network • Our pipelined ASC Processor also has a reconfigurable PE interconnection network • Reconfigurable PE network allows arbitrary PEs in the PE Array to be connected via • Linear array (currently implemented), or • 2D mesh (to be implemented soon) without the restriction of physical adjacency • Each PE in the PE Array can • Choose to stay in the PE interconnection network, or • Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication
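Bypass Mode, described above, can be modeled with a simple list (an assumed simplification, not the data-switch hardware): PEs that stay out of the network are skipped, so the remaining PEs behave as if they were physically adjacent.

```python
def left_neighbor(pe, in_network):
    """Index of the nearest non-bypassed PE to the left of `pe`, or None.
    `in_network[i]` is True if PE i chose to stay in the network."""
    for i in range(pe - 1, -1, -1):
        if in_network[i]:
            return i
    return None

# PEs 1 and 2 have chosen to stay out of the network (bypassed)
in_network = [True, False, False, True, True]
print(left_neighbor(3, in_network))   # -> 0: PE3's left neighbor is PE0
```

This also illustrates the performance issue noted on slide 16: with many bypassed PEs, the logical neighbor link can span a long physical path through the array.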

  14. Pipelined ASC Processor with Reconfigurable Interconnection Network (Figure: the same block diagram as slide 10, shown again to introduce the network hardware)

  15. Reconfigurable Network Implementation • Data switch • Passes register, broadcast, and immediate data to the PE and to its two neighbors • Routes data from the PE’s neighbors to its EX stage • Reconfigurable network supports a Bypass Mode to remove non-responder PEs from the network • Will be needed by the MASC Processor (Figure: the Data Switch between the Left Neighbor and Right Neighbor, fed by Register Data from the SPE and Immediate Data from the CU, and gated by the Top of Mask Stack into the Comparator & ID/EX Latch)

  16. ASC Processor’s Network Performance • Performance of ASC Processor degrades as number of PEs is increased with Bypass Mode present • Due to the long path from the first PE to the last PE in the PE array • 4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present • When the number of PEs is increased to 50, the clock frequency drops to 22 MHz • In the future we hope to reduce this delay using a pipelined or other multi-hop architecture

  17. Outline of Talk • SIMD Associative Computing • Associative Search & Associative Processing • PE Interconnection Network • Multiple Instruction Streams • ASC Processor (Work Mostly Complete) • Pipelined Architecture • Reconfigurable PE Interconnection Network • Processor and Network Performance • MASC Architecture (Work in Progress) • Implementation of Task Manager and Instruction Stream • Sample Code • Architecture and Sample Execution • Sample Application • String Matching • Conclusion

  18. Instruction Stream / Task Manager (Figure: state machines for the Instruction Stream and the Task Manager, with states including IDLE, Wait_For_IS, Task_Execution, Task_Allocation, Call_TM, and Join)

  19. MASC PE Structure (Figure: each PE contains an ID Register and an IS_TM_Chooser that selects among instruction streams IS1, IS2 and task managers TM1, TM2)

  20. Instruction Stream / Task Manager (Figure: the state machines from slide 18, annotated with the IS ID and TM ID registers used during Task_Execution, Task_Allocation, and Call_TM)

  21. Assembly Code Example

    101  Parallel_Select_Start Mem(110)
         Pcase Condition1 Mem(104)
         Pcase Condition2 Mem(107)
    104  Case1
         …
    106  Parallel_Case_End
    107  Case2
         …
    109  Parallel_Case_End
    110  Parallel_Select_End   (note: this instruction does not trigger the JOIN; the lack of remaining tasks does)

  22. (Figure: Task Managers TM0, TM1, TM2 and Instruction Streams IS0, IS1, IS2, any of which can attach to PEs PE0 through PE5)

  23. Originally, all PEs listen to IS0. (Figure: IS0 connected to PE0 through PE5; TM0, TM1, TM2 and IS1, IS2 idle)

  24. When the Parallel Select is met (101 Parallel_Select_Start Mem(110)), the Task Manager takes over the PEs. (Figure: TM0 now controls PE0 through PE5)

  25. The TM then calls IS0 to perform the 1st task (Pcase Condition1 Mem(104)); IS0 begins executing at 104 Case1, 105 … (Figure: IS0 dispatched to its subset of PEs)

  26. The TM then calls IS1 to perform the 2nd task (Pcase Condition2 Mem(107)); IS1 begins executing at 107 Case2, 108 … while IS0 continues at 104 Case1, 105 … (Figure: IS0 and IS1 each running a task on its subset of PEs)

  27. The 2nd task finishes (109 Parallel_Case_End) and gives control back to the TM, while IS0 continues the 1st task at 104 Case1, 105 … (Figure: IS1 returned to the TM)

  28. The 1st task finishes (106 Parallel_Case_End) and gives control back to the TM. (Figure: IS0 returned to the TM)

  29. Control is given back to the last finished IS, which is IS0; execution continues at 110 Parallel_Select_End. (Figure: all PEs listening to IS0 again)

  30. IS1 meets a nested parallel select. (Figure: while IS1 runs its subset of PEs, it reaches a nested Parallel Select and calls its Task Manager, TM1)

  31. TM1 allocates the two tasks to IS1 and IS2; a Common Register holds A = 2, and the two tasks execute B = A and C = A. (Figure: TM1 dispatching IS1 and IS2, which read A from the Common Register)
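The allocation step in the walkthrough above can be sketched in plain Python (an assumed simplification of the Task Manager, not the hardware): on a Parallel Select, the TM partitions the PEs by each Pcase condition and hands each partition to a free Instruction Stream; the JOIN then happens once every allocated task has finished, as slide 21 notes.

```python
def parallel_select(pes, cases, free_streams):
    """cases: list of (condition, task_label) pairs.
    Returns (stream, task_label, pe_subset) allocations."""
    allocations = []
    for condition, task in cases:
        subset = [pe for pe in pes if condition(pe)]   # PEs satisfying this Pcase
        if subset and free_streams:
            allocations.append((free_streams.pop(0), task, subset))
    return allocations   # the TM joins once every allocated task signals completion

# Hypothetical conditions standing in for Condition1 / Condition2
allocs = parallel_select(
    pes=[0, 1, 2, 3, 4, 5],
    cases=[(lambda pe: pe % 2 == 0, "Case1"), (lambda pe: pe % 2 == 1, "Case2")],
    free_streams=["IS0", "IS1"],
)
print(allocs)   # each case runs on its own instruction stream
```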

  32. Outline of Talk • SIMD Associative Computing • Associative Search & Associative Processing • PE Interconnection Network • Multiple Instruction Streams • ASC Processor (Work Mostly Complete) • Pipelined Architecture • Reconfigurable PE Interconnection Network • Processor and Network Performance • MASC Architecture (Work in Progress) • Implementation of Task Manager and Instruction Stream • Sample Code • Architecture and Sample Execution • Sample Application • String Matching • Conclusion

  33. Sample Application — String Matching • One of the most fundamental computing operations • Variable Length Don’t Care (VLDC) SIMD associative algorithm (Mary Esenwein 1997 and Ping Xu 2005) can find all instances of a pattern string within a larger text string • Exact-match version shown here • Extensions support single-character and variable-length “don’t cares” • Demonstrates associative search, associative computing (responder processing), and the linear PE interconnection network

  34. String Match by Associative Computing • Look for a match of pattern string AB in text string ABAA • Initialize variables as shown below • Note that “$” indicates a parallel variable

        text$  counter$  match$
    1     @        0        0     R
    2     A        0        0     R
    3     B        0        0     R
    4     A        0        0     R
    5     A        0        0     R
    patt_counter = 0    patt_length = 2    patt_string = AB

  35. String Match by Associative Computing • Responders are text$ == patt_string[j] and counter$ == patt_counter; (j points at ‘B’, the last character of patt_string)

        text$  counter$  match$
    1     @        0        0     R
    2     A        0        0     R
    3     B        0        0     R
    4     A        0        0     R
    5     A        0        0     R
    patt_counter = 0    patt_length = 2    patt_string = AB

  36. String Match by Associative Computing • Responders are text$ == patt_string[j] and counter$ == patt_counter; only cell 3 (text$ == ‘B’, counter$ == 0) responds

        text$  counter$  match$
    1     @        0        0
    2     A        0        0
    3     B        0        0     R
    4     A        0        0
    5     A        0        0
    patt_counter = 0    patt_length = 2    patt_string = AB

  37. String Match by Associative Computing • Responders add 1 to counter$ and send the result to counter$ of the preceding cell via the network; patt_counter++; (cell 2’s counter$ becomes 1, patt_counter becomes 1)

        text$  counter$  match$
    1     @        0        0
    2     A        1        0
    3     B        0        0     R
    4     A        0        0
    5     A        0        0
    patt_counter = 1    patt_length = 2    patt_string = AB

  38. String Match by Associative Computing • Responders are text$ == patt_string[j] and counter$ == patt_counter; (j now points at ‘A’)

        text$  counter$  match$
    1     @        0        0     R
    2     A        1        0     R
    3     B        0        0     R
    4     A        0        0     R
    5     A        0        0     R
    patt_counter = 1    patt_length = 2    patt_string = AB

  39. String Match by Associative Computing • Responders are text$ == patt_string[j] and counter$ == patt_counter; only cell 2 (text$ == ‘A’, counter$ == 1) responds

        text$  counter$  match$
    1     @        0        0
    2     A        1        0     R
    3     B        0        0
    4     A        0        0
    5     A        0        0
    patt_counter = 1    patt_length = 2    patt_string = AB

  40. String Match by Associative Computing • Responders add 1 to counter$ and send the result to counter$ of the preceding cell via the network; patt_counter++; (cell 1’s counter$ becomes 2, patt_counter becomes 2)

        text$  counter$  match$
    1     @        2        0
    2     A        1        0     R
    3     B        0        0
    4     A        0        0
    5     A        0        0
    patt_counter = 2    patt_length = 2    patt_string = AB

  41. String Match by Associative Computing • Responders are counter$ == patt_length;

        text$  counter$  match$
    1     @        2        0     R
    2     A        1        0     R
    3     B        0        0     R
    4     A        0        0     R
    5     A        0        0     R
    patt_counter = 2    patt_length = 2    patt_string = AB

  42. String Match by Associative Computing • Responders are counter$ == patt_length; only cell 1 (counter$ == 2) responds

        text$  counter$  match$
    1     @        2        0     R
    2     A        1        0
    3     B        0        0
    4     A        0        0
    5     A        0        0
    patt_counter = 2    patt_length = 2    patt_string = AB

  43. String Match by Associative Computing • Responders send a 1 to match$ of the next cell via the network; (cell 2’s match$ becomes 1)

        text$  counter$  match$
    1     @        2        0     R
    2     A        1        1
    3     B        0        0
    4     A        0        0
    5     A        0        0
    patt_counter = 2    patt_length = 2    patt_string = AB

  44. String Match by Associative Computing • Responders are match$ == 1; this indicates the cell(s) where the match of pattern string AB in text string ABAA begins

        text$  counter$  match$
    1     @        2        0
    2     A        1        1     R
    3     B        0        0
    4     A        0        0
    5     A        0        0
    patt_counter = 2    patt_length = 2    patt_string = AB
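The walkthrough on slides 34-44 can be sketched in plain Python (an assumed reconstruction of the exact-match case, not the cited VLDC implementation): the pattern is matched from its last character backwards, each responder sends counter$ + 1 to the preceding cell over the linear PE network, and a cell whose counter$ reaches patt_length flags match$ in the next cell. The '@' sentinel plays the role of cell 1 on the slides.

```python
def assoc_string_match(text, patt):
    n = len(text)
    counter = [0] * n                  # counter$ parallel variable
    match = [0] * n                    # match$ parallel variable
    patt_counter = 0
    for j in range(len(patt) - 1, -1, -1):      # last pattern character first
        # Associative search: flag responders in all cells at once
        responders = [i for i in range(n)
                      if text[i] == patt[j] and counter[i] == patt_counter]
        # Associative processing: responders send counter$ + 1 to the
        # preceding cell via the network
        nxt = counter[:]
        for i in responders:
            if i > 0:
                nxt[i - 1] = counter[i] + 1
        counter = nxt
        patt_counter += 1
    # Cells that counted the whole pattern flag match$ of the next cell
    for i in range(n):
        if counter[i] == len(patt) and i + 1 < n:
            match[i + 1] = 1
    return [i for i in range(n) if match[i]]

# Zero-based index 1 corresponds to cell 2 on the slides, where the match begins
print(assoc_string_match("@ABAA", "AB"))   # -> [1]
```

Because each text cell keeps its own counter$, overlapping occurrences are found in the same number of steps, independent of the text length.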

  45. Conclusion • We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing • Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs • Additional functionality is provided by a reconfigurable PE interconnection network • Future work will include • Support for multiple Control Units (in progress) • Performance improvement to support more efficient broadcast to a large number of PEs
