1 / 31

Implementing An Associative Processor on FPGAs

Implementing An Associative Processor on FPGAs. Memory. PE. Instruction Stream Control Unit. Memory. PE. Cell Interconnection Network. Memory. PE. Memory. PE. A Conceptual View of the KSU ASC Model. 1 Burgundy Focus OH 170 ……. PE0.

davida
Télécharger la présentation

Implementing An Associative Processor on FPGAs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Implementing An Associative Processor on FPGAs

  2. Memory PE Instruction Stream Control Unit Memory PE Cell Interconnection Network Memory PE Memory PE A Conceptual View of the KSU ASC Model

  3. 1 Burgundy Focus OH 170 ……. PE0 2 Blue Taurus OH 160 ……. PE1 3 Blue Focus OH 190 ……. PE2 4 Red Focus PA 180 ……. PE3 ID Color Model State Rebate An Example for the Data Memory Organization:Auto Information Stored in the PE Cells

  4. The Prototype of the Byte-serial ASC Processor IS Control Unit Associative Processing Array CPU (for Sequential and parallel instructions) PE Array 32-bit Instruction Memory Data Memory Responder Resolution Circuitry MAX/MIN Circuitry 16 8-bit Common Registers

  5. PE Cell 0 PE Cell 1 PE Cell 2 PE Cell 3 Data Memory0 Data Memory1 Data Memory2 Data Memory3 PE0 PE1 PE2 PE3 Responder Resolution Circuitry At_Least_One_Responder MAX/MIN Circuitry Prototype of the 4-PE Associative Processing Array

  6. 8-bit ALU 1-bit ALU CarryOut MUX MUX Logical Registers Common Registers General-Purpose Registers Comparator 16 8-bit General Purpose Registers 16 1-bit Logical Registers 16-deep 1-bit Mask Stack 1-bit Responder Register Find/Step /ResolveFirst A Processing Element Overview

  7. Instruction Set and Assembling Language (1) • Data Transfer Instructions - LD address, dstreg - LDI immediate, dstreg - LDRR srcreg, dstreg - LDRRSPD srcreg - ST srcreg, address • Arithmetic and Logical Instructions (mnemonic srcreg1, srcreg2, dstreg) • ADD SUB • AND OR XOR NOT • SLL SRL • SLT SLE SGT SGE SEQ SNE

  8. Instruction Set and Assembling Language (2) • Mask Stack and Responder Instructions • SETMSK • TOPMSK TOPMSKRSPD • POPMSK POPMSKRSPD • POPTHEM POPTHEMRSPD • RPCMSK RPCMSKRSPD • PUSHMSK PUSHMSKRSPD • PUSHTHEM • PUSHMSKTHEM • STKTOMEM MEMTOSTK • FIND • STEP • RESFST

  9. Instruction Set and Assembling Language (3) • Maximum and Minimum Searching Instructions • SETMXMI • LDMXMI • STMXMI • MAX • MIN • Branch/Jump Instructions • BNR • BRS • J

  10. Associative Operations Related PE Components: • The Responder Register: to indicate whether a PE is a responder to a particular associative search or not • The Step/Find/ResolveFirst Unit: to support processing multiple responders in various ways • The Mask Stack: to represent at most 16 levels of association. The top of the Mask Stack always represents the current status of the PE – whether it is masked (‘1’) or unmasked (‘0’)

  11. Example of Associative Search:Find all Focus cars located in Ohio • Perform the comparison: model == “Focus”, and store the result either ‘1’ or ‘0’ into $LR1 • Perform the comparison: location == “Ohio”, and store the result into $LR2 • AND $LR1 with $LR2, and store the result into the Responder Register (Note: all the instructions above performed by all PEs in parallel are called unmasked instructions)

  12. Unmasked and Masked Instructions • Unmasked Instruction: Executed by all the PEs regardless of the state of the Mask Stack • Masked Instruction: Executed only by those PEs with a ‘1’ on the top of their Mask Stack

  13. Example of Associative Search Using Masked Instructions:(Find all Focus cars located in Ohio ) • Initialize the top of the Mask Stack to ‘1’ • Perform the comparison: model == “Focus”, and store the result ‘1’ or ‘0’ into $LR1 • Perform the comparison: location == “Ohio”, and store the result into $LR2 • AND $LR1 with $LR2, and store the result into the Responder Register • AND the Responder Register with the top of the Mask Stack, and push the ANDing result into the Mask Stack and also store it into the Responder Register • Increase the rebate of all Focus cars in Ohio by 10 (masked instruction)

  14. Responder Resolution PE 0 From PE0 From PE1 From PE2 From PE3 R0 V0 R1 V1 R2 V2 R3 V3 V4 to PE0 to PE1 to PE2 to PE3 To CU Step /Find /RslvFst Mask Stack clr Responder MAX/MIN From PE0 : GPR RPD From PE1: GPR RPD From PE2: GPR RPD From PE3: GPR RPD to PE0 to PE1 to PE2 to PE3 D0 MM0 R0 D1 MM1 R1 D2 MM2 R2 D3 MM3 R3 General Purpose Registers The MAX/MIN Circuitry, the Responder Resolution Circuitry, and PE3

  15. Using the Falkoff Algorithm for MAX/MIN Search Maximum-Value Searching (the following steps are performed in parallel for all the data) • Search bit slices of the data from the most significant bit to the least significant bit: As each bit slice is processed, each bit is ANDed with a corresponding MM bit (a 1-bit register used to indicate whether or not a data item is the maximum after processing a bit) • Check the results of the AND to ensure that at least one new maximum value remains:

  16. Using the Falkoff Algorithm for MAX/MIN Search (continued) If this condition is true, then the MM bits are updated by the results of AND; if all the results are 0, then the MM bits are not updated at this time • Continue to process the remaining bit slices as above until all bits are processed • After the least significant bit slice is processed: If only one MM bit is ‘1’, it marks the largest number; if more than one MM bit is ‘1’, those data are tied for the maximum value

  17. Using the Falkoff Algorithm for MAX/MIN Search (continued) Minimum-Value Searching: • Similar to maximum value searching, but complement the bit slices each time before ANDing it with MM bits

  18. Search For the Maximum Rebate in the Data Memories Bit Slices (7..0) of Rebates Values in MM bits During Processing Process bit from MSB to LSB After processing each bit (rebate) 76543210 Initialize 7 6 5 4 3 2 1 0 (170) 10101010 (MM0) 1 1 1 1 0 0 0 0 0 (160) 10100000 (MM1) 1 1 1 1 0 0 0 0 0 (190) 10111110 (MM2) 1 1 1 1 1 1 1 1 1 (max) (180) 10110100 (MM3) 1 1 1 1 1 0 0 0 0

  19. OP Data 0 to RPD0 RPD0 Data 1 to RPD1 RPD1 Data 2 RDP 2 to RPD2 Data 3 RPD 3 to RPD3 Mask_W 8-bit shift register0 “not” MM0 8-bit shift register1 “not” MM1 8-bit shift register2 “not” MM2 8-bit shift register3 “not” MM3 MAX/MIN Circuit using the Falkoff Algorithm

  20. Responder Resolution PE 0 From PE0 From PE1 From PE2 From PE3 R0 V0 R1 V1 R2 V2 R3 V3 V4 to PE0 to PE1 to PE2 to PE3 To CU Step /Find /RslvFst Mask Stack clr Responder MAX/MIN From PE0 : GPR RPD From PE1: GPR RPD From PE2: GPR RPD From PE3: GPR RPD to PE0 to PE1 to PE2 to PE3 D0 MM0 R0 D1 MM1 R1 D2 MM2 R2 D3 MM3 R3 General Purpose Registers The MAX/MIN Circuitry, the Responder Resolution Circuitry, and PE3

  21. Functionality of Responder Resolution Circuit • Responder resolution: Send an At-Least-One-Responder signal to the IS control unit • Support responder selection: Send a corresponding Responder_Before_Me signal to each PE’s Find_ Step _ResolveFirst unit

  22. Responder Resolution Circuitry ‘0’ PE0 PE1 PE2 PE3 The Responder Resolution Circuitry for 4 PEs V0 R0 V1 R1 V2 V4 R2 V3 R3 R0 to R3 : from responder registers V0 to V3 : called Responder_Before_ME V4 : called At_Least_One_Responder

  23. Responder Processing • Process responders in parallel: • use masked instructions • Process responders sequentially: • Need some responder selection instructions • Need a responder selection mechanism

  24. Responder Selection Instructions • Step repetitively used to pick one responding PE each time for further processing – “ for” loop e.g., to step through all the Focus cars in Ohio to list the features available on each car • Find select a responding PE, while still keeping all responders identifiable – “ while” loop e.g., retrieve the tax rate from one of the cars located in OH, then increment the tax rate by a certain amount, afterwards apply this new tax rate to all the cars located in OH

  25. Responder Selection Instructions (continued) • ResolveFirst select a responder and only keep this responder identifiable e.g., resolve one PE from several PEs which have the values tied for the maximum value

  26. Responder Resolution PE 3 From PE0 From PE1 From PE2 From PE3 R0 V0 R1 V1 R2 V2 R3 V3 V4 to PE0 to PE1 to PE2 to PE3 To CU Step /Find /RslvFst Mask Stack clr Responder MAX/MIN From PE0 : GPR RPD From PE1: GPR RPD From PE2: GPR RPD From PE3: GPR RPD to PE0 to PE1 to PE2 to PE3 D0 MM0 R0 D1 MM1 R1 D2 MM2 R2 D3 MM3 R3 General Purpose Registers The Responder Resolution Circuitry, MAX/MIN Circuitry, and PE3

  27. Design Language: VHDL • A standard hardware description language used to model and design digital hardware - Support concurrent events - can be translated into hardware by some design tools • good for managing large design structures • Supported by many CAD tool and programmable logic vendors

  28. Design Verification Simulator Waveform Editor Timing Analysis Other Verification Tools Design Entry Graphic Editor Text Editor Waveform Editor Symbol Editor Floorplan Editor Other Design Entry Tools MAX+PLUS II Compiler Design Compilation Programmer Data I/O Other Programmers Device Programming Altera MAX+PLUS II Development System

  29. FLEX10K70 Device: - 3,744 LEs - 9 EABs - 70,000 gates totally Altera FLEX 10K FPLD ( IOEs – I/O elements) Partial FLEX10K20 FPLD Architecture IOEs IOEs IOEs IOEs IOEs IOEs LAB EAB LAB FastTrack Interconnect IOEs IOEs EAB LAB LAB IOEs IOEs IOEs IOEs IOEs IOEs

  30. Simulation on FLEX 10K 70 Chip • The ISCU runs at about 10MHz using 50% logical gates • One EAB is used as a local memory for one PE; 4 PEs and the support circuit runs at about 14MHz using 82% logical cells. From the simulation result, we can see that the FLEX 10K 70 chip isn’t large enough for the 4-PE processor. So our current work is targeting on Altera APEX 20K devices with 1million gates in one chip.

  31. Future Work • Explore more arithmetic features and associative operations • Develop the complete ASC assembly language and the ASC back-end compiler • Implement the PE cell interconnection network • Implement the whole ASC processor on bigger and faster FPGA chips • Develop the multiple instruction stream MASC model

More Related