
Warp Processors


Presentation Transcript


1. Warp Processors
Frank Vahid (Task Leader), Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students:
• Greg Stitt – Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainesville
• Ann Gordon-Ross – Ph.D. June 2007, now Asst. Prof. at Univ. of Florida, Gainesville
• David Sheldon – Ph.D. expected 2009
• Scott Sirowy – Ph.D. expected 2010
Industrial liaisons: Brian W. Einloth, Motorola; Dave Clark, Darshan Patra, Intel; Jeff Welser, Scott Lekuch, IBM

2. Task Description
• Warp processing background
• Idea: invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
• Task: mature warp technology
• Years 1/2:
  • Automatic high-level construct recovery from binaries
  • In-depth case studies (with Freescale)
  • Warp-tailored FPGA prototype (with Intel)
• Years 2/3:
  • Reduce the memory bottleneck by using a smart buffer
  • Investigate domain-specific-FPGA concepts (with Freescale)
  • Consider desktop/server domains (with IBM)

3. Background
• Motivated by the commercial dynamic binary translation of the early 2000s, e.g., Transmeta Crusoe "code morphing": an x86 binary is dynamically translated to a VLIW binary for performance.
• Warp processing (Lysecky/Stitt/Vahid 2003–2007): dynamically translate the binary to circuits on FPGAs.
[Figure: binary translation from a µP binary to a VLIW binary, contrasted with warp "translation" from a µP binary to an FPGA]

4. Warp Processing Background – Step 1: Initially, the software binary is loaded into instruction memory.

Software binary (running example):
  Mov reg3, 0
  Mov reg4, 0
  loop: Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4

[Architecture: µP with I-Mem and D$, profiler, FPGA, and on-chip CAD on a single chip]

5. Warp Processing Background – Step 2: The microprocessor executes the instructions in the software binary (same example binary and architecture as above).

6. Warp Processing Background – Step 3: The profiler monitors the executing instructions and detects critical regions in the binary – here, the frequently executed loop whose add and beq instructions dominate the instruction stream ("Critical Loop Detected").
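The profiler's loop detection can be pictured in software roughly as counting taken backward branches by target address, since hot loop headers dominate those counts. The sketch below only illustrates that idea – the table size, hashing, and names are invented here and are not the actual non-intrusive hardware profiler:

  #include <stdint.h>

  #define TABLE_SIZE 64

  typedef struct {
      uint32_t target;   /* branch target address (candidate loop head) */
      uint32_t count;    /* how often this backward branch was taken */
  } LoopEntry;

  static LoopEntry loop_table[TABLE_SIZE];

  void profile_branch(uint32_t pc, uint32_t target, int taken)
  {
      if (!taken || target >= pc)                  /* only taken backward branches */
          return;
      unsigned idx = (target >> 2) % TABLE_SIZE;   /* simple direct-mapped slot */
      if (loop_table[idx].target != target) {      /* new loop replaces the old entry */
          loop_table[idx].target = target;
          loop_table[idx].count = 0;
      }
      loop_table[idx].count++;                     /* frequent entries mark critical regions */
  }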

7. Warp Processing Background – Step 4: The on-chip CAD reads in the critical region.

8. Warp Processing Background – Step 5: The on-chip CAD (the dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG), recovering loops, arrays, subroutines, etc. – needed to synthesize good circuits.

Decompiled critical region:
  reg3 := 0
  reg4 := 0
  loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4

Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al. ICCAD'02, DAC'03, CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06, TODAES'07).
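For the example binary, the kind of high-level structure the decompiler recovers can be written out in C roughly as follows; the identifier names are invented, and the shift by 1 in the address computation is what reveals the 2-byte array elements:

  short a[10];                       /* reg2 holds the base address of a 2-byte-element array */

  int recovered_loop(void)
  {
      int sum = 0;                   /* reg4 */
      for (int i = 0; i < 10; i++)   /* reg3: recovered induction variable, bound 10 */
          sum += a[i];               /* mem[reg2 + (reg3 << 1)]: recovered array access */
      return sum;
  }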

9. Warp Processing Background – Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit – in the figure, a tree of adders summing the array elements.

10. Warp Processing Background – Step 7: The on-chip CAD maps the circuit onto the FPGA (configurable logic blocks and switch matrices in the figure).
• Lean place & route and FPGA fabric → 10x faster CAD (Lysecky et al. DAC'03, ISSS/CODES'03, DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06).
• On multi-core chips, one powerful core can be dedicated to running the CAD.

11. Warp Processing Background – Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more – >10x speedups for some apps.

Updated binary:
  Mov reg3, 0
  Mov reg4, 0
  loop: // instructions that interact with FPGA
  Ret reg4

[Figure: software-only execution time vs. the much shorter "warped" (FPGA) execution time]

12. Warp Scenarios
Warping takes time – when is it useful?
• Long-running applications: scientific computing, etc.
• Recurring applications (save the FPGA configurations): common in embedded systems; might view warping as a (long) boot phase.
[Figure: timelines contrasting the single-execution speedup of a long-running application with a recurring application, where the on-chip CAD runs during the first execution and later executions use the FPGA]
Possible platforms: Xilinx Virtex II Pro, Altera Excalibur, Cray XD1, SGI Altix, Intel QuickAssist, ...

13. Thread Warping – Overview
Multi-core platforms → multi-threaded apps, e.g.:
  for (i = 0; i < 10; i++) {
    thread_create( f, i );
  }
• The OS schedules threads onto the available µPs; remaining threads are added to a queue.
• Thread warping: the OS invokes the on-chip CAD tools, running on one core, to create accelerators for the waiting threads' function f().
• The OS then schedules threads onto the accelerators (possibly dozens, held in an accelerator library on the FPGA) in addition to the µPs.
• Very large speedups possible – parallelism at the bit and arithmetic levels, and now at the thread level too.
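The slide's thread_create(f, i) loop corresponds to ordinary pthread usage; a minimal self-contained sketch is below (f, N, and the printed message are placeholders – whether each thread lands on a core or on a synthesized accelerator is the OS's decision and invisible to this code):

  #include <pthread.h>
  #include <stdio.h>

  #define N 10

  static void *f(void *arg)
  {
      long i = (long)arg;
      printf("thread %ld running\n", i);   /* per-thread work on index i */
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[N];
      for (long i = 0; i < N; i++)
          pthread_create(&tid[i], NULL, f, (void *)i);   /* threads the OS may queue for warping */
      for (int i = 0; i < N; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }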

14. Thread Warping Tools
• Invoked by the OS.
• Use the pthread library (POSIX), with mutexes/semaphores for synchronization.
• We defined the methods/algorithms of a thread warping framework.
[Framework flow: the thread queue (thread functions, thread counts) feeds queue analysis; if a function is not in the accelerator library, accelerator synthesis runs – decompilation, hw/sw partitioning, high-level synthesis, memory access synchronization, and a binary updater produce a netlist, and place & route produces a bitfile; once accelerators are synthesized, accelerator instantiation updates the thread group table, schedulable resource list, and binary, and configures the FPGA]

15. Memory Access Synchronization (MAS)
• Must deal with the widely known memory bottleneck problem: FPGAs are great, but often data can't be delivered to them fast enough, and data for dozens of threads can create a bottleneck on the RAM/DMA path.
• Threaded programs exhibit a unique feature: multiple threads often access the same data. Example:
  for (i = 0; i < 10; i++) {
    thread_create( thread_function, a, i );
  }
  void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++) {
      result += a[i] * val;
    }
    . . . .
  }
  // every thread reads the same array a
• Solution: fetch the data once and broadcast it to the multiple threads (MAS).

16. Memory Access Synchronization (MAS)
1) Identify thread groups – loops that create threads.
2) Identify constant memory addresses in the thread function, via def-use analysis of the parameters to the thread function.
3) Synthesis creates a "combined" memory access, with execution synchronized by the OS: the data is fetched once and delivered to the entire group.
Example:
  for (i = 0; i < 100; i++) {
    thread_create( f, a, i );
  }
  void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++) {
      result += a[i] * val;
    }
    . . . .
  }
Def-use: a is constant for all threads, so the addresses of a[0-9] are constant for the thread group; the DMA fetches A[0-9] from RAM once and delivers it to every thread in the group (enabled by the OS).
Before MAS: 1000 memory accesses. After MAS: 100 memory accesses.
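As an illustration of step 2, the def-use result for this example can be annotated directly on the thread function; this is just a commented restatement of the code above, not tool output:

  /* a is passed unchanged from the creating loop, so it is identical for
   * every thread in the group: the addresses a[0..9] are group-constant
   * and can be fetched once and broadcast. val differs per thread and
   * stays local to each accelerator. */
  int f(int a[], int val)            /* a: group-constant, val: per-thread */
  {
      int result = 0;
      for (int i = 0; i < 10; i++)
          result += a[i] * val;      /* same ten addresses in every thread */
      return result;
  }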

17. Memory Access Synchronization (MAS)
• MAS also detects overlapping memory regions – "windows".
• Synthesis creates an extended "smart buffer" [Guo/Najjar FPGA'04] that caches reused data and delivers windows to the threads.
Example:
  for (i = 0; i < 100; i++) {
    thread_create( thread_function, a, i );
  }
  void f( int a[], int i ) {
    int result;
    result += a[i]+a[i+1]+a[i+2]+a[i+3];
    . . . .
  }
Each thread accesses different addresses, but the addresses may overlap (a[0-3], a[1-4], ...). A[0-103] is streamed from RAM into the smart buffer, which delivers a window to each thread.
Without smart buffer: 400 memory accesses. With smart buffer: 104 memory accesses.
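A toy software model of the smart buffer's effect on this example is sketched below. It merely counts word reads from RAM, so the slide's 400 vs. 104 figures fall out directly; the real smart buffer is synthesized hardware, and all names here are invented:

  #include <stdio.h>

  #define THREADS 100
  #define WIN     4
  #define BUF     104                      /* the slide streams a[0..103] into the buffer */

  static int  ram_a[BUF];
  static long ram_reads;

  static int ram_read(int i) { ram_reads++; return ram_a[i]; }

  int main(void)
  {
      int out[THREADS];

      /* Without the smart buffer: each thread reads its own window from RAM,
         re-reading data its neighbors already fetched. */
      ram_reads = 0;
      for (int t = 0; t < THREADS; t++) {
          int sum = 0;
          for (int k = 0; k < WIN; k++)
              sum += ram_read(t + k);
          out[t] = sum;
      }
      printf("no smart buffer:   %ld RAM reads\n", ram_reads);   /* 400 */

      /* With the smart buffer: stream the region once, then serve each
         thread's window from the buffer. */
      ram_reads = 0;
      int buffer[BUF];
      for (int i = 0; i < BUF; i++)
          buffer[i] = ram_read(i);
      for (int t = 0; t < THREADS; t++) {
          int sum = 0;
          for (int k = 0; k < WIN; k++)
              sum += buffer[t + k];
          out[t] = sum;
      }
      printf("with smart buffer: %ld RAM reads\n", ram_reads);   /* 104 */
      (void)out;
      return 0;
  }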

18. Framework
• Also developed initial algorithms for:
  • Queue analysis
  • Accelerator instantiation
  • OS scheduling of threads to accelerators and cores
[Framework flow diagram as on slide 14]

19. Thread Warping Example
  int main( ) {
    . . .
    for (i=0; i < 50; i++) {
      thread_create( filter, a, b, i );
    }
    . . .
  }
  void filter( int a[51], int b[50], int i ) {
    b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
  }
• filter() threads execute on the available cores; the remaining threads are added to the thread queue.
• The OS invokes the CAD (due to queue size, or periodically).
• Queue analysis identifies the thread functions – here filter() – and the CAD tools select filter() for synthesis.
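Written out sequentially with avg() spelled out, the example is a 4-tap averaging filter; the warp tools only ever see the compiled binary of code like this, never the source. The helper name avg4 and the unsized array parameters are illustrative choices:

  static int avg4(int w, int x, int y, int z)
  {
      return (w + x + y + z) / 4;
  }

  void filter(int a[], int b[], int i)
  {
      b[i] = avg4(a[i], a[i + 1], a[i + 2], a[i + 3]);
  }

  void filter_all(int a[], int b[], int n)     /* n = 50 in the example */
  {
      for (int i = 0; i < n; i++)              /* each iteration becomes one thread */
          filter(a, b, i);
  }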

20. Example (continued)
• MAS detects the thread group – the loop creating the filter() threads.
• The CAD reads the filter() binary and decompiles it into a CDFG.
• MAS detects the overlapping windows a[i..i+3].

21. Example (continued)
• High-level synthesis creates a pipelined accelerator for the filter() group – 8 accelerators, each an adder tree followed by a shift right by 2, fed by a smart buffer connected to RAM.
• The accelerators are loaded into the FPGA and stored in the accelerator library for future use.

22. Example (continued)
• The OS schedules the threads to the accelerators and enables them.
• The smart buffer streams the a[] data (a[0-52]) from RAM.
• After the buffer fills, it delivers a window to each of the eight accelerators (a[2-5], ..., a[9-12], ...).

23. Example (continued)
• Each cycle, the smart buffer delivers eight more windows (a[10-13], ..., a[17-20], ...), so the pipeline remains full.

24. Example (continued)
• After the pipeline latency passes, the accelerators produce 8 outputs (b[2-9]) per cycle.

25. Example (continued)
• An additional 8 outputs (b[10-17]) are produced each cycle.
• Thread warping: 8 pixel outputs per cycle. Software: 1 pixel output every ~9 cycles.
• Result: a 72x cycle count improvement (8 outputs per cycle × ~9 cycles per output).

26. Experiments to Determine Thread Warping Performance: Simulator Setup
Parallel execution graph (PEG) – represents thread-level parallelism. Nodes: sequential execution blocks (SEBs); edges: pthread calls.
Simulation summary:
1) Generate the PEG using pthread wrappers.
2) Determine SEB performances – software: SimpleScalar; hardware: synthesis/simulation (Xilinx). Optimistic for software execution (no memory contention), pessimistic for warped execution (accelerators/microprocessors execute exclusively).
3) Event-driven simulation – use the defined algorithms to change the architecture dynamically.
4) Complete when all SEBs have been simulated; observe total cycles.
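Step 1 – building the PEG with pthread wrappers – can be pictured roughly as below: a wrapper around pthread_create records an edge from the creating sequential execution block (SEB) to the new thread's SEB and starts a fresh SEB for the creator. All names and the text log format are invented for illustration; the actual wrappers would also handle joins, mutexes, and concurrent creation:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  static int next_seb_id = 1;              /* PEG node ids (creation assumed single-threaded here) */
  static __thread int current_seb;         /* SEB the running thread is currently in */

  struct wrap_arg { void *(*fn)(void *); void *arg; int seb; };

  static void *wrap_start(void *p)
  {
      struct wrap_arg *w = p;
      current_seb = w->seb;                /* the new thread starts its own SEB */
      void *ret = w->fn(w->arg);
      free(w);
      return ret;
  }

  int peg_pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                         void *(*fn)(void *), void *arg)
  {
      struct wrap_arg *w = malloc(sizeof *w);
      int creator_seb = current_seb;
      w->fn = fn;
      w->arg = arg;
      w->seb = ++next_seb_id;
      printf("PEG edge: SEB %d -> SEB %d\n", creator_seb, w->seb);       /* creator -> child */
      current_seb = ++next_seb_id;         /* the creator continues in a fresh SEB */
      printf("PEG edge: SEB %d -> SEB %d\n", creator_seb, current_seb);  /* creator continues */
      return pthread_create(tid, attr, wrap_start, w);
  }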

27. Experiments
• Benchmarks: image processing, DSP, scientific computing – highly parallel examples chosen to illustrate thread warping's potential; we created multithreaded versions.
• Base architecture: 4 ARM cores; focus on recurring applications (embedded).
• Multi-core baseline: 4 ARM11 cores at 400 MHz. Thread warping (TW): 4 ARM11 cores at 400 MHz + FPGA, with the FPGA running at whatever frequency synthesis determines.

28. Speedup from Thread Warping
• Average 130x speedup.
• But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size ≈ 36 ARM11s): thread warping is 11x faster than the 64-core system.
• The simulation is pessimistic; actual results are likely better.

29. Why Dynamic?
• Static compilation to FPGAs is good, but hiding the FPGA opens the technique to all software platforms – standard languages, tools, and binaries. Static compiling to FPGAs needs a specialized language and specialized compiler that produce a binary plus a netlist; dynamic compiling to FPGAs accepts any language and any compiler producing an ordinary binary, with the µP/FPGA handled underneath.
• Can adapt to changing workloads: smaller and more accelerators, or fewer and larger accelerators, ...
• Can add FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor.
• Custom interconnections, tuned processors, ...

30. Warp Processing Enables the Expandable Logic Concept
• Expandable logic – the warp tools detect the amount of FPGA present and invisibly adapt the application to use less or more hardware.
• Expandable RAM – the system detects RAM at start-up and improves performance invisibly.
[Figure: platform with µP, cache, profiler, warp tools, DMA, RAM, and expandable FPGA/RAM; performance grows as logic is added]
• Planning a MICRO submission.

31. Expandable Logic
• Used our simulation framework.
• Large speedups – 14x to 400x (on scientific apps).
• Different apps require different amounts of FPGA.
• Expandable logic allows customization of a single platform: the user selects the required amount of FPGA, with no need to recompile or resynthesize.

32. Dynamic Enables Custom Communication
• NoC – a network on a chip provides communication between multiple cores.
• Problem: the best topology is application dependent – App1 may run best on a bus, App2 on a mesh.

33. Dynamic Enables Custom Communication (continued)
• Warp processing can dynamically choose the topology: the FPGA implements the bus or mesh interconnect best suited to the running application's cores.

34. Industrial Interactions, Year 2/3
• Freescale
  • Research visit: F. Vahid to Freescale, Chicago, Spring '06 – talk and full-day research discussion with several engineers.
  • Internships: Scott Sirowy, summer 2006 in Austin (also 2005).
• Intel
  • Chip prototype: participated in Intel's Research Shuttle to build a prototype warp FPGA fabric – continued bi-weekly phone meetings with Intel engineers, a visit to Intel by PI Vahid and R. Lysecky (now a professor at the University of Arizona), and a several-day visit to Intel by Lysecky to simulate the design, ready for tapeout. June '06: Intel cancelled the entire shuttle program as part of larger cutbacks.
  • Research discussions via email with liaison Darshan Patra (Oregon).
• IBM
  • Internships: Ryan Mannion, summer and fall 2006 in Yorktown Heights; Caleb Leak, summer/fall 2007.
  • Platform: IBM's Scott Lekuch and Kai Schleupen made a 2-day visit to UCR to set up a Cell development platform having FPGAs.
  • Technical discussion: numerous ongoing email and phone interactions with S. Lekuch regarding our research on the Cell/FPGA platform.
• Several interactions with Xilinx also.

35. Patents
• "Warp Processor" patent: filed with the USPTO in summer 2004, granted in winter 2007.
• SRC has a non-exclusive, royalty-free license.

36. Year 1/2 publications
• New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
• Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
• Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale.)
• Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
• A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
• A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
• A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
• A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

37. Year 2/3 publications
• Warp Processing: Dynamic Translation of Binaries to FPGA Circuits. F. Vahid, G. Stitt, and R. Lysecky. IEEE Computer, 2008 (to appear).
• C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability. S. Sirowy, G. Stitt, and F. Vahid. Int. Symp. on FPGAs, 2008.
• Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators. G. Stitt and F. Vahid. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2007, pp. 93-98.
• A Self-Tuning Configurable Cache. A. Gordon-Ross and F. Vahid. Design Automation Conference (DAC), 2007.
• Binary Synthesis. G. Stitt and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), Aug. 2007.
• Integrated Coupling and Clock Frequency Assignment. S. Sirowy and F. Vahid. International Embedded Systems Symposium (IESS), 2007.
• Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid and S. Lonardi. Design Automation and Test in Europe (DATE), 2007.
• A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid and W. Najjar. Design Automation and Test in Europe (DATE), 2007.
• Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid. Design Automation and Test in Europe (DATE), 2007.
• Clock-Frequency Partitioning for Multiple Clock Domain Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid.
• Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D.M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
• A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
• Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D.M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
• Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
• Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July 2006.
• Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, and W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.
