1 / 79

The Trimedia CPU64 VLIW Media Processor

The Trimedia CPU64 VLIW Media Processor. Kees Vissers Philips Research Visiting Industrial Fellow vissers@eecs.berkeley.edu. Outline. Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results. SDRAM. Serial I/O. video-in.

weylin
Télécharger la présentation

The Trimedia CPU64 VLIW Media Processor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Trimedia CPU64 VLIWMedia Processor Kees Vissers Philips Research Visiting Industrial Fellow vissers@eecs.berkeley.edu

  2. Outline • Introduction • Application Domain • Processor Architecture • Retargetable compiler • Design Space Exploration • Results

  3. SDRAM Serial I/O video-in PCI bridge video-out timers I2C I/O audio-out I$ VLIW cpu audio-in D$ Design Problem

  4. Application Domain • High volume consumer electronics productsfuture TV, home theatre, set-top box, etc. • Embedded core • Media processing:audio, video, graphics, communication

  5. Media Processing CPU Design considerations: • Cost (embedded in high volume products) • Performance for the application domain • Ease of use (programming model)  Benchmark suite needed

  6. Application: Natural Motion D/2 n-1 -D/2 n-1/2 picture number n

  7. Machine Description BenchmarkApplications Compiler Simulator Performance Numbers Project Approach: Y-chart

  8. Benchmark Suite Characteristics • Each application: typical for a class of applications within application domain • The set covers a significant area of the application domain • Each benchmark is sufficiently well tuned to the architecture to measure its performance

  9. The Benchmark Suite

  10. Processing Characteristics • Signal ratesaudio, video (at block and pixel rate) • Basic data typesbyte (8), half-words (16), words (32), float (32) • Data access patternssample stream, bitstream, random access • Data independent and dependent load • Control processing and signal processing

  11. Algorithm Reference C TM1000 optimized CPU64 optimized knowledge simulation IC exploration video video Application Development

  12. Initial Design Considerations • Goal: 6-8 times TM1000 performance • Standard ANSI-C, reuse of code • Utilize instruction and data parallelism • Limited complexity (embedded core) • Compatibility through recompilation • VLIW architecture

  13. Instruction Set Considerations • A machine operation must be sufficiently generic within the application domain • Sufficiently powerful operations • Limited number of operations • Consistency and orthogonality

  14. 7 7 TM1000 cycles x 10 CPU64 cycles x 10 Results: Natural Motion Averagegain: 2.7x(cycles) 15 15 UPC 10 10 SEG SEG instructions 5 5 MED UPC nops SAD SUB cache stalls MED SUB SAD 0 0 0 20 40 60 80 100% 0 20 40 60 80 100%

  15. Natural Motion Dynamic Load 150 cpu load (%) tm1000 @ 125MHz 100 90% 50 cpu64 @ 300MHz 16% 0 10 20 30 40 50 60 70 80 90 100 110 Field number

  16. Summary • Application domain: • Media processing for CE industry • Benchmark suite: • Targeted to application domain • Optimized for class of processors • Future products: • Gradual shift to software implementations of signal processing functions

  17. Outline • Introduction • Application Domain • Processor Architecture • Retargetable compiler • Design Space Exploration • Results

  18. Application Target • Processor core, to be embedded in different ICs and products • Real time processing of media streams • Cost-sensitive consumer electronics market

  19. SDRAM Serial I/O video-in PCI bridge video-out timers I2C I/O audio-out I$ VLIW cpu audio-in D$ Embedded Application

  20. Performance Target Relative to TriMedia TM1000 (5-slot VLIW,100MHz, 32-bit datapath, media operations): • 6x to 8x performance increase,to process more or higher-resolution video streams • Not more than double transistor count

  21. Efficiency Good performance/silicon cost ratio by: • Optimize CPU architecture and media benchmark source code towards each other • Solve resource conflicts at compile time

  22. Outline • Introduction • VLIW architecture • Instruction set • IDCT example • Results

  23. Architectural Speedup Not simply increasing the VLIW issue width: • Diminishing gain of compiler-generated ILP • Increasing implementation complexity(area, timing)

  24. Architectural Speedup • Extended SIMD: uniform 64-bit design,vectors of 1-, 2-, and 4-byte elements(data throughput per cycle) • New, extensive, media instruction set(functionality per cycle) • Improved cache control(prefetch, alloc)

  25. 32-bit peripheral bus 64-bit memory bus multi-port 128 words x 64 bits register file bypass network datacache16KB mmu mmu FU FU FU FU FU exceptions PC VLIW instruction decode and launch instruction cache 32 KB mmu CPU64 Architecture

  26. Code Example int a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; RISC instructions: VLIW instructions (5 issueslots): 1 x = *a 1 x = *a y = *b i + = 1 a += 1 b += 12 y = *b 2 g = i<n nop nop nop nop3 i += 1 3 jmpf g 7 jmpt g 1 nop nop nop4 g = i<n 4 z = x+y nop nop nop nop5 z = x+y 5 *c = z c+=1 nop nop nop6 *c = z 6 nop nop nop nop nop7 jmpf g 128 jmpt g 19 a += 110 b += 111 c += 1

  27. VLIW challenges • Achieve high Instruction Level Parallelism • Compiler needs to know the pipeline for instructions • Code size • Design of a multi-ported register file. • Design of the forwarding network

  28. Instruction Level Parallelism • Loop unrolling, software pipelining • Guarded execution • Avoid branches • if conversion, • hand rewrite source!, valid in Embedded applications, done for TI, Philips and Apple software (Adobe fotoshop). • Speculate branches

  29. Compiler • Detect data dependency at compile time: • examples: c[i]=a[i]+b[i]; potential dependency d[i]=a[i]+c[j]; c[i] might be c[j] c[1]=a[i]+b[i]; no dependency d[i]=a[i]+c[2]; c[1] is never c[2]

  30. Compiler • Schedule the instructions in the proper slot at the proper time, e.g.: • load and stores only in slot 1,2 • adds in all slots • floating point multiply only in slot 3 • add take 1 cycle, mult takes 2 cycles • 3 branch delay slots

  31. Code size • Large number of registers cost code size • add r1 r3 r127 • Large number of media instructions: code size • Use hardware compression techniques to remove nops • Variable length instruction format! • Vector instructions: SIMD, great for data processing performance

  32. Code Example char a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; : VLIW instructions (5 issueslots): 1 x = *a y = *b i + = 1 a += 1 b += 1 2 g = i<n nop nop nop nop 3 jmpf g 7 jmpt g 1 nop nop nop 4 z = x+y nop nop nop nop 5 *c = z c+=1 nop nop nop 6 nop nop nop nop nop

  33. Code Example Vector vec64sb a[n/8], b[n/8], c[n/8]; int i; for (i=0; i</8n; i++) c[i]=sb_add(a[i],b[i]); : VLIW instructions and SIMD (5 issueslots): 1 x = ld64b a y =ld64b b i + = 1 a += 8 b += 8 2 g = i<fraction n nop nop nop nop 3 jmpf g 7 jmpt g 1 nop nop nop 4 z = add64sb x y nop nop nop nop 5 st64b c z c+=1 nop nop nop 6 nop nop nop nop nop

  34. x7 x6 x5 x4 x3 x2 x1 x0 y7 y6 y5 y4 y3 y2 y1 y0 - - - - - - - - |.| |.| |.| |.| |.| |.| |.| |.| S 0 z Vector Instruction Example Sum of absolute differences (“ub_me”)

  35. Application Code Example int calc_sad(vec64ub *prv, vec64ub *cur, int s) { vec64ub left, right; int i; int sad = 0; for (i=0; i<8; i++) { left = prv[i*s]; right = cur[i*s]; sad += ub_me(left, right); } return(sad); }

  36. Architecture VLIW+SIMD • Issues a 5-slot instruction every cycle • Each slot supports a selection of FUs • All FUs support vectorized (SIMD) data • Double-slot FU allows powerful multi-argument, multi-result operations • All FUs are pipelined, latency 1 to 4(except floating point divide and sqrt)

  37. Instruction Format Per operation, per slot: • Up to 2 arguments, up to 1 result • Optionally an extra register for guarding (conditional execution) • Immediate argument size can be 32 bit Instructions have compressed variable length format, decompressed during decode

  38. Operations Intends to cover combinations of: • 1-, 2-, 4-byte elements in an 8-byte vector • Signed or unsigned type • Clipped versus wrap-around arithmetic • Set of basic functions(ld, st, add, mul, compare, shift, ...)

  39. Instruction Set

  40. 1 1 1 1 + + + + Example: msp-multiply Each argument: 4-way 16-bit int Multiply to internal double precision integer Round lower half into upper half Return upper half

  41. a b c d e f g h i j k l m n o p a e i m b f j n Example: Transpose 2-slot ‘super-op’:takes 4 arguments, shuffles bytes,produces 2 results (2-dimensional filtering)

  42. + + + + + + + + + + + + + + + + Example: 4-way average add with extended precision 1 1 round (MPEG 2-dimensional half-pixel average)

  43. Branches • Branch operations have 3 branch delay slots • Compiler & scheduler try to fill these(profiling, loop unrolling, inlining, guarding) • Branch units are properly pipelined • Up to 3 branch ops in 1 instruction • No branch prediction hardware • Branches are preferred moments for interrupt servicing

  44. 2D-IDCT Example • IDCT code was tested early in the project, also for studying the programming model • Mapped through our experimental C compiler and instruction scheduler • Operates entirely on (vectors of) 16-bit data elements • Simulation showed IEEE 1180 accuracy compliancy

  45. OK OK OK OK 2D-IDCT Instructions Execution time in cycles, excluding data cache misses VLIW: CPU64, TM1000, TI 320C60 others superscalar = accuracy compliant

  46. Status Now in construction at Philips Semiconductors, Sunnyvale • 7M transistors (target) • 300Mhz clock (target) • 0.18 technology • 1.8 volt

  47. Conclusion Processor Architecture • The 5-slot 64-bit VLIW provides a powerful architecture for media processing • Performance gain of 6x to 8x on TM1000 • Rich and regular media instruction set • Powerful multi-argument, multi-result ‘super-op’ naturally fits VLIW architecture

  48. Outline • Introduction • Application Domain • Processor Architecture • Retargetable compiler • Design Space Exploration • Results

  49. App 1 Exe 1A Exe 1B App 2 Exe 2A Exe 2B App 3 Exe 3A Exe 3B Traditional Compilation Compiler A Machine A Compiler B Machine B

  50. App 1 Exe 1A Exe 1B App 2 Exe 2A Exe 2B App 3 Exe 3A Exe 3B Retargetable Compilation Machine ADescription Machine A Compiler Machine B Machine BDescription

More Related