ECE 734 VLSI Array Structures for Digital Signal Processing

Topic: Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060. Team Members: Peng Zhang and Xun Zhang Advisor: Yu Hen Hu. ECE 734 VLSI Array Structures for Digital Signal Processing. Spring 2004.


Presentation Transcript


  1. Topic: Implementation of JPEG 2000 component algorithm—DWT in TI TMS32060 Team Members: Peng Zhang and Xun Zhang Advisor: Yu Hen Hu ECE 734 VLSI Array Structures for Digital Signal Processing Spring 2004

  2. Agenda
     • Abstract
     • DWT C implementation
     • DWT TMS320C62 Assembly Code
       • Without optimization
       • Speed optimization
       • Pipeline optimization (by us)
     • Result comparison
     • JPEG 2000 and DWT (if we have free time)

  3. Abstract
     In this project, we implement and optimize the DWT algorithm, a key algorithm in JPEG 2000, on the TI TMS320C62 platform.
     • Step 1: we implemented the 2D DWT algorithm in C.
     • Step 2: we implemented the 2D DWT algorithm on the TI TMS320C62 platform twice, once without any optimization and once with the fastest speed optimization.
     • Step 3: we applied further optimization to the assembly code, mainly pipelining.
     • Step 4: we compared the performance before and after our optimization.

  4. C code Implementation
     ...
     #define S(i) a[x*(i)*2]
     ...
     void dwt_deinterleave(int *a, int n, int x) {
         int dn, sn, i;
         int *b;
         dn=n/2;
         sn=(n+1)/2;
         b=(int*)malloc(n*sizeof(int));
         for (i=0; i<sn; i++)
             b[i]=a[2*i*x];
         ...
     }

     /// Forward wavelet transform in 1-D.
     void dwt_encode_1(int *a, int n, int x) {
         ...
         dwt_deinterleave(a, n, x);
     }

     /// Forward wavelet transform in 2-D.
     void dwt_encode(int *a, int w, int h, int l) {
         int i, j, rw, rh;
         for (i=0; i<l; i++) {
             rw=int_ceildivpow2(w, i);
             rh=int_ceildivpow2(h, i);
             for (j=0; j<rw; j++)
                 dwt_encode_1(a+j, rh, w);
             ...
         }
     }

     void main() {
         ...
         dwt_encode(image[0], 200, 165, 8);
         ...
     }
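The slide elides most of the body of `dwt_deinterleave`. As a minimal sketch of what a de-interleave step of this shape typically does (an assumption, not the authors' elided code), the even-indexed samples are gathered into the first half of a temporary buffer, the odd-indexed samples into the second half, and the result is written back with the same stride `x`. The name `deinterleave_sketch`, the odd-sample loop, the copy-back loop, and the error return are all hypothetical; only the even-sample loop appears on the slide.

```c
#include <stdlib.h>

/* Hypothetical completion of the elided de-interleave step (assumption).
 * a: strided signal, n: number of samples, x: stride between samples.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int deinterleave_sketch(int *a, int n, int x)
{
    int dn = n / 2;        /* number of odd-indexed samples  */
    int sn = (n + 1) / 2;  /* number of even-indexed samples */
    int i;
    int *b;

    if (n <= 0)
        return 0;
    b = (int *)malloc((size_t)n * sizeof(int));
    if (b == NULL)
        return -1;

    for (i = 0; i < sn; i++)            /* loop shown on the slide */
        b[i] = a[2 * i * x];
    for (i = 0; i < dn; i++)            /* assumed: gather odd samples */
        b[sn + i] = a[(2 * i + 1) * x];
    for (i = 0; i < n; i++)             /* assumed: copy back in place */
        a[i * x] = b[i];

    free(b);
    return 0;
}
```

With stride 1 and the input 0, 1, 2, 3, 4, this sketch leaves the array as 0, 2, 4, 1, 3.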

  5. Assembly Code without any optimization
     ;----------------------------------------------------------------------
     ;----------------------------------------------------------------------
     ;  24 | void dwt_deinterleave(int *a, int n, int x)
     ;----------------------------------------------------------------------
     _dwt_deinterleave:
     ;** --------------------------------------------------------------------------*
     ...
     ;----------------------------------------------------------------------
     ;  31 | for (i=0; i<sn; i++)
     ;----------------------------------------------------------------------
                ZERO    .D2     B4                ; |31|
                STW     .D2T2   B4,*+SP(24)       ; |31|
                LDW     .D2T2   *+SP(24),B5       ; |31|
                LDW     .D2T2   *+SP(20),B4       ; |31|
                NOP             4
                CMPLT   .L2     B5,B4,B0          ; |31|
        [!B0]   B       .S1     L2                ; |31|
                NOP             5
                ; BRANCH OCCURS                   ; |31|
     L1:
                .line   9
     ;  32 | b[i]=a[2*i*x];
     ;----------------------------------------------------------------------
                LDW     .D2T2   *+SP(24),B4       ; |32|
                LDW     .D2T2   *+SP(12),B5       ; |32|
                LDW     .D2T2   *+SP(4),B6        ; |32|
                NOP             2
                ADD     .D2     B4,B4,B4
                MPYLH   .M2     B5,B4,B8          ; |32|
                MPYLH   .M2     B4,B5,B7          ; |32|
                MPYU    .M2     B5,B4,B5          ; |32|
                ADD     .D2     B8,B7,B4          ; |32|
                SHL     .S2     B4,16,B4          ; |32|
                ADD     .S2     B5,B4,B4          ; |32|
     ||         LDW     .D2T2   *+SP(28),B7       ; |32|
                LDW     .D2T2   *+B6[B4],B4       ; |32|
                LDW     .D2T2   *+SP(24),B5       ; |32|
                NOP             4
                STW     .D2T2   B4,*+B7[B5]       ; |32|
                LDW     .D2T2   *+SP(24),B4       ; |32|
                NOP             4
                ADD     .D2     1,B4,B4           ; |32|
                STW     .D2T2   B4,*+SP(24)       ; |32|
                LDW     .D2T2   *+SP(24),B5       ; |32|
                LDW     .D2T2   *+SP(20),B4       ; |32|
                NOP             4
                CMPLT   .L2     B5,B4,B0          ; |32|
        [ B0]   B       .S1     L1                ; |32|
                NOP             5
                ; BRANCH OCCURS                   ; |32|
     ;----------------------------------------------------------------------
     ...

  6. Assembly Code with speed optimization
     _dwt_deinterleave:
     …
     ;** ------------------------------------------------------------------------
     ||         MV      .D2     B4,B11
                .line   5
                MV      .D2     B11,B0            ; |28|
                SHRU    .S2     B0,31,B4          ; |28|
                ADD     .D2     B4,B0,B4          ; |28|
                SHR     .S2     B4,1,B0           ; |28|
                MV      .D2     B0,B12            ; |28|
                .line   6
                ADD     .D2     1,B11,B10         ; |29|
                SHRU    .S2     B10,31,B4         ; |29|
                ADD     .D2     B4,B10,B4         ; |29|
                SHR     .S2     B4,1,B4           ; |29|
                MV      .S1X    B4,A12            ; |29|
                .line   7
                B       .S1     _malloc           ; |30|
                MVKL    .S2     RL0,B3            ; |30|
                SHL     .S1X    B11,2,A4          ; |30|
                MVKH    .S2     RL0,B3            ; |30|
                NOP             2
     RL0:       ; CALL OCCURS                     ; |30|
                .line   8
                CMPLT   .L2     B10,2,B0
        [ B0]   B       .S1     L2                ; |31|
                MV      .D2     B10,B4
        [!B0]   MV      .D1     A4,A3
        [!B0]   MV      .S1     A10,A0
                NOP             2
                ; BRANCH OCCURS                   ; |31|
     ;** --------------------------------------------------------------------------*
     ;**    ----------------------- U$22 = a;
     ;**    ----------------------- U$25 = b;
     ;** 32 ----------------------- L$1 = K$7>>1;
     ;**    ----------------------- X$4 = x<<3;
     ;**    ----------------------- #pragma MUST_ITERATE(1, 1073741823, 1)
                .line   9
                SHR     .S2     B4,1,B0           ; |32|
     ||         SHL     .S1     A11,3,A6
     ;**    -----------------------g3:
     ;** 32 ----------------------- *U$25++ = *U$22;
     ;** 32 ----------------------- U$22 += X$4;
     ;** 32 ----------------------- if ( --L$1 ) goto g3;
                SUB     .D2     B0,1,B0           ; |32|
     L1:
        [ B0]   B       .S1     L1                ; |32|
     ||         LDW     .D1T1   *A0,A5            ; |32|
                ADD     .S1     A6,A0,A0          ; |32|
        [ B0]   SUB     .D2     B0,1,B0           ; |32|
                NOP             2
                STW     .D1T1   A5,*A3++          ; |32|
                ; BRANCH OCCURS                   ; |32|
     ;** -----------------------------------------------------------------------*
     ...
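The `;**` annotations in this listing show how the compiler rewrote the indexed loop `b[i]=a[2*i*x]` as a pointer-stepping loop: a source pointer advanced by a fixed stride (`X$4 = x<<3` bytes, which is `2*x` elements when an `int` is 4 bytes) and a down-counting trip count. A minimal C sketch of that same transformation follows. The function name `gather_even_sketch` and its parameter list are hypothetical, and the increment is placed before each load rather than after it so that the pointer never moves past the last element read.

```c
/* Pointer-stepping form of: for (i = 0; i < sn; i++) b[i] = a[2*i*x];
 * Hypothetical helper; mirrors the compiler annotations in the listing. */
void gather_even_sketch(int *b, const int *a, int sn, int x)
{
    const int *src = a;   /* U$22 = a                                   */
    int *dst = b;         /* U$25 = b                                   */
    int step = 2 * x;     /* X$4 = x<<3 bytes, i.e. 2*x ints of 4 bytes */
    int count = sn;       /* plays the role of the trip count L$1       */

    if (count < 1)        /* the listing branches around the loop here  */
        return;

    *dst++ = *src;        /* first element needs no stride step         */
    while (--count) {     /* if ( --L$1 ) goto g3                       */
        src += step;      /* U$22 += X$4                                */
        *dst++ = *src;    /* *U$25++ = *U$22                            */
    }
}
```

Replacing the multiply in `2*i*x` with one add per iteration is what removes the `MPYLH`/`MPYU` sequence seen in the unoptimized listing.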

  7. Speed optimized code analysis
        [ B0]   B       .S1     L1                ; |32|
     ||         LDW     .D1T1   *A0,A5            ; |32|
                ADD     .S1     A6,A0,A0          ; |32|
        [ B0]   SUB     .D2     B0,1,B0           ; |32|
                NOP             2
                STW     .D1T1   A5,*A3++          ; |32|
     • Source loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
     • Assume sn = n+1
     • Each iteration takes 6 cycles (B in parallel with LDW: 1, ADD: 1, SUB: 1, NOP 2: 2, STW: 1), so 6*(n+1) clock cycles are needed

  8. Assembly Code with pipeline optimization
                SHR     .S2     B4,1,B0
                CMPGT   .L2     B0,6,B1
        [ B1]   B       .S1     L2
                SHL     .S1     A10,3,A3
        [!B1]   SUB     .D2     B0,1,B0
                NOP             3
     ;** --------------------------------------------------------------------------*
     ...
     ;** --------------------------------------------------------------------------*
     L2:
                ADD     .S1     A3,A4,A4
     ||         SUB     .D2     B0,7,B0
     ||         LDW     .D1T1   *A4,A6
     ;** --------------------------------------------------------------------------*
     L3:        ; PIPELINED LOOP PRE-PROCESS
                MV      .S2X    A0,B4
     || [ B0]   B       .S1     L4
     ||         ADD     .L1     A3,A4,A0
     || [ B0]   SUB     .D2     B0,1,B0
     ||         LDW     .D1T1   *A4,A0
                ADD     .L1     A3,A0,A0
     || [ B0]   SUB     .D2     B0,1,B0
     ||         LDW     .D1T1   *A0,A0
     || [ B0]   B       .S1     L4
        [ B0]   B       .S1     L4
     ||         ADD     .L1     A3,A0,A0
     || [ B0]   SUB     .D2     B0,1,B0
     ||         LDW     .D1T1   *A0,A0
                ADD     .L1     A3,A0,A0
     || [ B0]   SUB     .D2     B0,1,B0
     ||         LDW     .D1T1   *A0,A0
     || [ B0]   B       .S1     L4
                MV      .S2X    A6,B5
     || [ B0]   B       .S1     L4
     ||         ADD     .L1     A3,A0,A4
     || [ B0]   SUB     .D2     B0,1,B0
     ||         LDW     .D1T1   *A0,A0
     ;** --------------------------------------------------------------------------*
     L4:        ; PIPELINED LOOP
                STW     .D2T2   B5,*B4++
     ||         MV      .S2X    A0,B5
     || [ B0]   B       .S1     L4
     ||         ADD     .L1     A3,A4,A4
     || [ B0]   SUB     .L2     B0,1,B0
     ||         LDW     .D1T1   *A4,A0
     ;** --------------------------------------------------------------------------*
     L5:        ; PIPELINED LOOP PAST-PROCESS
                MV      .S2X    A0,B5
     ||         STW     .D2T2   B5,*B4++
                MV      .S2X    A0,B5
     ||         STW     .D2T2   B5,*B4++
                MV      .S2X    A0,B5
     ||         STW     .D2T2   B5,*B4++
                MVC     .S2     B6,CSR
     ||         MV      .L2X    A0,B5
     ||         STW     .D2T2   B5,*B4++
     ;** --------------------------------------------------------------------------*
                MV      .S2X    A0,B5
     ||         STW     .D2T2   B5,*B4++
                STW     .D2T2   B5,*B4++
     ;** --------------------------------------------------------------------------*

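  9. Pipeline optimized code design
     L4:        ; PIPELINED LOOP
                STW     .D2T2   B5,*B4++
     ||         MV      .S2X    A0,B5
     || [ B0]   B       .S1     L4
     ||         ADD     .L1     A3,A4,A4
     || [ B0]   SUB     .L2     B0,1,B0
     ||         LDW     .D1T1   *A4,A0
     • Source loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
     • Assume sn = n+1
     • The loop kernel is a single execute packet of six parallel instructions, so once the pipeline is full one iteration completes per cycle; in total n+7 clock cycles are needed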

  10. Comparison
     Speed-optimized code (by C6) vs. pipeline-optimized code (by us)
     • Source loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
     • Assume sn = n+1
     • Result: the speed-optimized code uses 6(n+1) clock cycles; the pipeline-optimized code uses n+7 clock cycles
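To make the comparison concrete, the two cycle counts stated on the slides can be evaluated for a given n. The sketch below only encodes those two formulas, 6(n+1) and n+7; the function names are hypothetical, and no measured timings are implied.

```c
/* Cycle estimates taken from the slides, for a loop of n+1 iterations.
 * Function names are illustrative, not part of the original project. */
long cycles_speed_opt(long n)
{
    return 6 * (n + 1);   /* 6 cycles per iteration */
}

long cycles_pipelined(long n)
{
    return n + 7;         /* cycle count stated for the pipelined loop */
}

/* Ratio of the two estimates; tends toward 6 as n grows. */
double speedup_estimate(long n)
{
    return (double)cycles_speed_opt(n) / (double)cycles_pipelined(n);
}
```

For n = 99 (100 iterations), the estimates are 600 and 106 cycles, a ratio of roughly 5.7. The pipelined version is ahead for every n of 1 or more, since 6(n+1) > n+7 whenever 5n > 1.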

  11. JPEG2000 Lossy Image Compression Encoder
     [Block diagram: DWT, Quantizer, Entropy Coder]

  12. 1-Level Wavelet Decomposition (2D DWT)
     [Figure: filter-bank diagram. The input image goes through row-wise operations and column-wise operations, each a low-pass or high-pass filter (H1, H2) followed by a decimator by 2, producing the LL, HL, LH, and HH components. Legend: a filter followed by a decimator maps x[n] to y[n]; the decimator keeps one out of two pixels.]
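The "filter, then keep one out of two pixels" stage in this figure can be sketched in one dimension. The example below is only an illustration: it assumes simple 2-tap Haar-style filters (pairwise average and pairwise difference) in place of the actual low-pass and high-pass filters, which the slide does not specify, and it assumes an even number of samples. The function name is hypothetical.

```c
/* One 1-D analysis stage: filter, then keep one of every two outputs.
 * Assumption: 2-tap Haar-style filters stand in for the low-pass and
 * high-pass filters of the figure. n must be even; low and high each
 * receive n/2 samples. */
void analyze_1d_sketch(const int *x, int n, int *low, int *high)
{
    int i;
    for (i = 0; i < n / 2; i++) {
        low[i]  = (x[2 * i] + x[2 * i + 1]) / 2;  /* low pass, decimated  */
        high[i] = x[2 * i] - x[2 * i + 1];        /* high pass, decimated */
    }
}
```

Applying such a stage to every row and then to every column of the result is what yields the four components LL, HL, LH, and HH.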

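  13. Multi-Level Wavelet Decomposition
     [Figure: subband layouts. A first 2D-DWT splits the image into LL, HL1, LH1, and HH1; a second 2D-DWT applied to the LL component yields LL, HL2, LH2, and HH2 alongside HL1, LH1, and HH1.]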
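Because each level operates on the LL component of the previous one, the region processed shrinks by about half in each dimension per level. The `dwt_encode` loop on slide 4 computes that region with `int_ceildivpow2`, whose definition is not shown; the sketch below assumes it means ceil(a / 2^b), and both function names here are hypothetical.

```c
/* Assumed meaning of the helper used in dwt_encode: ceil(a / 2^b),
 * for a >= 0 and 0 <= b < 31. */
int ceildivpow2_sketch(int a, int b)
{
    return (a + (1 << b) - 1) >> b;
}

/* Fill rw[i] and rh[i] with the region size processed at level i
 * (0-based), following the loop structure of dwt_encode. */
void level_sizes_sketch(int w, int h, int levels, int *rw, int *rh)
{
    int i;
    for (i = 0; i < levels; i++) {
        rw[i] = ceildivpow2_sketch(w, i);
        rh[i] = ceildivpow2_sketch(h, i);
    }
}
```

Under this assumption, the 200 x 165 image with 8 levels from slide 4 is processed as 200 x 165, then 100 x 83, then 50 x 42, and so on down to 2 x 2 at the last level.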

  14. Thanks! Questions?
