Optimization Of The Quant Component For Speed

Presentation Transcript

  1. Optimization Of The Quant Component For Speed
     as part of the seminar “Software Streaming Architecture”
     Volker Martens

  2. Why Performance Tuning?
     • decrease waiting time while decoding
     • gain of 1 s per image
       • unimportant for one image
       • 90 min video (25 f/s): 37.5 h
     • measure time or clock cycles
     • tmsim: hard to measure time => cycles used
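The video figure on this slide can be checked with a few lines of C: 90 minutes at 25 frames per second is 135,000 frames, so saving one second per frame adds up to 135,000 s, which is 37.5 hours. The function name and signature below are illustrative and do not come from the slides.

```c
/* Total decoding time saved, in hours, when every frame of a video
 * decodes gain_s seconds faster. Name and signature are illustrative. */
double saved_hours(double video_min, double fps, double gain_s)
{
    double frames = video_min * 60.0 * fps;  /* minutes -> number of frames */
    return frames * gain_s / 3600.0;         /* seconds saved -> hours */
}
```

For the slide's example, `saved_hours(90.0, 25.0, 1.0)` evaluates to 37.5.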

  3. How To Measure Clock Cycles?
     • TriMedia custom operators
       • example:
           long start = CYCLES();
           ...
           long end = CYCLES();
           printf("This code used %ld clock cycles", end - start);
       • disadvantages:
         • increases total number of cycles
         • source code has to be changed
       • nested measurements possible
     • TriMedia compiler
       • tmsim: runs program and saves execution statistics in <statfile>
           tmsim -statfile <statfile> <executable prog.>
       • tmprof: generates report for each function
           tmprof -scale 1 -func <statfile> <executable prog.>
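A minimal sketch of the start/end measurement pattern from this slide, written so that it compiles on an ordinary host. `CYCLES()` is the TriMedia custom operator and is not available elsewhere, so the sketch uses a hypothetical `READ_COUNTER()` macro built on the standard `clock()` function as a stand-in; `sum_to()` is an invented workload.

```c
#include <time.h>

/* Stand-in for the TriMedia CYCLES() custom operator. On the target this
 * would be CYCLES(); clock() ticks are used here only as an assumption so
 * that the sketch builds on a host compiler. */
#define READ_COUNTER() ((long)clock())

/* Hypothetical workload standing in for the code under test. */
static long sum_to(long n)
{
    long s = 0, i;
    for (i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Runs the workload once and returns the counter difference around it.
 * The difference also contains the cost of the two counter reads, which
 * is the first disadvantage listed on the slide. */
long measure_sum_to(long n, long *result)
{
    long start = READ_COUNTER();
    *result = sum_to(n);
    long end = READ_COUNTER();
    return end - start;
}
```

Because the value is a `long`, it should be printed with `%ld`, for example `printf("This code used %ld ticks\n", measure_sum_to(1000, &r));`.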

  4. The Start Situation
     Used functions in Quant.c and tmalQuant.c:

     Function                    Executions   Total Cycles      (%)
     -------------------------   ----------   ------------   ------
     _QuantizeIntraDCTcoefMPEG          288        2969052    28.56
     _CopyBlockFromFrame                288         684750     6.59
     _checkrange                      18144         362909     3.49
     _DC_Scaler                         576          51138     0.49
     _QuantizeIntraDCCoef               288          39453     0.38
     _QuantMacroblock                    48          27507     0.26
     _tmalQuantProcessData                1          14355     0.14
     _tmalQuantStart                      1           2332     0.02
     -------------------------   ----------   ------------   ------
     total/average                    60784       10396474   100.00

     Note: the total row gives the clock cycles over all functions.

  5. Forms Of Performance Tuning (1)
     • Profile driven compilation
       1. compile with profiling code: tmcc -p <sourcefile> -o <outputfile>
       2. generate profile information: tmsim <outputfile>
       3. recompile using profile information: tmcc -r <sourcefile> -o <outputfile>
     • compiler performs loop unrolling and restricted pointers
     • changes in sourcecode require new profile
     • -G also performs grafting

  6. Forms Of Performance Tuning (2)
     • loop optimization
       • remove IF and function calls
       • loop fusion
     • using cheaper operators
       • replace && and || by & and |, respectively
       • ...
     • using custom operators
       • special operations for DSP applications
     • manual loop unrolling
       • best for the most critical parts
     • using restricted pointers
       • tell compiler that pointers are not overlapping
     • ...
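Two of the techniques on this slide can be illustrated in portable C. The sketch below is not taken from the Quant sources; all function names are hypothetical. It shows a range test written with `&&` and with `&`, and a loop whose pointers are marked with the C99 `restrict` keyword, which stands in here for whatever spelling the TriMedia compiler expects (the slides do not say).

```c
/* Short-circuit form: the second comparison is only evaluated if the
 * first one is true, which typically costs a conditional jump. */
int in_range_logical(int x, int lo, int hi)
{
    return (x >= lo) && (x <= hi);
}

/* Bitwise form: each comparison yields 0 or 1, so & gives the same result
 * without short-circuit evaluation. This is only safe when the operands
 * have no side effects and the first one does not guard the second
 * (as in p != NULL && *p > 0). */
int in_range_bitwise(int x, int lo, int hi)
{
    return (x >= lo) & (x <= hi);
}

/* restrict promises that dst and src never overlap, so the compiler may
 * reorder or unroll the loads and stores more freely. */
void add_arrays(int *restrict dst, const int *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += src[i];
}
```

The promise made by `restrict` is not checked: calling `add_arrays` with overlapping buffers is undefined behavior.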

  7. Performed Optimizations (1)
     QuantizeIntraDCTcoefMPEG (1)

         int checkrange (int x, int cMin, int cMax)
         {
             if (x < cMin) return cMin;
             if (x > cMax) return cMax;
             return x;
         }
         ...
         iScaledCoef = checkrange (iScaledCoef, -iMaxVal, iMaxVal - 1);
         iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
         rgiDCTcoefQ [i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

     • checkrange() called 18144 times: inlining and custom ops.
     • formula with conversions from int to float and back
     • calls to min() and max() replaced by custom ops.

  8. Performed Optimizations (2)
     QuantizeIntraDCTcoefMPEG (2)

         // old code
         iScaledCoef = checkrange(iScaledCoef, -iMaxVal, iMaxVal-1);
         iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
         rgiDCTcoefQ[i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

         // faster code
         iScaledCoef = IMIN(iScaledCoef, iMaxVal - 1);
         iScaledCoef = IMAX(iScaledCoef, -iMaxVal);
         iScaledQP = (3*iQP+2) >> 2;
         rgiDCTcoefQ[i] = IMIN(iMaxAC, IMAX(-iMaxAC, iScaledCoef));

     Cycles saved:
           -766,000 cycles
           -400,000 cycles
         ==========
         -1,166,000 cycles
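The equivalence of the old and the faster code can be checked on a host machine with a small sketch. `IMIN` and `IMAX` are TriMedia custom operators, so plain C helpers `imin` and `imax` stand in for them here; the wrapper names `clamp_minmax`, `scaled_qp_float` and `scaled_qp_int` are invented for illustration.

```c
/* Portable stand-ins for the TriMedia IMIN/IMAX custom operators. */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Original range check, as shown on the slide. */
int checkrange(int x, int cMin, int cMax)
{
    if (x < cMin) return cMin;
    if (x > cMax) return cMax;
    return x;
}

/* Replacement: clamp from above, then from below (assumes cMin <= cMax). */
int clamp_minmax(int x, int cMin, int cMax)
{
    return imax(imin(x, cMax), cMin);
}

/* Old formula: round 3*iQP/4 to the nearest integer via floating point. */
int scaled_qp_float(int iQP)
{
    return (int)(3.0F * (float)iQP / 4.0F + 0.5);
}

/* New formula: the same rounding in integer arithmetic. Adding 2 before
 * the shift by 2 corresponds to adding 0.5 before truncation. */
int scaled_qp_int(int iQP)
{
    return (3 * iQP + 2) >> 2;
}
```

The two `scaled_qp` variants agree for non-negative `iQP`. For negative values they can differ, because the cast truncates toward zero while an arithmetic right shift rounds toward minus infinity.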

  9. Performed Optimizations (3)
     CopyBlockFromFrame (1)

         for (j=0;j<blocksize;j++) {
             for (i=0;i<blocksize;i++) {
                 x0 = bx*blocksize + i;
                 y0 = by*blocksize + j;
                 start = y0*xsize + x0;
                 dest[j*blocksize+i] = frame[start];
             }
         }

     • 1. Loop optimization
       • overhead reduced: computations from inner loop set before it
     • 2. Loop unrolling
       • copy done multiple times and fewer repetitions in inner loop

  10. Performed Optimizations (4)
      CopyBlockFromFrame (2)

          int startdest;
          x0 = bx*blocksize;
          y0 = by*blocksize;
          startdest = 0;
          start = y0*xsize + x0;
          for (j=0;j<blocksize;j++) {
              for (i=0;i<blocksize-1;i+=4) {
                  dest[startdest+i] = frame[start+i];
                  dest[startdest+i+1] = frame[start+i+1];
                  dest[startdest+i+2] = frame[start+i+2];
                  dest[startdest+i+3] = frame[start+i+3];
              }
              startdest += blocksize;
              start += xsize;
          }

      Cycles saved:
          -125,000 cycles
          -275,000 cycles
          =========
          -400,000 cycles

      Parameter blocksize must be a multiple of 4!
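To make the two CopyBlockFromFrame variants comparable outside the original code base, here is a self-contained sketch of both. The function names, the parameter order and the `unsigned char` pixel type are assumptions, since the slides show only the loop bodies. The unrolled variant asserts the multiple-of-4 requirement stated on the slide.

```c
#include <assert.h>

/* Straightforward version: recomputes the source index for every pixel. */
void copy_block_ref(const unsigned char *frame, unsigned char *dest,
                    int bx, int by, int blocksize, int xsize)
{
    int i, j;
    for (j = 0; j < blocksize; j++) {
        for (i = 0; i < blocksize; i++) {
            int x0 = bx * blocksize + i;
            int y0 = by * blocksize + j;
            dest[j * blocksize + i] = frame[y0 * xsize + x0];
        }
    }
}

/* Optimized version: row offsets are hoisted out of the inner loop and
 * updated by addition, and the inner loop copies four pixels per pass. */
void copy_block_unrolled(const unsigned char *frame, unsigned char *dest,
                         int bx, int by, int blocksize, int xsize)
{
    int i, j;
    int startdest = 0;
    int start = by * blocksize * xsize + bx * blocksize;

    assert(blocksize % 4 == 0);  /* required by the 4x unrolling */

    for (j = 0; j < blocksize; j++) {
        for (i = 0; i < blocksize; i += 4) {
            dest[startdest + i]     = frame[start + i];
            dest[startdest + i + 1] = frame[start + i + 1];
            dest[startdest + i + 2] = frame[start + i + 2];
            dest[startdest + i + 3] = frame[start + i + 3];
        }
        startdest += blocksize;  /* next row in the block */
        start += xsize;          /* next row in the frame */
    }
}
```

Both functions should fill `dest` identically for any block that lies fully inside the frame, which makes the reference version a convenient check when tuning the unrolled one.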

  11. Performed Optimizations (5)
      • DC_Scaler: if-expression rebuilt (-30,000 cycles)

          before:
              if ((a >= 1) && (a <= 4)) result = ...;
              else if ((a >= 5) && (a <= 8)) result = ...;
              else if ...

          after:
              if (a >= 1)
                  if (a >= 5)
                      if (a >= 9) ...
                      else return ...;
                  else return ...;
              return -1;

      • QuantizeIntraDCcoef: min() and max() replaced by IMIN and IMAX (-22,000 cycles)
      • tmalQuantProcessData & DC_Scaler: *2 and /2 replaced by << 1 and >> 1 (-20,000 cycles)

      Cycles saved:
          -30,000 cycles
          -22,000 cycles
          -20,000 cycles
          =========
          -72,000 cycles
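The rebuilt if-expression can be sketched as follows. The real DC_Scaler return values are elided on the slide, so this example uses hypothetical return values (a range index 0, 1 or 2) and a hypothetical upper bound of 12, purely to show that the nested form needs one comparison per level instead of two per range.

```c
/* Chained form: up to two comparisons per range.
 * Return values and the upper bound 12 are hypothetical. */
int range_chain(int a)
{
    if ((a >= 1) && (a <= 4)) return 0;
    else if ((a >= 5) && (a <= 8)) return 1;
    else if ((a >= 9) && (a <= 12)) return 2;
    return -1;
}

/* Nested form: each lower bound is tested once; the else branch of a
 * level already implies the upper bound of the range below it. */
int range_nested(int a)
{
    if (a >= 1) {
        if (a >= 5) {
            if (a >= 9) {
                if (a <= 12) return 2;
            } else {
                return 1;
            }
        } else {
            return 0;
        }
    }
    return -1;
}
```

On the shift replacement from the same slide: `x << 1` and `x >> 1` match `x * 2` and `x / 2` for non-negative values. For negative `x`, integer division truncates toward zero while the result of `>>` is implementation-defined, so the substitution needs the value range to be known.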

  12. Optimization Results (1)

                                               original functions      optimized functions
      Function                    Executions   Total Cycles      (%)   Total Cycles      (%)
      -------------------------   ----------   ------------   ------   ------------   ------
      _QuantizeIntraDCTcoefMPEG          288        2969052    28.56        2170540    24.72
      _CopyBlockFromFrame                288         684750     6.59         280202     3.19
      _checkrange                      18144         362909     3.49              -        -
      _DC_Scaler                         576          51138     0.49          16737     0.19
      _QuantizeIntraDCCoef               288          39453     0.38          21457     0.24
      _QuantMacroblock                    48          27507     0.26          27594     0.31
      _tmalQuantProcessData                1          14355     0.14          14371     0.16
      _tmalQuantStart                      1           2332     0.02           2424     0.03
      -------------------------   ----------   ------------   ------   ------------   ------
      total/average                    60784       10396474   100.00        8780103   100.00

      Only functions from Quant.c and tmalQuant.c

  13. Optimization Results (2)
      • -38.0% cycles in optimized functions
      • -15.5% cycles over all functions
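Both percentages can be recomputed from the cycle counts in the results table. In the sketch below the overall figure comes out at about 15.5%, and summing only the eight listed functions gives roughly 39%, close to the 38.0% quoted on the slide. The function and array names are illustrative; the removed `_checkrange` is entered as 0 cycles in the optimized column.

```c
/* Cycle counts of the listed functions, in table order. */
static const long original_cycles[8] = {
    2969052, 684750, 362909, 51138, 39453, 27507, 14355, 2332
};
static const long optimized_cycles[8] = {
    2170540, 280202, 0 /* _checkrange inlined */, 16737, 21457, 27594,
    14371, 2424
};

/* Sum of the first n entries of a cycle table. */
long sum_cycles(const long *cycles, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += cycles[i];
    return total;
}

/* Percentage reduction when going from `before` to `after` cycles. */
double reduction_pct(long before, long after)
{
    return 100.0 * (double)(before - after) / (double)before;
}
```

`reduction_pct(10396474, 8780103)` gives the reduction over all functions, and `reduction_pct(sum_cycles(original_cycles, 8), sum_cycles(optimized_cycles, 8))` gives the reduction within the listed functions.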