
Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology



Presentation Transcript


  1. Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology. Speakers: Rong Jiang, Xu Jin. Instructor: Yu-Hen Hu.

  2. Outline
     • Introduction
     • MMX/SSE/SSE2
     • MPEG 2 Video Compression
     • What we have done
     • Conclusion

  3. MMX/SSE/SSE2
     • MMX
       • 57 new instructions;
       • eight 64-bit wide MMX registers;
       • 4 new data types (3 packed data types and one 64-bit entity).
     • SSE
       • 8 new 128-bit SIMD floating-point registers;
       • 50 new instructions that work on packed floating-point data;
       • 8 new instructions to control data cacheability;
       • 12 new instructions that extend the MMX instruction set.
     • SSE2
       • Supports 64-bit (double-precision) floating-point values.

  4. MPEG 2 video compression

  5. Project outline
     1. Obtain C source code for an MPEG2 encoder/decoder
     2. Generate profiling information
     3. Identify the kernels
     4. Rewrite the kernels using SSE
     5. Measure performance results

  6. Profiling results of the original code
     [Figure: profiling charts for mpeg2decode and mpeg2encode; labeled functions: idct(), dist1(), fdct()]

  7. Example 1 – optimizing dist1()

     This code segment sums the absolute differences of 16 pixel pairs. It is used when calculating residual matrices in the prediction stage of the encoder.

     Original C code:

         if ((v = p1[0] - p2[0]) < 0) v = -v; s += v;
         if ((v = p1[1] - p2[1]) < 0) v = -v; s += v;
         if ((v = p1[2] - p2[2]) < 0) v = -v; s += v;
         if ((v = p1[3] - p2[3]) < 0) v = -v; s += v;
         if ((v = p1[4] - p2[4]) < 0) v = -v; s += v;
         if ((v = p1[5] - p2[5]) < 0) v = -v; s += v;
         if ((v = p1[6] - p2[6]) < 0) v = -v; s += v;
         if ((v = p1[7] - p2[7]) < 0) v = -v; s += v;
         if ((v = p1[8] - p2[8]) < 0) v = -v; s += v;
         if ((v = p1[9] - p2[9]) < 0) v = -v; s += v;
         if ((v = p1[10] - p2[10]) < 0) v = -v; s += v;
         if ((v = p1[11] - p2[11]) < 0) v = -v; s += v;
         if ((v = p1[12] - p2[12]) < 0) v = -v; s += v;
         if ((v = p1[13] - p2[13]) < 0) v = -v; s += v;
         if ((v = p1[14] - p2[14]) < 0) v = -v; s += v;
         if ((v = p1[15] - p2[15]) < 0) v = -v; s += v;

     SSE2 version (GCC inline assembly):

         asm volatile (
             "movdqu  (%1), %%xmm0\n\t"     /* load 16 bytes from p1 */
             "movdqu  (%2), %%xmm1\n\t"     /* load 16 bytes from p2 */
             "psadbw  %%xmm0, %%xmm1\n\t"   /* one partial sum per 64-bit half */
             "movdq2q %%xmm1, %%mm0\n\t"    /* low partial sum */
             "psrldq  $8, %%xmm1\n\t"       /* shift the high half down */
             "movdq2q %%xmm1, %%mm1\n\t"    /* high partial sum */
             "paddd   %%mm1, %%mm0\n\t"     /* add the two partial sums */
             "movd    %%mm0, %0"            /* MMX is used: run emms before x87 code */
             : "=r"(s)
             : "r"(p1), "r"(p2)
             : "xmm0", "xmm1", "mm0", "mm1", "memory");

     4-5X speed-up, but it can be faster!
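The same 16-pixel sum can also be sketched with compiler intrinsics instead of inline assembly. This is a minimal sketch, not the authors' code: the names `sad16` and `sad16_scalar` are hypothetical, and it assumes a compiler that provides `<emmintrin.h>` and defines `__SSE2__`; otherwise it falls back to the scalar loop.

```c
#include <assert.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Scalar reference: sum of absolute differences over 16 pixels. */
static int sad16_scalar(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs((int)p1[i] - (int)p2[i]);
    return s;
}

/* Hypothetical helper (name is not from the slides). */
static int sad16(const unsigned char *p1, const unsigned char *p2)
{
    assert(p1 != NULL && p2 != NULL);
#if defined(__SSE2__)
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);  /* unaligned load */
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);        /* psadbw: one sum per 64-bit half */
    __m128i hi  = _mm_srli_si128(sad, 8);    /* psrldq: move high half down */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
#else
    return sad16_scalar(p1, p2);             /* portable fallback */
#endif
}
```

Staying in XMM registers avoids the MMX state, so no `emms` is needed in this variant.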

  8. Four ways to write super-fast code
     • Rearrange data fetching to maximize cache hits;
     • Unroll loops to eliminate unnecessary branches;
     • Use SSE instructions to take full advantage of parallelism;
     • Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar microarchitecture.

  9. Example 2 – optimizing idct()

     Three nested loops form the kernel of the DCT:

         for (i = 0; i < 8; i++)
             for (j = 0; j < 8; j++) {
                 partial_product = 0.0;
                 for (k = 0; k < 8; k++)
                     partial_product += c[k][j] * block[i][k];
                 tmp[i][j] = partial_product;
             }

  10. A verbatim translation from C to assembly doesn’t do much better. It misses the whole point of manually writing an assembly procedure.

  11. We need parallelism!
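One way the triple loop from Example 2 might be parallelized is to compute a whole output row at once: for a fixed `i`, broadcast `block[i][k]` and multiply it with row `k` of `c`, four floats per SSE register. The sketch below is an illustration only, not the authors' kernel. The function names are hypothetical, and it assumes single-precision `float` matrices and a compiler that provides `<xmmintrin.h>`.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Scalar reference: the triple loop from the slide. */
static void rowpass_scalar(float c[8][8], float block[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* Hypothetical SIMD variant: the j loop is replaced by two 4-wide vectors. */
static void rowpass_simd(float c[8][8], float block[8][8], float tmp[8][8])
{
#if defined(__SSE__)
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();   /* accumulates tmp[i][0..3] */
        __m128 hi = _mm_setzero_ps();   /* accumulates tmp[i][4..7] */
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]);   /* broadcast one input */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
#else
    rowpass_scalar(c, block, tmp);      /* portable fallback */
#endif
}
```

Each inner iteration now produces eight multiply-adds with four vector instructions, and the accesses to `c` run along rows, which is also friendlier to the cache than the column-wise `c[k][j]` walk in the scalar loop.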

  12. Results
     • 25X speed-up in idct()
     • 4X speed-up in dist1()
     • [Figure: results charts; labels: 68.72%, 50.1s, 34.39%, 16.34s, 13.04%, 9.99%, 2.45s, 3.83s]
     • Experimental results are averaged over 3 runs.

  13. Platform Compatibility (1): Algorithm for checking availability of MMX

         bool isMMXSupported()
         {
             int fSupported;
             asm {
                 mov eax, 1            // CPUID level 1
                 cpuid                 // EDX = feature flags
                 and edx, 0x800000     // test bit 23 of the feature flags
                 mov fSupported, edx   // != 0 if MMX is supported
             }
             return fSupported != 0;
         }

  14. Platform Compatibility (2): Algorithm for checking availability of SSE

     [Flowchart: SSE? Y: SSE routine. N: MMX? Y: MMX routine. N: normal routine. END]

         bool isISSESupported()
         {
             int processor;
             int features;
             int extfeatures = 0;
             asm {
                 pusha
                 mov eax, 1
                 cpuid
                 mov processor, eax      // store processor family/model/step
                 mov features, edx       // store feature bits
                 mov eax, 080000000h
                 cpuid                   // check which extended functions can be called
                 cmp eax, 080000001h     // extended feature bits
                 jb nofeatures           // jump if not supported
                 mov eax, 080000001h     // select function 0x80000001
                 cpuid
                 mov extfeatures, edx    // store extended feature bits
             nofeatures:
                 popa
             }
             if (((features >> 25) & 1) != 0)          // bit 25 of the feature bits: SSE
                 return true;
             else if (((extfeatures >> 22) & 1) != 0)  // bit 22 of the extended feature bits
                 return true;
             else
                 return false;
         }
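The flowchart above (try SSE, then MMX, then the normal routine) could be wired up at run time roughly as follows. This is a sketch under stated assumptions: it uses the `__get_cpuid` helper from the `<cpuid.h>` header shipped with GCC and Clang rather than the inline assembly on the slides, every name in it is hypothetical, and the MMX and SSE routines are stubs that simply call the plain C routine.

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

enum simd_level { SIMD_NONE = 0, SIMD_MMX = 1, SIMD_SSE = 2 };

/* Returns the best instruction set reported by CPUID level 1 (EDX). */
static enum simd_level detect_simd(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return SIMD_NONE;
    if (edx & (1u << 25)) return SIMD_SSE;   /* bit 25: SSE */
    if (edx & (1u << 23)) return SIMD_MMX;   /* bit 23: MMX */
#endif
    return SIMD_NONE;
}

typedef int (*dist_fn)(const unsigned char *, const unsigned char *);

/* Normal routine: plain C sum of absolute differences over 16 pixels. */
static int dist1_c(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++) {
        int v = p1[i] - p2[i];
        s += (v < 0) ? -v : v;
    }
    return s;
}

/* Stubs standing in for hand-optimized kernels (assumption, not real code). */
static int dist1_mmx_stub(const unsigned char *p1, const unsigned char *p2)
{ return dist1_c(p1, p2); }
static int dist1_sse_stub(const unsigned char *p1, const unsigned char *p2)
{ return dist1_c(p1, p2); }

/* Flowchart: SSE? -> SSE routine; else MMX? -> MMX routine; else normal. */
static dist_fn select_dist1(enum simd_level level)
{
    if (level == SIMD_SSE) return dist1_sse_stub;
    if (level == SIMD_MMX) return dist1_mmx_stub;
    return dist1_c;
}
```

Selecting the routine once at start-up (for example `dist_fn dist1 = select_dist1(detect_simd());`) keeps the CPUID check out of the inner loops.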

  15. Thank you!
