Compiler++ Evolving the compiler - C2.DLL

Jim Radigan - Architect, C++ Optimizer

Mission: Evolving the C++ compiler

Evolve the red arrow $87.7B • 1. ~Absolute Correctness • 2. Compiler throughput • 3. Code size • 4. Code quality $100.0B+

3,100,000 Transistors

Ivy Bridge 1.4 Billion Transistors

Tegra 3 - 5 cores / 128-bit vector instructions

Haswell C++

Built with C++ • Windows, SQL, Office • Mission-critical correctness and compile time

Compiler++ “Evolving the compiler” • How we work • Core Technologies • Where we are going

Full compile, test build Windows – N hours • 24 cores + 32 GB memory • 3 RAID 0 drives

… if you’re in a hurry – 40 cores

X86, ARM, X64 - retail and checked

N Applications - then stress a compiler’s build

Compiler developer – bad day

Win8 improved – but still a work/life balance thing

Compiler++ “Evolving the compiler” • How we work • Core Technologies • Where we are going

“Compiler Business” • Absolutely NO new compiler optimization switches • Each switch would cost millions $$

Core Technologies • Code size / stack size / data alignment • Vectorization/Parallelization of existing C++ • Security • Parallelizing C++ control flow • Alias analysis • FOR ALL HARDWARE & RUNTIMES!!

Code Size / Stack Size
Foo (int p1, int p2, int p3) {
    int w, x, y, z
    ….
    if (flag) {
        w =
        x = w + z
        …
        return x
    }
    else {
        y =
    }

Stack frame:
[ebp+10] Parameter 3
[ebp+0C] Parameter 2
[ebp+08] Parameter 1
[ebp+04] Return address
[ebp+00] Old ebp
[ebp-04] Local 1 // w
[ebp-08] Local 2 // x
[ebp-0C] Local 3 // z or y

Stack Packing
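A minimal sketch of the idea behind stack packing, assuming each local can be summarized by one linear lifetime (first use to last use). `Lifetime` and `pack_stack` are hypothetical names for illustration; a production optimizer works on the control-flow graph rather than on simple intervals. Locals whose lifetimes never overlap, like z and y on the previous slide, can share one frame slot.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical description of a local: the first and last program point
// (inclusive) at which it is live.
struct Lifetime {
    std::string name;
    int first_use;
    int last_use;
};

// Greedy slot assignment: visit locals in order of first use and reuse the
// lowest-numbered slot whose previous occupant is already dead.
// Returns one slot index per local, in the same order as the input.
std::vector<std::size_t> pack_stack(const std::vector<Lifetime>& vars) {
    std::vector<std::size_t> order(vars.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return vars[a].first_use < vars[b].first_use;
                     });

    std::vector<int> slot_busy_until;            // last_use of each slot's occupant
    std::vector<std::size_t> slot_of(vars.size());
    for (std::size_t v : order) {
        std::size_t s = 0;
        while (s < slot_busy_until.size() &&
               slot_busy_until[s] >= vars[v].first_use) {
            ++s;                                 // slot still occupied, try the next one
        }
        if (s == slot_busy_until.size()) {
            slot_busy_until.push_back(vars[v].last_use);  // open a new slot
        } else {
            slot_busy_until[s] = vars[v].last_use;        // reuse a dead slot
        }
        slot_of[v] = s;
    }
    return slot_of;
}
```

With w, x and z live in the `if` branch and y live only afterwards, four locals fit in three slots. Which dead slot y reuses is an arbitrary choice of this greedy order; the slide pairs it with z.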

It's all about… CACHE LINES

NTSTATUS
NtfsCommonRead (
    PIRP_CONTEXT IrpContext,
    PIRP Irp,
    BOOLEAN AcquireScb
    )
{
    NTSTATUS Status;
    PIO_STACK_LOCATION IrpSp;
    PFILE_OBJECT FileObject;
    TYPE_OF_OPEN TypeOfOpen;
    PVCB Vcb;
    PFCB Fcb;
    PSCB Scb;
    PCCB Ccb;
    ATTRIBUTE_ENUMERATION_CONTEXT AttrContext;
    EOF_WAIT_BLOCK EofWaitBlock;
    PFSRTL_ADVANCED_FCB_HEADER Header;
    PTOP_LEVEL_CONTEXT TopLevelContext;
    VBO StartingVbo;
    LONGLONG ByteCount;
    LONGLONG ByteRange;
    ULONG RequestedByteCount;
    PCOMPRESSION_SYNC CompressionSync = ((void *)0);
    BOOLEAN FoundAttribute = 0;
    BOOLEAN PostIrp = 0;
    BOOLEAN OplockPostIrp = 0;
    BOOLEAN ScbAcquired = 0;
    BOOLEAN ReleaseScb;
    BOOLEAN PagingIoAcquired = 0;
    BOOLEAN DoingIoAtEof = 0;
    BOOLEAN Wait;
    BOOLEAN PagingIo;
    BOOLEAN NonCachedIo;
    BOOLEAN SynchronousIo;
    BOOLEAN CompressedIo = 0;

            __try {
                NtfsPrePostIrp( IrpContext, Irp );
                if (( (((Fcb->FcbState) & ((0x00000004)))) ) &&
                    ( (((Scb->ScbState) & ((0x00000010)))) )) {
                    FsRtlPostPagingFileStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
                } else {
                    FsRtlPostStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
                }
                (void) KeWaitForSingleObject( Event, Executive, KernelMode, 0, ((void *)0) );
                Status = ((NTSTATUS)0x00000103L);
            } __finally {
                if (Resource != ((void *)0)) {
                    (ExReleaseResourceLite(Resource));
                }
                ExFreeToNPagedLookasideList( &NtfsKeventLookasideList, Event );
            }
        } else {
            if (Irp->Tail.Overlay.AuxiliaryBuffer != ((void *)0)) {
                IrpContext->Union.AuxiliaryBuffer =
                    (PFSRTL_AUXILIARY_BUFFER)Irp->Tail.Overlay.AuxiliaryBuffer;
                if (!( (((IrpContext->Union.AuxiliaryBuffer->Flags) & (0x00000001))) )) {
                    Irp->Tail.Overlay.AuxiliaryBuffer = ((void *)0);
                }
            }
            Status = NtfsCommonRead( IrpContext, Irp, 1 );
        }
        break;
    }
    __except (NtfsExceptionFilter( IrpContext, (struct _EXCEPTION_POINTERS *)_exception_info() )) {
        NTSTATUS ExceptionCode;
        ExceptionCode = _exception_code();
        if (ExceptionCode == ((NTSTATUS)0xC0000123L)) {
            IrpContext->ExceptionStatus = ExceptionCode = ((NTSTATUS)0xC0000011L);
            Irp->IoStatus.Information = 0;
        }
    }
[Try region tree labels: ROOT, TRY, EXCEPT, TRY, FINALLY]

Try Region Graph – asynchronous lifetimes
int x, y;
_try {
    _try {
        x =
    }
    _finally {
    }
    = x + …
    y =
_except (filter()) {
    = y
}
[Region graph labels: ROOT, TRY (= x), EXCEPT, TRY (x =), FINALLY]

Recall …Compiler dev. primary concern

C++ Core Technologies • Code size / stack size / data alignment • Vectorization/Parallelization of existing C++ • Security • Parallelizing C++ control flow • Alias analysis

C++ Compiler - Auto Parallelism

Vector - all loads before all stores
“addps xmm1, xmm0”
[Diagram: xmm0 + xmm1 → xmm1]

Simple vector add loop - unaligned
for (i = 0; i < 1000; i++)
    A[i] = B[i] + C[i];

for (i = 0; i < 1000/4; i++) {
    movups xmm0, [ecx]
    movups xmm1, [eax]
    addps xmm0, xmm1
    movups [edx], xmm0
}
Compiler looks across loop iterations!

Auto Parallelism/Vectorization for C++
for (iv1 = 0; iv1 <= U1; iv1++) {
    for (iv2 = 0; iv2 <= U2; iv2++) {
        ...
        for (ivn = 0; ivn <= Un; ivn++) {
            t13 = OPLOAD [ a1*iv1 + a2*iv2 + ... + an*ivn + sym_expression ]
        }
    }
}

Math in the compiler - Legal to vectorize?
FOR ( j = 2; j <= 5; j++)
    A( j ) = A( j-1 ) + A( j+1 )
Not Equal !!
A(2:5) = A(1:4) + A(3:6)
A(3) = ?

Vector Semantics • ALL loads before ALL stores
A(2:5) = A(1:4) + A(3:6)
VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2 // A(3) = F( A(2), A(4) )
STORE(A(2:5)) = VR3

Vector Semantics • Instead - load store load store ...
FOR ( j = 2; j <= 257; j++)
    A( j ) = A( j-1 ) + A( j+1 )
A(2) = A(1) + A(3)
A(3) = A(2) + A(4) // A(3) = F( A(1), A(2), A(3), A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
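The difference between the two semantics can be checked on a tiny array. This is a sketch for illustration, not the compiler's code: `scalar_loop` and `vector_statement` are hypothetical names, the loop bounds are j = 2..5, and the array sections the loop actually touches are A(1:4) and A(3:6). Index 0 is left unused so the subscripts match the 1-based slide notation.

```cpp
#include <array>
#include <cstddef>

using Array8 = std::array<int, 8>;  // indices 1..7 are used, index 0 is padding

// Sequential semantics: load, store, load, store ...
// Each iteration reads the A(j-1) that the previous iteration just wrote.
Array8 scalar_loop(Array8 a) {
    for (std::size_t j = 2; j <= 5; ++j) {
        a[j] = a[j - 1] + a[j + 1];
    }
    return a;
}

// Vector semantics: ALL loads complete before ANY store happens.
Array8 vector_statement(Array8 a) {
    std::array<int, 4> vr1{}, vr2{}, vr3{};
    for (std::size_t k = 0; k < 4; ++k) vr1[k] = a[1 + k];        // VR1 = LOAD(A(1:4))
    for (std::size_t k = 0; k < 4; ++k) vr2[k] = a[3 + k];        // VR2 = LOAD(A(3:6))
    for (std::size_t k = 0; k < 4; ++k) vr3[k] = vr1[k] + vr2[k]; // VR3 = VR1 + VR2
    for (std::size_t k = 0; k < 4; ++k) a[2 + k] = vr3[k];        // STORE(A(2:5)) = VR3
    return a;
}
```

Starting from A(1..7) = 1..7, the scalar loop produces A(3) = 8 because it sees the updated A(2), while the vector statement produces A(3) = 6 from the original A(2). A(2) is the same in both, since nothing it reads has been overwritten yet.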

Doubled the optimizer
A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2 )
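One classic, conservative way to answer the question above is the GCD test, sketched here under the assumption of a single loop and linear subscripts. It is a textbook technique and not necessarily what C2.DLL implements; `may_alias_gcd` is a hypothetical name. The equation a1*I + c1 = a2*I' + c2 has an integer solution only if gcd(a1, a2) divides c2 - c1, so a failed test proves independence, while a passed test only means "maybe" because loop bounds are ignored. The code needs C++17 for `std::gcd`.

```cpp
#include <numeric>

// Returns false only when A(a1*I + c1) and A(a2*I' + c2) can never refer to
// the same element for any integers I and I'. Loop bounds are ignored, so
// "true" means "possibly dependent", not "definitely dependent".
bool may_alias_gcd(long a1, long c1, long a2, long c2) {
    const long g = std::gcd(a1, a2);  // gcd of the absolute values
    if (g == 0) {
        return c1 == c2;              // both subscripts are constants
    }
    return (c2 - c1) % g == 0;
}
```

For example, A(2*I) and A(2*I' + 1) touch only even and only odd elements, so the test reports no dependence, whereas A(I - 1) and A(I + 1) from the earlier slides may overlap.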

for (size_t j = 0; j < numBodies; j++) {
    D3DXVECTOR4 r;
    r.x = A[j].pos.x - pos.x;
    r.y = A[j].pos.y - pos.y;
    r.z = A[j].pos.z - pos.z;
    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
    distSqr += softeningSquared;
    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = fParticleMass * invDistCube;
    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
}
Legal math? Complex C++. Not just arrays!

Legal? Where’s the base of the array?
void foo(int n, float *a, float *b, float *c) {
    for (int j = 0; j < n; j++) {
        *a++ = *b++ + *c++;
    }
}

…and where’s the IV?
A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2 )
void
transform1(int* first1, int* last1, int* first2, int* result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}
STL – source code

Parallelizing C++ requires transformation to analyze

while (first1 != last1) {
    *result++ = *first1++ + *first2++;
}

int synthetic_i;
int synthetic_upper = (last1 - first1 + 4)/4;
for (synthetic_i = 0; synthetic_i < synthetic_upper; synthetic_i++) {
    result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
}
STL – source code
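The rewrite can be tried at source level. This is a sketch, not the optimizer's internal form: `transform_pointers` and `transform_indexed` are hypothetical names, and the trip count is computed with C++ pointer subtraction, which already counts elements, instead of the slide's address-based formula. The second function exposes what the analysis needs: fixed array bases and one induction variable with a known upper bound.

```cpp
#include <cstddef>

// Original shape: the bases move on every iteration and there is no
// explicit induction variable.
void transform_pointers(const int* first1, const int* last1,
                        const int* first2, int* result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}

// Rewritten shape: a synthetic induction variable indexes three fixed bases.
void transform_indexed(const int* first1, const int* last1,
                       const int* first2, int* result) {
    const std::ptrdiff_t synthetic_upper = last1 - first1;  // element count
    for (std::ptrdiff_t synthetic_i = 0; synthetic_i < synthetic_upper; ++synthetic_i) {
        result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
    }
}
```

Both functions write the same values for the same inputs; only the second one has subscripts of the form A(a * I + c) that a dependence test can reason about.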

Now …C++ vector code gen • We don’t know if the array bases overlap • We don’t know what the target ISA is • We don’t know if the trip count is divisible by 4

j = 0;
if (!overlap(result, first1) && !overlap(result, first2)) {
    if (_ISA_AVAILABLE(AVX2)) {
        for (i = 0; i + 3 < synthetic_upper; i += 4) { // Vector + Parallel Loop
            result[i : i + 3] = first1[i : i + 3] + first2[i : i + 3];
        }
        j = i;
    }
}
for (; j < synthetic_upper; j++) { // Sequential or cleanup loop
    result[j] = first1[j] + first2[j];
}
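A portable sketch of this loop versioning, assuming a hypothetical `overlap` helper that compares address ranges and leaving out the ISA check (real generated code would also test the target instruction set at run time). The four-wide block stands in for one vector instruction: all four sums are computed before any result is stored. `add_versioned` is a name invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the n-element int ranges starting at p and q
// share at least one byte.
bool overlap(const int* p, const int* q, std::size_t n) {
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    const auto b = reinterpret_cast<std::uintptr_t>(q);
    const std::uintptr_t bytes = n * sizeof(int);
    return a < b + bytes && b < a + bytes;
}

void add_versioned(const int* first1, const int* first2, int* result, std::size_t n) {
    std::size_t i = 0;
    if (!overlap(result, first1, n) && !overlap(result, first2, n)) {
        // "Vector" version: four elements per step, loads before stores.
        for (; i + 4 <= n; i += 4) {
            int lane[4];
            for (std::size_t k = 0; k < 4; ++k) lane[k] = first1[i + k] + first2[i + k];
            for (std::size_t k = 0; k < 4; ++k) result[i + k] = lane[k];
        }
    }
    // Sequential version, which also serves as the cleanup loop for the
    // last n % 4 elements after the vector loop.
    for (; i < n; ++i) {
        result[i] = first1[i] + first2[i];
    }
}
```

When the ranges overlap, the guard fails and the sequential loop runs from index 0, so the original load-store-load-store order is preserved.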

Maps C++ to all forms of Parallelism • Vector • Vector + Parallel • SPMD

Don’t BSOD… it’s all about lifestyle choices

Heap overflow vulnerability
HRESULT CDocManager::IsValidWMToolsStream(bool* pfValid)
{
    long cbSize;
    if(FAILED(hr = ExtractDataSize(strPath, &cbSize)))
        return S_OK;
    CSmartPtr<BYTE> pBuffer = new BYTE[cbSize];
    ExtractData(strPath, pBuffer, cbSize);
    long dwCheckSum = DwChecksumFromLpvCb(0, pBuffer, cbSize);
    long dwStreamCnt = GetStreamCount(m_pVisitedTree);
    if(FAILED(hr = ExtractDataSize(kszCheckSumStream, &cbSize))) {
        return S_OK;
    }
    //ExtractData(kszCheckSumStream, pBuffer, cbSize);
    for(int i=0; i<cbSize; i++) {
        *pBuffer++ = *kszCheckSumStream++;
    }
}
1. cbSize assigned 4470
2. allocate buffer with 4470 bytes
3. cbSize re-assigned 4496
Heap Overflow! Leads to Hijack
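A sketch of the check that would have stopped this overflow, assuming the destination carries its own size. `copy_checked` is a hypothetical function written for this example, not part of the code on the slide; the point is only that the copy length is compared against the allocated size rather than against a variable that was reassigned in between.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies `count` bytes from src into buffer, refusing to write past the end
// of the allocation. The vector remembers the size it was allocated with, so
// a later change to the caller's length variable cannot widen the copy.
void copy_checked(std::vector<unsigned char>& buffer,
                  const unsigned char* src, std::size_t count) {
    if (count > buffer.size()) {
        throw std::length_error("copy would overflow the destination buffer");
    }
    for (std::size_t i = 0; i < count; ++i) {
        buffer[i] = src[i];
    }
}
```

With a 4470-byte buffer, a request to copy 4496 bytes is rejected before the first byte is written, while a copy of 4470 bytes or fewer goes through.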

IE Aurora - Dangling pointer vulnerability
<html><head><script>
var e1;
function f1(evt){
    e1 = document.createEventObject(evt);
    document.getElementById("sp").innerHTML = "";
    window.setInterval(f2, 50);
}
function f2(){
    var t = e1.srcElement;
}
</script></head>
<body>
<span id="sp">
<img src="any.gif" onload="f1(evt)">
</span>
</body></html>
1. Pass onload event (evt) to f1
2. Copy evt, but fail to AddRef on CTreeNode!
3. Destroy img tag in span leading to a free when evt falls out of scope
4. Call f2 async so evt goes out of scope
Hijack! Vtable call via freed CTreeNode
• Red is C++ called from javascript

Vulnerability: “use after free”
[Diagram labels: heap pointer, vtable (function_1, function_2), attack data, attack code]

Illegal - flow or writes • What if the C++ compiler generated code to check? • It would have to always be on • NOT degrade performance !!
Example for: Hardware + Language + Compiler co-design
