Certifying Compilation for Standard ML in a Type Analysis Framework

Leaf Petersen, Carnegie Mellon University

Motivation

Types
• Types capture facts about programs.
  • Fact: This procedure expects a 32-bit integer.
  • Fact: This address points to executable code.
  • Fact: This data structure was produced here.
• Programmers use types:
  • To keep their facts straight.
  • To capture and preserve invariants.
  • To check their facts.
  • Typechecker verifies truth.
  • To manage complexity.

Types and Compilers
[Diagram: compilation pipeline P1:T1, P2, ..., Pn, P.o; only the source program P1 carries a type.]
• Compilers use types.
  • Predict size of data.
  • Eliminate unnecessary dynamic checks.
• Most compilers forget types early.

Type Preserving Compilation
[Diagram: compilation pipeline P1:T1, P2:T2, ..., Pn:Tn, P.o:To; every stage carries a type.]
• Transform types with program.
• Optimize code based on types.
• Verify that invariants still hold.
• Emit types on object code.

TILT
• Type preserving compiler
  • Standard ML.
  • Sparc, Alpha, (now) x86 backends
  • Perry Cheng, Chris Stone, Leaf Petersen, Dave Swasey, and others.
• Intermediate languages are typed
  • Type based optimizations.
  • Internal correctness checks.
• Generates typed x86 object code (this thesis).

Why TILT?
• Want to compile SML efficiently.
• Separate compilation is a must.
• Traditional optimizations.
  • Loop optimizations, CSE, constant folding, and many more.
• New challenges for optimization.
  • Polymorphism, GC, 1st class functions, modules, etc.

Example: Unknown Types
[Figure: a value labeled "Ptr/non-ptr"]
• Module interfaces (and polymorphism) introduce unknown types:
  • Clients are compiled against the interface.
  • Cannot know what t is (it may be instantiated multiple times).
  • Cannot predict the size of a value (if sizes vary).
  • Cannot predict the traceability of a value.

Old Solutions
• C, C++, Java: No unknown types.
  • Objects: “partially known” types.
• Traditional ML/Lisp compilers: Uniform data representation.
  • All values are the same size (e.g. 32 bits).
  • Large values (e.g. 64-bit floats) must be boxed.
  • Traceability dealt with via tagging (e.g. 31-bit ints).
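The tagging scheme in the last bullet can be sketched as follows. This is a minimal illustration, assuming a 32-bit word whose low bit is 1 for integers and 0 for pointers; that bit convention and the names `tag_int`, `is_int`, and `untag_int` are assumptions for illustration, not taken from the talk.

```python
WORD_MASK = 0xFFFFFFFF  # model a 32-bit machine word


def tag_int(n: int) -> int:
    """Encode a 31-bit signed integer as a word whose low bit is 1 (assumed convention)."""
    if not -(1 << 30) <= n < (1 << 30):
        raise OverflowError("value does not fit in 31 bits")
    return ((n << 1) | 1) & WORD_MASK


def is_int(word: int) -> bool:
    """A collector can skip odd words: under this convention they are never pointers."""
    return word & 1 == 1


def untag_int(word: int) -> int:
    """Recover the signed integer from a tagged word."""
    if not is_int(word):
        raise ValueError("word is a pointer, not a tagged integer")
    if word & 0x80000000:  # sign bit set: reinterpret the word as negative
        word -= 1 << 32
    return word >> 1  # arithmetic shift drops the tag bit
```

The cost is visible in the range check: one bit of every integer is spent on traceability, which is what the type-passing approach tries to avoid.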

TILT Solution
• Types tell size and traceability of data.
• Unknown types are instantiated with known types at runtime.
• Most compilers discard types before generating code.
• TILT: Keep types at runtime and use them to dynamically determine layout and traceability.

Type analysis
• type Optarray[t] =
    Typecase[t] of
      Boxed(Float) => Array64[Float]
    | _ => Array32[t]
• Note:
  • Optarray[Int] == Array32[Int]
  • Optarray[a] where a is unknown is dynamic
• Constructor for type Optarray?
  • optarray[t] : int x t -> Optarray[t]

Type analysis
• optarray[t](len : int, init : t) : Optarray[t] =
    typecase [t] of
      Boxed(Float) => new_array64[Float](len, unbox(init))
    | _ => new_array32[t](len, init)
• For statically known types, reduces at compile time
  • optarray[Int](10,0) = new_array32[Int](10,0)
• For unknown types, reduces at runtime
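The runtime case of the constructor above can be imitated in Python by passing a type representation as an ordinary argument. This is only a sketch: the classes `Int`, `Float`, `Boxed`, and `BoxedValue` are hypothetical stand-ins for TILT's runtime type representations, a Python list stands in for Array32, and `array("d", ...)` stands in for Array64.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Int:
    """Runtime representation of the type Int (hypothetical)."""


@dataclass(frozen=True)
class Float:
    """Runtime representation of the type Float (hypothetical)."""


@dataclass(frozen=True)
class Boxed:
    """Runtime representation of the type Boxed(content) (hypothetical)."""
    content: object


@dataclass
class BoxedValue:
    """A heap-allocated box holding a float."""
    payload: float


def unbox(value: BoxedValue) -> float:
    return value.payload


def optarray(t, length, init):
    """Dispatch on the type representation t, like typecase [t]."""
    if t == Boxed(Float()):
        # Array64[Float]: store the raw doubles, not pointers to boxes.
        return array("d", [unbox(init)] * length)
    # Array32[t]: one word-sized element per slot.
    return [init] * length
```

For example, `optarray(Boxed(Float()), 3, BoxedValue(1.5))` yields an unboxed array of doubles, while `optarray(Int(), 10, 0)` takes the default branch. A compiler that knows `t` statically could pick the branch at compile time, as the slide notes.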

Type-passing Optimizations
• Type analysis:
  • Enables global representation optimizations in the presence of unknown types.
• TILT uses types at runtime for:
  • Better data layouts.
    • Unboxed arrays of 64-bit floats
    • 32-bit ints
    • Optimized sum representations
  • Flattening aggregate arguments into registers.
  • Mostly tag-free garbage collection.

There’s more
• Types can help with generating efficient code.
• But that is not the end of the story...

Mobile Code
• Code has become mobile.
  • May know very little about producer.
• Examples:
  • Web applets.
  • Grid computing.
  • Binary installations/upgrades.
  • Application downloads.
• High risk from malicious/wrong code.

The Certification Problem
• Source language safety is checkable.
  • Typechecker checks the programmer’s facts.
• Raw object code is not checkable.
• Safety relies on trust in:
  • Safety of source language.
  • Correctness/identity of producer/compiler.
  • Integrity of the object code.

Java Approach
• Java bytecode
  • High-level language (almost Java)
  • Can be typechecked
• Interpreted
  • Slow, somewhat complicated
• JIT compiled
  • Somewhat faster, quite complicated
• Large trusted computing base

Certified Code
• Typed object code
  • Types certify safety
• Code consumer
  • Does no compiling
  • Checks that certificate applies (easy)
  • Small trusted computing base
• Several instances exist:
  • TAL: Typed Assembly Language
  • PCC: Proof Carrying Code
  • Many extensions and variations

Certifying Compilers
• Programs in safe languages
  • Types provide needed annotations
  • Compiler can emit code with certificate of type/memory safety
• Certifying compilers exist for:
  • Safe subsets of C (TAL & PCC)
  • Java (PCC)
  • Now for Standard ML

Types in Compilation
• Types can be used to generate efficient code.
• Types can be used to generate certified code.
• Want to combine the two paradigms.

My Thesis
Certifying compilation of type analyzing code is feasible for a full modern language such as Standard ML.

Two compilers
• Theoretical compiler
  • Formal translation
  • Prove important properties
  • Guide the implementation
• Real compiler
  • Follows the structure of the theoretical compiler
  • Targets a real certified code system.

Theory

Theoretical compiler
• Three languages:
  • Singleton free MIL
  • LIL
  • Idealized TAL (ITAL)
• Formal translations:
  • MIL to LIL
  • Closure conversion of LIL code
  • LIL to ITAL

Languages
• Singleton free MIL
  • Lambda calculus
  • Syntactic restriction to named form
  • Type analysis through primitives
• LIL
  • Much more fine-grained than MIL
    • Type and type analysis representation
    • Closure representation
• ITAL
  • Machine language
  • Idealized TAL
  • Simplified TAL with LX primitives for type analysis

Translations
• MIL to LIL
  • Very different type structure
  • Moderately different term structure
  • See my dissertation.
• Closure conversion
  • Very standard
• LIL to ITAL
  • Type structure is almost identical
  • Term structure is very different
    • Explicit control flow
    • Binding replaced with state modification

LIL typing
Ψ;Δ;Γ ⊢ e : τ
• Ψ – LIL heap context
• Δ – LIL type context
• Γ – LIL term context
• e – LIL expression (named form)
• τ – LIL type for e

ITAL typing
Ψ;Δ;M ⊢ I ok
• Ψ – ITAL heap context
• Δ – ITAL type context
• M – ITAL register file type
• I – ITAL instruction sequence

Register files
• A register file type M maps registers to ITAL types
  • e.g. M(r) = τ
• Notation: M{r:τ} means M with the type of r set to τ.
• Designated stack pointer register sp
  • M(sp) = σ
  • σ describes the stack slots
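The update notation on this slide behaves like a functional map update. A small sketch, assuming a register file type is modeled as a Python dict from register names to type names and a stack type as a tuple of slot types; both encodings and the name `update` are illustrative assumptions.

```python
def update(M: dict, r: str, tau) -> dict:
    """Return the register file type M{r:tau}; M itself is left unchanged."""
    return {**M, r: tau}


# M(r1) = int, and M(sp) is a stack type with two slots.
M = {"r1": "int", "sp": ("int", "float")}
M2 = update(M, "r1", "float")
```

Keeping the update non-destructive matches the typing rules, where each instruction is checked against its own register file type.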

LIL to ITAL Translations
• |Ψ| - heap context translation
• |Δ| - type context translation
• |τ| - type translation
• Expression e maps to instruction sequence I
• But what is the translation of a term context?

Register files
• LIL variables occupy ITAL registers (or stack slots)
  • Hence, the translation of a LIL context is an ITAL register file.
• Problem: what register file?
  • Variables are related to registers via register allocation.

Register allocation
• Previous work builds register allocation into the translation.
  • Complex and tedious
  • Unclear how to incorporate real RA (e.g. graph coloring)
  • Consequently, toy register allocators are used in formal presentations
• Better idea: translate with respect to an abstract register allocator.

Allocator
Definition: An allocator A is an object such that:
• For every variable x:
  • A(x) = r or A(x) = sp(i)
• frmsz(A) is a natural number
• For every LIL typing context Γ and stack type σ, |Γ|_A = M for some register file type M
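The definition above describes an interface, and can be sketched as one. This is a rough model under stated assumptions: the names `Reg`, `Slot`, `Allocator`, and `translate_context` are hypothetical, the type translation is taken to be the identity, stack types are tuples with the frame on top, and frame slots not assigned to any variable get the placeholder type "junk".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A(x) = r: the variable lives in register `name`."""
    name: str


@dataclass(frozen=True)
class Slot:
    """A(x) = sp(i): the variable lives in frame slot `index`."""
    index: int


@dataclass
class Allocator:
    assignment: dict  # variable name -> Reg | Slot
    frame_size: int   # frmsz(A)

    def __call__(self, x):
        return self.assignment[x]

    def translate_context(self, gamma: dict, sigma: tuple) -> dict:
        """Build a register file type M from a term context and the stack below the frame."""
        frame = ["junk"] * self.frame_size  # assumed type for unused slots
        M = {}
        for x, tau in gamma.items():
            loc = self(x)
            if isinstance(loc, Reg):
                M[loc.name] = tau
            else:
                frame[loc.index] = tau
        M["sp"] = tuple(frame) + tuple(sigma)
        return M
```

Nothing here says how the assignment was computed, which is the point: a graph-coloring allocator and a toy one expose the same interface to the translation.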

Translation judgment
Ψ;Δ;Γ;A,σ ⊢ e : τ ⇝ I
• Ψ – LIL heap context
• Δ – LIL type context
• Γ – LIL term context
• A – Allocator
• σ – describes stack below frame
• I – ITAL instruction sequence
• For this talk, I’m ignoring exceptions and other details.

Translation judgment
Ψ;Δ;Γ;A[z → r1, x → r1, y → r2], σ ⊢ z = x+y : int ⇝ add r1,r2
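The example judgment can be read as a code generator that takes the allocator as a parameter. A minimal sketch, assuming all three variables live in registers and a two-operand `add dst,src` that leaves dst + src in dst; the `mov` instruction and the name `translate_add` are hypothetical and are not ITAL syntax from the talk.

```python
def translate_add(alloc: dict, z: str, x: str, y: str) -> list:
    """Translate `z = x + y` into instructions, consulting the allocator for registers."""
    rz, rx, ry = alloc[z], alloc[x], alloc[y]
    if rz == rx:
        # The destination already holds x: one instruction suffices.
        return [f"add {rz},{ry}"]
    if rz == ry:
        # The destination holds y; addition commutes, so add x into it.
        return [f"add {rz},{rx}"]
    # Otherwise copy x into the destination first (hypothetical mov).
    return [f"mov {rz},{rx}", f"add {rz},{ry}"]
```

With the allocator from the slide, `translate_add({"z": "r1", "x": "r1", "y": "r2"}, "z", "x", "y")` produces the single instruction `add r1,r2`. A different allocator changes the output without changing the translation function.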

Question
Ψ;Δ;Γ;A,σ ⊢ e : τ ⇝ I
• Why should the instruction sequence I be well-typed?
  • Is the equational theory rich enough?
  • Easy to rely on equations that don’t hold
• Want to show soundness:
  • Each translation maps well-typed terms to well-typed terms.
• Doesn’t hold for all allocators: only the good ones.

Good allocator for Γ
Definition: Let M = |Γ|_A. We say that A is a good allocator for Γ if:
• M(sp) = σ_f ∘ σ such that frmsz(A) = f
• The translation of the empty context under A is the empty machine state.
• If Γ = Γ1, x:τ, Γ2 then
  • A is a good allocator for Γ1 and Γ2
  • If A(x) = r then |Γ|_A = |Γ1,Γ2|_A{r:|τ|}
  • If A(x) = sp(i) then something similar.

Good allocator for e
Definition: An allocator A is a good allocator for an expression e if:
• For all derivations of Ψ;Δ;Γ ⊢ e : τ, A is a good allocator for Γ.
• A is a good allocator for all sub-expressions of e.

Soundness
Theorem: If
• A is a good allocator for e,
• Ψ;Δ;Γ ⊢ e : τ,
• σ is a well-formed stack type, and
• Ψ;Δ;Γ;A,σ ⊢ e : τ ⇝ I,
then |Ψ|;|Δ|;M ⊢ I ok, where M = |Γ|_A.

Benefits of this approach
• Theory close to implementation
• Register allocation is a parameter
  • Separates out the mechanism
  • Concise specification of interface between code gen and RA
  • Translation isn’t bogged down with algorithmic details of RA

Downside: completeness
• Depends on register allocator
  • Full completeness doesn’t hold
• Possible to show parametric completeness?
  • Not clear what this means
• Worthwhile tradeoff
  • Formal presentation very close to implementation
• In practice:
  • Soundness is hard (implementation had bugs).
  • Completeness is just a matter of covering all cases.
• Likely that this can be solved (future work)

Summary (Theory)
• Formal translations:
  • MIL to LIL
  • Closure conversion of LIL code
  • LIL to ITAL
• Proof of soundness for each
• New approach to dealing with typed RA
• Provides a guide for...

Practice

Real Compiler
• Implemented a certifying back end for TILT.
  • Targets TAL for x86.
• Type representation and analysis made explicit
  • The GC interface is not made explicit (yet).
• Data layout issues made explicit.
  • Boxing/unboxing.
  • Closure representations.
  • Heap data layout.

[Diagram: SML Source → Elaborate → HIL (Typed) → Phase split → MIL (Typed) → Optimize → MIL (Typed) → Code Gen → RTL (Untyped)]
• Elaborate and phase split
  • Typecheck
  • Eliminate modules
  • Some data rep
• Optimize
  • Shrinking inlining
  • Speculative inlining
  • CSE/Dead code elim
  • Constant folding
  • Uncurrying
  • Monomorphization
  • Flattening
  • Eta reduction
  • Closure conversion
  • Hoisting
  • Others
• Code Gen
  • Code generation
  • Type representation
  • Untyped output!
  • Subsequent compilation is mostly standard.


New TILT IL
• LIL: Low-level internal language
  • Based on LX (Crary & Weirich)
  • Data representation explicit
  • Still lambda calculus-ish
    • Call/return (not CPS)
  • All heap allocation explicit
• Type analysis implemented at the term level
  • Neat trick
  • See the dissertation

[Diagram: Front end → MIL (Typed) → Type rep → LIL (Typed) → Optimize → LIL (Typed) → Closure Conv → LIL (Typed) → Code Gen → TAL (Typed)]
• Type rep
  • Singleton elim
  • Dynamic type reps
  • Data rep structure
  • Unified allocation
• Optimize
  • CSE/Dead code elim
  • Constant folding
  • Eta reduction
  • Switch reduction
  • Others
• Closure Conv
  • Types and terms
  • Recursive code
  • Some opts
• Code Gen
  • Direct to TALx86
  • Reg alloc/cogen
  • Small peephole opts

Compilation
[Diagram: fib.sml → TILT → fib/asm.tal → TALx86 → fib/obj.o, fib/obj.to]
