## Certifying Compilation for Standard ML in a Type Analysis Framework


Leaf Petersen
Carnegie Mellon University

**Motivation**

**Types**
• Types capture facts about programs.
  • Fact: This procedure expects a 32 bit integer.
  • Fact: This address points to executable code.
  • Fact: This data structure was produced here.
• Programmers use types:
  • To keep their facts straight.
  • Capture and preserve invariants.
  • To check their facts.
  • Typechecker verifies truth.
  • Manage complexity.

**Types and Compilers**
[Diagram: compilation pipeline P1:T1, P2, ...., Pn, P.o]
• Compilers use types.
  • Predict size of data.
  • Eliminate unnecessary dynamic checks.
• Most compilers forget types early.

**Type Preserving Compilation**
[Diagram: compilation pipeline P1:T1, P2:T2, ...., Pn:Tn, P.o : To]
• Transform types with program.
• Optimize code based on types.
• Verify that invariants still hold.
• Emit types on object code.

**TILT**
• Type preserving compiler
  • Standard ML.
  • Sparc, Alpha, (now) x86 backends
  • Perry Cheng, Chris Stone, Leaf Petersen, Dave Swasey, and others.
• Intermediate languages are typed
  • Type based optimizations.
  • Internal correctness checks.
• Generates typed x86 object code (this thesis).

**Why TILT?**
• Want to compile SML efficiently.
• Separate compilation is a must.
• Traditional optimizations.
  • Loop optimizations, CSE, constant folding, and many more.
• New challenges for optimization.
  • Polymorphism, GC, 1st class functions, modules, etc.

**Example: Unknown Types**
[Figure label: Ptr/non-ptr]
• Module interfaces (and polymorphism) introduce unknown types:
  • Clients compiled against interface
  • Cannot know what t is (may be instantiated multiple times)
  • Cannot predict size of value (if sizes vary).
  • Cannot predict traceability of value.

**Old Solutions**
• C, C++, Java: No unknown types.
• Objects: “partially known” types.
• Traditional ML/Lisp compilers: Uniform data representation.
  • All values are same size (e.g. 32 bits).
  • Large values (e.g. 64 bit floats) must be boxed.
  • Traceability dealt with via tagging (e.g. 31 bit ints).

**TILT Solution**
• Types tell size and traceability of data.
• Unknown types are instantiated with known types at runtime.
• Most compilers discard types before generating code.
• TILT: Keep types at runtime and use them to dynamically determine layout and traceability.

**Type analysis**
• type Optarray[t] =
    Typecase[t] of
      Boxed(Float) => Array64[Float]
    | _ => Array32[t]
• Note:
  • Optarray[Int] == Array32[Int]
  • Optarray[a] where a is unknown is dynamic
• Constructor for type Optarray?
  • optarray[t] : int x t -> Optarray[t]

**Type analysis**
• optarray[t](len : int, init : t) : Optarray[t] =
    typecase [t] of
      Boxed (Float) => new_array64[Float](len, unbox(init))
    | _ => new_array32[t](len, init)
• For statically known types, reduces at compile time
  • optarray[Int](10,0) = new_array32[Int](10,0)
• For unknown types, reduces at runtime

**Type-passing Optimizations**
• Type analysis:
  • Enables global representation optimizations in the presence of unknown types.
• TILT uses types at runtime for:
  • Better data-layouts.
    • Unboxed arrays of 64 bit floats
    • 32 bit ints
    • Optimized sum representations
  • Flatten aggregate arguments into registers.
  • Mostly tag-free garbage collection.

**There’s more**
• Types can help with generating efficient code.
• But not the end of the story....

**Mobile Code**
• Code has become mobile.
• May know very little about producer.
• Examples:
  • Web applets.
  • Grid computing.
  • Binary installations/upgrades.
  • Application downloads.
• High risk from malicious/wrong code.
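
The `optarray` typecase from the Type analysis slides can be mimicked with explicit runtime type representations. The following is a minimal Python sketch under stated assumptions, not TILT's implementation: the classes `Int`, `Float`, `Boxed`, and `Box`, and the tagged tuples standing in for `new_array64` and `new_array32`, are all hypothetical names chosen for illustration.

```python
from array import array
from dataclasses import dataclass


# Hypothetical runtime type representations (illustrative only).
@dataclass(frozen=True)
class Int:
    pass


@dataclass(frozen=True)
class Float:
    pass


@dataclass(frozen=True)
class Boxed:
    contents: object  # representation of the type inside the box


@dataclass
class Box:
    value: float  # stand-in for a heap-allocated 64 bit float


def unbox(b):
    """Extract the raw float from its box."""
    return b.value


def optarray(t, length, init):
    """Mimic: typecase [t] of Boxed(Float) => array64 | _ => array32."""
    if t == Boxed(Float()):
        # Specialized layout: store the raw doubles, not pointers to boxes.
        return ("Array64", array("d", [unbox(init)] * length))
    # Default layout: one word-sized slot per element.
    return ("Array32", [init] * length)


# The type representation is an ordinary runtime argument, so the same
# code serves callers whose type is only known at runtime.
ints = optarray(Int(), 10, 0)
floats = optarray(Boxed(Float()), 3, Box(1.5))
```

When the type argument is a literal such as `Int()`, a compiler could resolve the branch ahead of time, which corresponds to the slide's remark that the typecase reduces at compile time for statically known types.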

**The Certification Problem**
• Source language safety is checkable.
  • Typechecker checks the programmer's facts.
• Raw object code is not checkable.
• Safety relies on trust in:
  • Safety of source language.
  • Correctness/identity of producer/compiler.
  • Integrity of the object code.

**Java Approach**
• Java bytecode
  • High-level language (almost Java)
  • Can be typechecked
• Interpreted
  • slow, somewhat complicated
• JIT compiled
  • somewhat faster, quite complicated
• Large trusted computing base

**Certified Code**
• Typed object code
  • Types certify safety
• Code consumer
  • Does no compiling
  • Checks that certificate applies (easy)
  • Small trusted computing base
• Several instances exist:
  • TAL: Typed Assembly Language
  • PCC: Proof Carrying Code
  • Many extensions and variations

**Certifying Compilers**
• Programs in safe languages
  • Types provide needed annotations
  • Compiler can emit code with certificate of type/memory safety
• Certifying compilers exist for:
  • Safe subsets of C (TAL & PCC)
  • Java (PCC)
  • Now for Standard ML

**Types in Compilation**
• Types can be used to generate efficient code.
• Types can be used to generate certified code.
• Want to combine the two paradigms.

**My Thesis**
Certifying compilation of type analyzing code is feasible for a full modern language such as Standard ML.

**Two compilers**
• Theoretical compiler
  • Formal translation
  • Prove important properties
  • Guide the implementation
• Real compiler
  • Follows the structure of the theoretical compiler
  • Targets a real certified code system.

**Theory**

**Theoretical compiler**
• Three languages:
  • Singleton free MIL
  • LIL
  • Idealized TAL (ITAL)
• Formal translations:
  • MIL to LIL
  • Closure conversion of LIL code
  • LIL to ITAL

**Languages**
• Singleton free MIL
  • Lambda calculus
  • Syntactic restriction to named form
  • Type analysis through primitives
• LIL
  • Much more fine-grained than MIL
  • type and type analysis representation
  • closure representation
• ITAL
  • Machine language
  • Idealized TAL
  • Simplified TAL with LX primitives for type analysis

**Translations**
• MIL to LIL
  • Very different type structure
  • Moderately different term structure
  • See my dissertation.
• Closure conversion
  • Very standard
• LIL to ITAL
  • Type structure is almost identical
  • Term structure is very different
  • Explicit control flow
  • Binding replaced with state modification

**LIL typing**
$\Psi;\Delta;\Gamma \vdash e : \tau$
• $\Psi$ – LIL heap context
• $\Delta$ – LIL type context
• $\Gamma$ – LIL term context
• $e$ – LIL expression (named form)
• $\tau$ – LIL type for $e$

**ITAL typing**
$\Psi;\Delta;M \vdash I\ \mathrm{ok}$
• $\Psi$ – ITAL heap context
• $\Delta$ – ITAL type context
• $M$ – ITAL register file type
• $I$ – ITAL instruction sequence

**Register files**
• A register file type $M$ maps registers to ITAL types
  • e.g. $M(r) = \tau$
• Notation: $M\{r{:}\tau\}$ means $M$ with the type of $r$ set to $\tau$.
• Designated stack pointer register $\mathtt{sp}$
  • $M(\mathtt{sp}) = \sigma$
  • $\sigma$ describes the stack slots

**LIL to ITAL Translations**
• $|\Psi|$ - heap context translation
• $|\Delta|$ - type context translation
• $|\tau|$ - type translation
• Exp $e$ maps to instruction seq $I$
• But what is the translation of a term context?
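
The register file notation from the Register files slide can be read as a finite map with functional update. A small Python sketch, assuming types are plain strings and a stack type is a tuple of slot types; the helper name `update` and the register names are hypothetical:

```python
def update(M, r, ty):
    """Return a copy of register file type M with register r set to type ty.

    The input map is left unchanged, mirroring the functional reading of
    the update notation on the slide.
    """
    new = dict(M)
    new[r] = ty
    return new


# A register file type: registers map to types, and the designated
# stack pointer "sp" maps to a stack type (here a tuple of slot types).
M = {"r1": "int", "sp": ("int", "int")}

# Give r2 the type int; M itself still has no entry for r2.
M2 = update(M, "r2", "int")
```

Keeping the update non-destructive means a typing rule can mention both the register file before an instruction and the one after it.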

**Register files**
• LIL variables occupy ITAL registers (or stack slots)
• Hence, the translation of a LIL context is an ITAL register file.
• Problem: what register file?
• Variables are related to registers via register allocation.

**Register allocation**
• Previous work builds register allocation into the translation.
  • Complex and tedious
  • Unclear how to incorporate real RA (e.g. Graph coloring)
  • Consequently, toy register allocators are used in formal presentations
• Better idea: translate with respect to abstract register allocator.

**Allocator**
Definition: An allocator $A$ is an object such that:
• For every variable $x$:
  • $A(x) = r$ or $A(x) = \mathtt{sp}(i)$
• $\mathrm{frmsz}(A)$ is a natural number
• For every LIL typing context $\Gamma$ and stack type $\sigma$, $|\Gamma|_A = M$ for some register file type $M$

**Translation judgment**
$\Psi;\Delta;\Gamma;A,\sigma \vdash e : \tau \rightsquigarrow I$
• $\Psi$ – LIL heap context
• $\Delta$ – LIL type context
• $\Gamma$ – LIL term context
• $A$ – Allocator
• $\sigma$ – describes stack below frame
• $I$ – ITAL instruction sequence
• For this talk, I’m ignoring exceptions, other stuff.

**Translation judgment**
$\Psi;\Delta;\Gamma;A[z \to r_1, x \to r_1, y \to r_2],\sigma \vdash z = x+y : \mathtt{int} \rightsquigarrow \mathtt{add}\ r_1,r_2$

**Question**
$\Psi;\Delta;\Gamma;A,\sigma \vdash e : \tau \rightsquigarrow I$
• Why should $I$ be well-typed?
• Is the equational theory rich enough?
  • Easy to rely on equations that don’t hold
• Want to show soundness:
  • Each translation maps well-typed terms to well-typed terms.
• Doesn’t hold for all allocators: only the good ones.

**Good allocator for $\Gamma$**
Definition: Let $M = |\Gamma|_A$. We say that $A$ is a good allocator for $\Gamma$ if:
• $M(\mathtt{sp}) = \sigma_f \circ \sigma$ such that $\mathrm{frmsz}(A) = f$
• $|\bullet|_A$ is the empty machine state.
• If $\Gamma = \Gamma_1, x{:}\tau, \Gamma_2$ then
  • $A$ is a good allocator for $\Gamma_1$ and $\Gamma_2$
  • If $A(x) = r$ then $|\Gamma|_A = |\Gamma_1,\Gamma_2|_A\{r{:}|\tau|\}$
  • If $A(x) = \mathtt{sp}(i)$ then something similar.
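
The idea of translating with respect to an abstract allocator can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the formal translation: the names `Reg`, `Slot`, `Allocator`, `register_file`, and `translate_add` are hypothetical, types are plain strings, only register operands are handled, and the two-operand instructions are assumed to write their first operand, as in the slide's `add r1,r2`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Slot:
    index: int  # sp(i): the i-th slot of the current stack frame


class Allocator:
    """Abstract allocator: each variable lives in a register or a stack slot."""

    def __init__(self, assignment, frame_size):
        self.assignment = dict(assignment)
        self.frame_size = frame_size

    def __call__(self, var):
        return self.assignment[var]

    def frmsz(self):
        return self.frame_size

    def register_file(self, ctx):
        """Build a register file type from a context given as (var, type) pairs.

        Simplifying assumption: a later binding overwrites an earlier one
        that shares the same location.
        """
        regs, frame = {}, [None] * self.frame_size
        for var, ty in ctx:
            loc = self(var)
            if isinstance(loc, Reg):
                regs[loc.name] = ty
            else:
                frame[loc.index] = ty
        regs["sp"] = tuple(frame)
        return regs


def translate_add(A, z, x, y):
    """Translate `z = x + y`, consulting the allocator only through A(var)."""
    rz, rx, ry = A(z), A(x), A(y)
    if not all(isinstance(loc, Reg) for loc in (rz, rx, ry)):
        raise NotImplementedError("stack-slot operands are omitted from this sketch")
    if rz == rx:
        return [f"add {rz.name},{ry.name}"]
    if rz == ry:
        # Addition is commutative, so accumulate x into the shared register.
        return [f"add {rz.name},{rx.name}"]
    return [f"mov {rz.name},{rx.name}", f"add {rz.name},{ry.name}"]


# The allocation from the slide: z and x share r1, y lives in r2.
A = Allocator({"z": Reg("r1"), "x": Reg("r1"), "y": Reg("r2")}, frame_size=0)
code = translate_add(A, "z", "x", "y")
```

The code generator never inspects how the assignment was computed, so a graph-coloring allocator or a toy one could be plugged in without changing `translate_add`.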

**Good allocator for e**
Definition: An allocator $A$ is a good allocator for an expression $e$ if:
• For all derivations of $\Psi;\Delta;\Gamma \vdash e : \tau$, $A$ is a good allocator for $\Gamma$.
• $A$ is a good allocator for all sub-expressions of $e$.

**Soundness**
Theorem: If $A$ is a good allocator for $e$ and $\Psi;\Delta;\Gamma \vdash e : \tau$ and $\sigma$ is a well-formed stack type and $\Psi;\Delta;\Gamma;A,\sigma \vdash e : \tau \rightsquigarrow I$ then $|\Psi|;|\Delta|;M \vdash I\ \mathrm{ok}$ where $M = |\Gamma|_A$

**Benefits of this approach**
• Theory close to implementation
• Register allocation is a parameter
  • Separates out the mechanism
  • Concise specification of interface between code gen and RA
  • Translation isn’t bogged down with algorithmic details of RA

**Downside: completeness**
• Depends on register allocator
• Full completeness doesn’t hold
• Possible to show parametric completeness?
  • Not clear what this means
• Worthwhile tradeoff
  • Formal presentation very close to implementation
• In practice:
  • Soundness is hard (implementation had bugs).
  • Completeness is just a matter of covering all cases.
• Likely that this can be solved (future work)

**Summary (Theory)**
• Formal translations:
  • MIL to LIL
  • Closure conversion of LIL code
  • LIL to ITAL
• Proof of soundness for each
• New approach to dealing with typed RA
• Provides a guide for......

**Practice**

**Real Compiler**
• Implemented a certifying back end for TILT.
• Targets TAL for x86.
• Type representation and analysis made explicit
  • Not gc interface (yet).
• Data layout issues made explicit.
  • Boxing/unboxing.
  • Closure representations.
  • Heap data layout.

[Diagram: SML Source → Elaborate → HIL (Typed) → Phase split → MIL (Typed) → Optimize → MIL (Typed) → Code Gen → RTL (Untyped)]
• Shrinking inlining
• Speculative inlining
• CSE/Dead code elim
• Constant folding
• Uncurrying
• Monomorphization
• Flattening
• Eta reduction
• Closure conversion
• Hoisting
• Others
• Typecheck
• Eliminate modules
• Some data rep
• Code generation
• Type representation
• Untyped output!
• Subsequent compilation is mostly standard.

**New TILT IL**
• LIL: Low-level internal language
  • Based on LX (Crary & Weirich)
• Data representation explicit
• Still lambda calculus-ish
• Call/return (not CPS)
• All heap allocation explicit
• Type analysis implemented at the term level
  • Neat Trick
  • See the dissertation

[Diagram: Front end → MIL (Typed) → Type rep → LIL (Typed) → Optimize → LIL (Typed) → Closure Conv → LIL (Typed) → Code Gen → TAL (Typed)]
• Singleton elim
• Dynamic type reps
• Data rep structure
• Unified allocation
• CSE/Dead code elim
• Constant folding
• Eta reduction
• Switch reduction
• Others
• Types and terms
• Recursive code
• Some opts
• Direct to TALx86
• Reg alloc/cogen
• Small peephole opts

**Compilation**
[Diagram: fib.sml → TILT → fib/asm.tal → TALx86 → fib/obj.o, fib/obj.to]