Understand how parse trees differ from abstract syntax trees (ASTs) and why ASTs are crucial in compiler construction. Covers concrete and abstract syntax, AST data structures, and implementations in C, F#, SML, and Java.
 
                
Abstract Syntax Trees Compiler Baojian Hua bjhua@ustc.edu.cn
Front End: source code → lexical analyzer → tokens → parser → abstract syntax tree → semantic analyzer → IR
Recap • Lexer • turns the program source into a token sequence • Parser • takes the token sequence and answers yes or no (is the program syntactically valid?) • Today's topic: • abstract syntax trees
Abstract Syntax Trees [Figure: parse tree for 15 * (3 + 4)] • Parse trees encode the grammatical structure of the source program • However, they contain a lot of unnecessary information • What is essential here?
Abstract Syntax Trees [Figure: parse tree for 15 * (3 + 4)] • For the compiler to understand an expression, it only needs to know the operators and operands • punctuation, parentheses, etc. are not needed • The same holds for statements, functions, etc.
Abstract Syntax Trees [Figure: the parse tree for 15 * (3 + 4) next to its abstract syntax tree, which has * at the root with children 15 and +, and + with children 3 and 4]
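To make the contrast concrete, here is a minimal Java sketch of the abstract syntax tree for 15 * (3 + 4). The class and method names (`AstShapeDemo`, `Num`, `Add`, `Times`, `eval`) are assumptions made for this illustration and do not come from the slides. Note that no node stores the parentheses: the grouping survives only as nesting.

```java
// Hypothetical names throughout; a sketch, not the lecture's own code.
class AstShapeDemo {
    // Base class for expression nodes.
    static abstract class Exp {}

    static class Num extends Exp {
        final int value;
        Num(int value) { this.value = value; }
    }

    static class Add extends Exp {
        final Exp left, right;
        Add(Exp left, Exp right) { this.left = left; this.right = right; }
    }

    static class Times extends Exp {
        final Exp left, right;
        Times(Exp left, Exp right) { this.left = left; this.right = right; }
    }

    // The AST for 15 * (3 + 4): grouping is expressed by nesting alone.
    static Exp sample() {
        return new Times(new Num(15), new Add(new Num(3), new Num(4)));
    }

    // Evaluate by walking the tree; only operators and operands are needed.
    static int eval(Exp e) {
        if (e instanceof Num) return ((Num) e).value;
        if (e instanceof Add) {
            Add a = (Add) e;
            return eval(a.left) + eval(a.right);
        }
        if (e instanceof Times) {
            Times t = (Times) e;
            return eval(t.left) * eval(t.right);
        }
        throw new IllegalStateException("unknown node");
    }

    public static void main(String[] args) {
        System.out.println(eval(sample())); // 15 * (3 + 4) = 105
    }
}
```

Evaluation is used here only as a simple way to show that the tree alone carries enough information for later phases.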
Concrete vs Abstract Syntax • Concrete syntax is needed for parsing • it includes punctuation symbols, reflects grammar transformations such as factoring and elimination of left recursion, and depends on the format of the input • Abstract syntax is a simpler, more convenient internal representation • it gives a clean interface between the parser and the later phases of the compiler
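Concrete and Abstract Syntax • Input: 2 + 3 * x • Concrete grammar: E ::= E + T | T ; T ::= T * F | F ; F ::= id | num | ( E ) [Figure: parse tree for 2 + 3 * x under this grammar]
Concrete and Abstract Syntax • Input: 2 + 3 * x • Abstract grammar: E ::= id | num | E + E | E * E | ( E ) [Figure: abstract syntax tree with + at the root, children 2 and *, and * with children 3 and x]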
AST Data Structures • In the compiler, abstract syntax uses the data types of the implementation language to represent the grammatical structure • The design depends heavily on both the target language and the implementation language • more art than science
AST for "exp" in C
E -> id | num | E + E | E * E | ( E )
/* data structures */
typedef struct E *E;
enum kind {ID, INT, ADD, TIMES};
struct E {
  enum kind kind;
  union {
    char *id;
    int num;
    struct {E e1; E e2;} add;
    struct {E e1; E e2;} times;
  } u;
};
/* This technique is called a tagged union. */
AST in C
E ::= id | num | E + E | E * E | ( E )
/* sample program "2+3*x"; malloc needs #include <stdlib.h> */
E e1 = malloc (sizeof (*e1));
e1->kind = INT;
e1->u.num = 3;
E e2 = malloc (sizeof (*e2));
e2->kind = ID;
e2->u.id = "x";
E e3 = malloc (sizeof (*e3));
e3->kind = TIMES;
e3->u.times.e1 = e1;
e3->u.times.e2 = e2;
…
/* boring and error-prone :-( */
AST for "stm" in C
SS -> S | SS ; S
S -> id := E | print (E)
/* data structures */
typedef struct S *S;
enum stm_kind {ASSIGN, PRINT};
struct S {
  enum stm_kind kind;
  union {
    struct {char *id; E e;} assign;
    E print;
  } u;
};
/* SS is a list of S. C has no generic List<S>, so a hand-written linked list is assumed here. */
typedef struct SS *SS;
struct SS {S head; SS tail;};
/* to encode "x:=3; print(x)" */
prog = …; // leave to you…
Operations are tree walks
/* pretty printing */
void pp (E e) {
  switch (e->kind) {
  case INT: printf ("%d", e->u.num); return;
  case ID: printf ("%s", e->u.id); return;
  case ADD:
    printf ("("); pp (e->u.add.e1); printf (")");
    printf (" + ");
    printf ("("); pp (e->u.add.e2); printf (")");
    return;
  case TIMES: /* similar */
  default: error ("compiler bug");
  }
}
Operations are tree walks
/* number of nodes in an AST */
int numNodes (E e) {
  switch (e->kind) {
  case INT: return 1;
  case ID: return 1;
  case ADD:
  case TIMES: /* add and times have the same layout, so u.add serves both */
    return 1 + numNodes (e->u.add.e1) + numNodes (e->u.add.e2);
  default:
    error ("compiler bug");
    return 0; /* not reached, assuming error() does not return */
  }
}
C compiler is stupid!
AST for "exp" in F#
E ::= id | num | E + E | E * E | ( E )
(* data structures *)
type E =
  | Int of int
  | Id of string
  | Add of E * E
  | Times of E * E
(* to encode "2+3*x" *)
let prog = Add (Int 2, Times (Int 3, Id "x"))
(* Easy and happy! *)
AST for "stm" in SML
SS -> S | SS ; S
S -> x := E | print (E)
(* data structures; exp is assumed to be the expression datatype, with constructors Int and Id *)
datatype S = Assign of string * exp
           | Print of exp
type SS = S list
(* to encode "x:=3; print(x)" *)
val prog = [Assign ("x", Int 3), Print (Id "x")]
AST in F#
(* number of nodes *)
let rec numNodes e =
  match e with
  | Int _ -> 1
  | Id _ -> 1
  | Add (e1, e2) -> 1 + (numNodes e1) + (numNodes e2)
  | Times (e1, e2) -> 1 + (numNodes e1) + (numNodes e2)
(* Note this may be too inefficient, why? *)
AST in F#
(* tail recursion using an accumulator: n carries the count so far, and the second recursive call is a tail call *)
let rec numNodes (e, n) =
  match e with
  | Int _ -> 1 + n
  | Id _ -> 1 + n
  | Add (e1, e2) ->
      let n' = numNodes (e1, n)
      numNodes (e2, 1 + n')
  | Times (e1, e2) -> … (* similar *)
AST in F#
(* yet another version using a reference cell *)
let nodes = ref 0
let bump (r: int ref) = r.Value <- r.Value + 1  (* increments the counter; bump is a helper name chosen here *)
let rec numNodes e =
  match e with
  | Int _ -> bump nodes
  | Id _ -> bump nodes
  | Add (e1, e2) -> (numNodes e1; bump nodes; numNodes e2)
  | Times (e1, e2) -> … (* similar *)
AST for "exp" in Java
E ::= id | num | E + E | E * E | ( E )
/* data structures */
abstract class Exp {}
class IntExp extends Exp {
  int i;
  IntExp (int i) { this.i = i; }
}
// constructors omitted from the following classes
class IdExp extends Exp { String id; }
class AddExp extends Exp { Exp e1; Exp e2; }
class TimesExp extends Exp { Exp e1; Exp e2; }
Local Class Hierarchy
E ::= id | num | E + E | E * E | ( E )
[Figure: class hierarchy with Exp at the root and IntExp, IdExp, AddExp, TimesExp as its subclasses]
/* to encode "2+3*x" */
Exp e = new AddExp (new IntExp (2), new TimesExp (new IntExp (3), new IdExp ("x")));
/* Not as ugly as in C, but still boring */
AST for "stm" in Java
stms -> stm | stms ; stm
stm -> x := e | print (e)
/* data structures */
class Stms { LinkedList<Stm> stms; }
abstract class Stm {}
class AssignStm extends Stm { String x; Exp e; }
class PrintStm extends Stm { Exp e; }
/* to encode "x:=3; print(x)" */
LinkedList<Stm> prog = new LinkedList<>();
prog.add (new AssignStm (…));
prog.add (new PrintStm (…));
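One possible way to complete the statement encoding is sketched below. It assumes constructors of the form `AssignStm(String, Exp)` and `PrintStm(Exp)`, which the slides leave out, and it adds a hypothetical `run` method that walks the statement list so the encoding can be checked. Everything is nested inside a wrapper class `StmDemo` (a name chosen here) to keep the example in one file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; constructor shapes and run() are assumptions.
class StmDemo {
    static abstract class Exp {}
    static class IntExp extends Exp {
        final int i;
        IntExp(int i) { this.i = i; }
    }
    static class IdExp extends Exp {
        final String id;
        IdExp(String id) { this.id = id; }
    }

    static abstract class Stm {}
    static class AssignStm extends Stm {
        final String x;
        final Exp e;
        AssignStm(String x, Exp e) { this.x = x; this.e = e; }
    }
    static class PrintStm extends Stm {
        final Exp e;
        PrintStm(Exp e) { this.e = e; }
    }

    // Encode "x := 3; print(x)" as a list of statement nodes.
    static List<Stm> sample() {
        List<Stm> prog = new ArrayList<>();
        prog.add(new AssignStm("x", new IntExp(3)));
        prog.add(new PrintStm(new IdExp("x")));
        return prog;
    }

    // Evaluate an expression under the current variable bindings.
    static int eval(Exp e, Map<String, Integer> env) {
        if (e instanceof IntExp) return ((IntExp) e).i;
        if (e instanceof IdExp) {
            String id = ((IdExp) e).id;
            Integer v = env.get(id);
            if (v == null) throw new IllegalStateException("unbound variable " + id);
            return v;
        }
        throw new IllegalStateException("unknown expression node");
    }

    // Walk the statements in order and collect what each print outputs.
    static List<Integer> run(List<Stm> prog) {
        Map<String, Integer> env = new HashMap<>();
        List<Integer> output = new ArrayList<>();
        for (Stm s : prog) {
            if (s instanceof AssignStm) {
                AssignStm a = (AssignStm) s;
                env.put(a.x, eval(a.e, env));
            } else if (s instanceof PrintStm) {
                output.add(eval(((PrintStm) s).e, env));
            } else {
                throw new IllegalStateException("unknown statement node");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(sample())); // [3]
    }
}
```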
AST in Java
/* number of nodes again */
int numNodes (Exp e) {
  if (e instanceof IntExp) return 1;
  else if (e instanceof IdExp) return 1;
  else if (e instanceof AddExp) {
    AddExp f = (AddExp) e;
    return 1 + numNodes (f.e1) + numNodes (f.e2);
  }
  …
}
But this breaks the modularity of Java: the function has to test for and cast to every node class by hand. A better way is to use the so-called visitor pattern. Read Tiger chap 4 and do lab 2.
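The visitor pattern mentioned above can be sketched roughly as follows. This is one common formulation, not necessarily the one used in the Tiger book or in the lab. The names `Visitor`, `accept`, `NumNodes`, and `Printer` are assumptions made for this example. Each node class forwards to the matching `visit` method, so an operation is written once as a class, without `instanceof` tests or casts.

```java
// Hypothetical sketch of the visitor pattern over the Exp hierarchy.
class VisitorDemo {
    // One visit method per node class; R is the result type of the walk.
    interface Visitor<R> {
        R visitInt(IntExp e);
        R visitId(IdExp e);
        R visitAdd(AddExp e);
        R visitTimes(TimesExp e);
    }

    static abstract class Exp {
        abstract <R> R accept(Visitor<R> v);
    }
    static class IntExp extends Exp {
        final int i;
        IntExp(int i) { this.i = i; }
        <R> R accept(Visitor<R> v) { return v.visitInt(this); }
    }
    static class IdExp extends Exp {
        final String id;
        IdExp(String id) { this.id = id; }
        <R> R accept(Visitor<R> v) { return v.visitId(this); }
    }
    static class AddExp extends Exp {
        final Exp e1, e2;
        AddExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
    }
    static class TimesExp extends Exp {
        final Exp e1, e2;
        TimesExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        <R> R accept(Visitor<R> v) { return v.visitTimes(this); }
    }

    // Operation 1: count the nodes of the tree.
    static class NumNodes implements Visitor<Integer> {
        public Integer visitInt(IntExp e) { return 1; }
        public Integer visitId(IdExp e) { return 1; }
        public Integer visitAdd(AddExp e) {
            return 1 + e.e1.accept(this) + e.e2.accept(this);
        }
        public Integer visitTimes(TimesExp e) {
            return 1 + e.e1.accept(this) + e.e2.accept(this);
        }
    }

    // Operation 2: fully parenthesized pretty printing.
    static class Printer implements Visitor<String> {
        public String visitInt(IntExp e) { return Integer.toString(e.i); }
        public String visitId(IdExp e) { return e.id; }
        public String visitAdd(AddExp e) {
            return "(" + e.e1.accept(this) + " + " + e.e2.accept(this) + ")";
        }
        public String visitTimes(TimesExp e) {
            return "(" + e.e1.accept(this) + " * " + e.e2.accept(this) + ")";
        }
    }

    // The AST for "2+3*x".
    static Exp sample() {
        return new AddExp(new IntExp(2),
                          new TimesExp(new IntExp(3), new IdExp("x")));
    }

    public static void main(String[] args) {
        System.out.println(sample().accept(new NumNodes())); // 5
        System.out.println(sample().accept(new Printer()));  // (2 + (3 * x))
    }
}
```

The trade-off is that adding a new node class now means touching every visitor, while adding a new operation touches no node class at all.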
AST Generation • Attribute-grammar scheme • each nonterminal may have a semantic value v associated with it • when the parser recognizes a rule X -> s1 … sn • a semantic action is executed • it uses the semantic values of the symbols s1 … sn • when parsing completes successfully • the parser returns the semantic value associated with the start symbol • usually an abstract syntax tree
AST Generation in tools • In a top-down parser, ASTs are returned (recursively) as the values of the parsing functions • Yacc-like tools encode this strategy in semantic actions
AST generation in a recursive descent parser: parse_stms
SS -> S | S ; SS
S -> id := E | print (E)
E -> id | num | E + E | E * E
List<S> parse_stms () {
  List<S> list = new List<S>();
  S stm = parse_stm ();
  list.addLast (stm);
  while (current_token == ';') {
    eat (';');
    stm = parse_stm ();
    list.addLast (stm);
  }
  return list;
}
AST generation in a recursive descent parser: parse_stm
/* parse_exp is assumed to start at the lowest-precedence level, i.e. to call parse_addexp */
S parse_stm () {
  switch (current_token) {
  case ID:
    String x = current_token; // remember the "x"
    eat (ID);
    eat (ASSIGN);
    E rhs = parse_exp ();
    return new AssignStm (x, rhs);
  case PRINT:
    eat (PRINT);
    eat (LPAREN);
    E arg = parse_exp ();
    eat (RPAREN);
    return new PrintStm (arg);
  default:
    error ("want ID or PRINT");
  }
}
AST generation in a recursive descent parser: parse_addexp and parse_timesexp
E parse_addexp () {
  E e1 = parse_timesexp ();
  while (current_token == '+') {
    eat ('+');
    E e2 = parse_timesexp ();
    e1 = new AddExp (e1, e2); // fold to the left: a + b + c becomes (a + b) + c
  }
  return e1;
}
E parse_timesexp () {
  E e1 = parse_atom ();
  while (current_token == '*') {
    eat ('*');
    E e2 = parse_atom ();
    e1 = new TimesExp (e1, e2);
  }
  return e1;
}
AST generation in a recursive descent parser: parse_atom
E parse_atom () {
  switch (current_token) {
  case ID: {
    E e = new IdExp (current_token);
    eat (ID);
    return e;
  }
  case NUM: {
    E e = new IntExp (current_token);
    eat (NUM);
    return e;
  }
  default:
    error ("want ID or NUM, but got …");
  }
}
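Putting the parsing functions together, a self-contained version might look like the sketch below. It is written under several assumptions that are not in the slides: tokens are separated by whitespace (so the lexer is just a `split`), the class is named `RdParserDemo`, and each node gets a `toString` that renders it in prefix form so the resulting tree can be inspected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; tokens must be separated by whitespace.
class RdParserDemo {
    static abstract class Exp {}
    static class IdExp extends Exp {
        final String id;
        IdExp(String id) { this.id = id; }
        public String toString() { return id; }
    }
    static class IntExp extends Exp {
        final int i;
        IntExp(int i) { this.i = i; }
        public String toString() { return Integer.toString(i); }
    }
    static class AddExp extends Exp {
        final Exp e1, e2;
        AddExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        public String toString() { return "Add(" + e1 + ", " + e2 + ")"; }
    }
    static class TimesExp extends Exp {
        final Exp e1, e2;
        TimesExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        public String toString() { return "Times(" + e1 + ", " + e2 + ")"; }
    }
    static abstract class Stm {}
    static class AssignStm extends Stm {
        final String x; final Exp e;
        AssignStm(String x, Exp e) { this.x = x; this.e = e; }
        public String toString() { return "Assign(" + x + ", " + e + ")"; }
    }
    static class PrintStm extends Stm {
        final Exp e;
        PrintStm(Exp e) { this.e = e; }
        public String toString() { return "Print(" + e + ")"; }
    }

    private final String[] tokens;
    private int pos = 0;

    RdParserDemo(String source) { tokens = source.trim().split("\\s+"); }

    private String peek() { return pos < tokens.length ? tokens[pos] : "<eof>"; }

    private void eat(String expected) {
        if (!peek().equals(expected))
            throw new IllegalStateException("want " + expected + " but got " + peek());
        pos++;
    }

    private String eatId() {
        String t = peek();
        if (t.equals("print") || !t.matches("[A-Za-z][A-Za-z0-9]*"))
            throw new IllegalStateException("want ID but got " + t);
        pos++;
        return t;
    }

    // SS -> S | S ; SS
    List<Stm> parseStms() {
        List<Stm> list = new ArrayList<>();
        list.add(parseStm());
        while (peek().equals(";")) { eat(";"); list.add(parseStm()); }
        return list;
    }

    // S -> id := E | print ( E )
    Stm parseStm() {
        if (peek().equals("print")) {
            eat("print"); eat("(");
            Exp e = parseExp();
            eat(")");
            return new PrintStm(e);
        }
        String x = eatId();
        eat(":=");
        return new AssignStm(x, parseExp());
    }

    // Additive level: each parsing function returns the subtree it recognized.
    Exp parseExp() {
        Exp e = parseTimesExp();
        while (peek().equals("+")) { eat("+"); e = new AddExp(e, parseTimesExp()); }
        return e;
    }

    // Multiplicative level: binds tighter than +.
    Exp parseTimesExp() {
        Exp e = parseAtom();
        while (peek().equals("*")) { eat("*"); e = new TimesExp(e, parseAtom()); }
        return e;
    }

    Exp parseAtom() {
        String t = peek();
        if (t.matches("[0-9]+")) { pos++; return new IntExp(Integer.parseInt(t)); }
        return new IdExp(eatId());
    }

    static List<Stm> parse(String source) {
        RdParserDemo p = new RdParserDemo(source);
        List<Stm> stms = p.parseStms();
        if (p.pos < p.tokens.length)
            throw new IllegalStateException("unexpected token " + p.peek());
        return stms;
    }

    public static void main(String[] args) {
        // prints [Assign(x, Add(2, Times(3, y))), Print(x)]
        System.out.println(parse("x := 2 + 3 * y ; print ( x )"));
    }
}
```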
AST generation in LR parser [Figure: shift-reduce parse of 2 + 3 * 4, showing the parse stack and the remaining input at each step, with a partial tree attached to each stack symbol] • Each nonterminal is associated with a tree.
AST generation in LR parser
e -> e PLUS e   ($$ = Add ($1, $3))
   | e TIMES e  ($$ = Times ($1, $3))
   | ID         ($$ = Id ($1))
   | NUM        ($$ = Num ($1))
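What the semantic actions do to the parser's value stack can be imitated by hand. The Java sketch below is not a generated LR parser: the shift and reduce steps for 2 + 3 * 4 are written out manually in the order such a parser would take them, assuming * binds tighter than +. The names `LrActionDemo`, `shift`, `reduceNum`, `reducePlus`, and `reduceTimes` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical hand simulation of Yacc-style semantic actions.
class LrActionDemo {
    static abstract class Exp {}
    static class Num extends Exp {
        final int n;
        Num(int n) { this.n = n; }
        public String toString() { return Integer.toString(n); }
    }
    static class Add extends Exp {
        final Exp e1, e2;
        Add(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        public String toString() { return "Add(" + e1 + ", " + e2 + ")"; }
    }
    static class Times extends Exp {
        final Exp e1, e2;
        Times(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
        public String toString() { return "Times(" + e1 + ", " + e2 + ")"; }
    }

    // Semantic value stack: tokens are Strings, nonterminals carry Exp trees.
    private final Deque<Object> values = new ArrayDeque<>();

    void shift(String token) { values.push(token); }

    // e -> NUM   ($$ = Num ($1))
    void reduceNum() {
        String lexeme = (String) values.pop();
        values.push(new Num(Integer.parseInt(lexeme)));
    }

    // e -> e PLUS e   ($$ = Add ($1, $3))
    void reducePlus() {
        Exp right = (Exp) values.pop(); // $3
        values.pop();                   // $2, the PLUS token
        Exp left = (Exp) values.pop();  // $1
        values.push(new Add(left, right));
    }

    // e -> e TIMES e   ($$ = Times ($1, $3))
    void reduceTimes() {
        Exp right = (Exp) values.pop(); // $3
        values.pop();                   // $2, the TIMES token
        Exp left = (Exp) values.pop();  // $1
        values.push(new Times(left, right));
    }

    Exp result() { return (Exp) values.peek(); }

    // The steps for "2 + 3 * 4", scripted by hand.
    static Exp sample() {
        LrActionDemo p = new LrActionDemo();
        p.shift("2"); p.reduceNum();
        p.shift("+");
        p.shift("3"); p.reduceNum();
        p.shift("*");
        p.shift("4"); p.reduceNum();
        p.reduceTimes(); // 3 * 4 is reduced first
        p.reducePlus();  // then 2 + (3 * 4)
        return p.result();
    }

    public static void main(String[] args) {
        System.out.println(sample()); // Add(2, Times(3, 4))
    }
}
```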
Source Position • In a one-pass compiler, error messages are precise, because errors are reported while the source text is still being read • early compilers never had to worry about this • But in a multi-pass compiler, source positions must be stored in the AST for later passes to use
/* Example */
class AddExp { Exp left; Exp right; int lineNum; int columnNum; }
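To show why the stored positions matter, here is a small Java sketch of a later pass that reports undefined variables. The pass itself (`check`), the error message format, and the wrapper class `SourcePosDemo` are assumptions made for this illustration. Only the `lineNum` and `columnNum` fields follow the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a later pass that uses positions saved by the parser.
class SourcePosDemo {
    static abstract class Exp {
        final int lineNum, columnNum;
        Exp(int lineNum, int columnNum) {
            this.lineNum = lineNum;
            this.columnNum = columnNum;
        }
    }
    static class IdExp extends Exp {
        final String id;
        IdExp(String id, int lineNum, int columnNum) {
            super(lineNum, columnNum);
            this.id = id;
        }
    }
    static class AddExp extends Exp {
        final Exp left, right;
        AddExp(Exp left, Exp right, int lineNum, int columnNum) {
            super(lineNum, columnNum);
            this.left = left;
            this.right = right;
        }
    }

    // Report every use of a name that is not in `defined`.
    static void check(Exp e, Set<String> defined, List<String> errors) {
        if (e instanceof IdExp) {
            IdExp v = (IdExp) e;
            if (!defined.contains(v.id)) {
                errors.add(v.lineNum + ":" + v.columnNum
                        + ": undefined variable " + v.id);
            }
        } else if (e instanceof AddExp) {
            AddExp a = (AddExp) e;
            check(a.left, defined, errors);
            check(a.right, defined, errors);
        }
    }

    // "x + y" on line 2: x at column 1, + at column 3, y at column 5.
    static List<String> sample() {
        Exp e = new AddExp(new IdExp("x", 2, 1), new IdExp("y", 2, 5), 2, 3);
        Set<String> defined = new HashSet<>(Arrays.asList("x"));
        List<String> errors = new ArrayList<>();
        check(e, defined, errors);
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(sample()); // [2:5: undefined variable y]
    }
}
```

Without the two integer fields, this pass could still detect the error but could not say where it is.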
Summary • Abstract syntax trees are the compiler's internal representation of source programs • they are the interface between the front end and the later phases of the compiler • The design of an abstract syntax tree is language-dependent • more art than science