One-Pass Compiler. Compiler Design, CSC532.
 
                
Symbol Table • Stores the symbols of the source program as the compiler encounters them. • Each entry contains the symbol name plus a number of parameters describing what is known about the symbol. • Reserved words (if, then, else, etc.) may be stored in the symbol table as well.
Symbol Table • As a minimum we must be able to: • INSERT a new symbol into the table, • RETRIEVE a symbol so that its parameters may be read and/or modified, • QUERY to find out if a symbol is already in the table. • Each entry can be implemented as a record. Records can have different formats (variant records in Pascal).
Storing characters • Method 1: A fixed-size space within each entry, large enough to hold the longest possible name. Most names will be much shorter than this, so a lot of storage is wasted. • Method 2: Store all symbols in one large separate array. Each symbol is terminated with an end-of-symbol mark (EOS). Each symbol table record contains a pointer to the first character of the symbol. • Method n: Modern languages (e.g. Java, the C++ standard library) provide efficient data structures for this, e.g. string or vector.
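Method 2 can be sketched as follows. This is only an illustration: the class name `NamePool`, the field name `name_ptr`, and the choice of `"\0"` as the EOS mark are assumptions, not part of the slides.

```python
EOS = "\0"  # assumed end-of-symbol mark


class NamePool:
    """Stores all symbol names back to back in one character array."""

    def __init__(self):
        self.chars = []  # the one large array of characters

    def add(self, name):
        """Append name followed by EOS; return the index of its first character."""
        start = len(self.chars)
        self.chars.extend(name)
        self.chars.append(EOS)
        return start

    def get(self, start):
        """Read characters from start up to (not including) the next EOS."""
        end = self.chars.index(EOS, start)
        return "".join(self.chars[start:end])


pool = NamePool()
# Each symbol table record keeps only a pointer (index) into the pool.
entries = [{"name_ptr": pool.add(n)} for n in ("position", "initial", "rate")]
```

Each record costs one integer regardless of name length, and the pool grows only by the length of each name plus one mark.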
Symbol Table Data Structure • One Linear list: • Easy to implement • search time will be very long if source has many symbols.
Symbol Table Data Structure Hash table: • Run the symbol name through a hash function to create an index in a table. • If some other symbol has already claimed the space then rehash with another hash function to get another index, etc. • Hash Table must be large enough to accommodate largest number of symbols.
Symbol Table Data Structure • Open hash: • Store the entries in a number of linear lists (called buckets). • Use a hash function on the symbol name to determine which list to use. • A good hash function will spread the symbols across the buckets, so each linear list will be short.
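A minimal sketch of an open-hash symbol table with the three required operations (insert, retrieve, query). The class and method names, the dictionary-based record format, and the use of Python's built-in `hash` are assumptions made for illustration.

```python
class SymbolTable:
    """Open hashing: each bucket is a short linear list of entries."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, name):
        # The hash of the name selects which linear list to use.
        return self.buckets[hash(name) % len(self.buckets)]

    def retrieve(self, name):
        """Return the entry for name, or None if it is not in the table."""
        for entry in self._bucket(name):
            if entry["name"] == name:
                return entry
        return None

    def query(self, name):
        """Report whether name is already in the table."""
        return self.retrieve(name) is not None

    def insert(self, name, **attrs):
        """Add a new entry; inserting an existing name is an error."""
        if self.query(name):
            raise KeyError(f"duplicate symbol: {name}")
        entry = {"name": name, **attrs}
        self._bucket(name).append(entry)
        return entry


table = SymbolTable()
for sym in ("position", "initial", "rate"):
    table.insert(sym, kind="id")
table.retrieve("rate")["type"] = "real"  # parameters can be modified in place
```

Because buckets are lists, the table never fills up; it only slows down as the lists grow.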
Hash Functions • The goal is a hash function that generates a different index for each symbol name in the source: index = f(string). • Some programmers use symbols like tmp1, tmp2, tmp3, so the hash function should use the last character of the name.
Hash Functions (continued) • Other programmers use symbols like xvel, yvel, zvel, so the hash function should use the first character of the name. • It is best if all characters in the name are used. • Characters should be given different weights so that x2y2z, y2x2z, z2y2x, … are hashed differently. • Modern languages provide built-in hash functions/objects.
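The effect of position-dependent weights can be seen in a small sketch. Both function names, the multiplier 31, and the table size 211 are illustrative assumptions; the slides do not prescribe a particular function.

```python
def sum_hash(name, table_size=211):
    """Unweighted: every character counts the same, so anagrams collide."""
    return sum(ord(ch) for ch in name) % table_size


def weighted_hash(name, table_size=211):
    """Weighted: multiplying by 31 at each step gives every position a different weight."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % table_size
    return h


anagrams = ["x2y2z", "y2x2z", "z2y2x"]
unweighted = {sum_hash(n) for n in anagrams}    # one shared index
weighted = {weighted_hash(n) for n in anagrams}  # three distinct indices
```

The weighted version also separates names that differ only in the first character (xvel, yvel, zvel) or only in the last (tmp1, tmp2, tmp3), since every character contributes.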
CONTD. • Example source statement: • position := initial + rate * 60 • After lexical analysis: • id1 := id2 + id3 * 60 and three symbols are entered in the symbol table: • position • initial • rate
CONTD. After syntax analysis:
        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   60
CONTD. • After semantic analysis:
        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   inttoreal
                   |
                   60
CONTD. • After intermediate code generation:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
• After code optimization:
temp1 := id3 * 60.0
id1 := id2 + temp1
• After final code generation:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
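The intermediate code generation step can be sketched as a post-order walk that introduces one temporary per interior node. The tuple encoding of the tree and the function names `gen` and `translate` are assumptions for illustration; optimization and final code generation are not covered.

```python
def gen(node, code, counter):
    """Emit three-address code for node; return the name that holds its value."""
    if isinstance(node, str):  # leaf: an identifier or a constant
        return node
    op = node[0]
    if op == "inttoreal":      # unary conversion inserted by semantic analysis
        expr = f"inttoreal({gen(node[1], code, counter)})"
    else:                      # binary operator: children first, left to right
        left = gen(node[1], code, counter)
        right = gen(node[2], code, counter)
        expr = f"{left} {op} {right}"
    counter[0] += 1
    temp = f"temp{counter[0]}"
    code.append(f"{temp} := {expr}")
    return temp


def translate(assignment):
    """Translate a (':=', target, expression) tree into a list of instructions."""
    _, target, expr = assignment
    code = []
    result = gen(expr, code, [0])
    code.append(f"{target} := {result}")
    return code


# Tree after semantic analysis for: position := initial + rate * 60
tree = (":=", "id1", ("+", "id2", ("*", "id3", ("inttoreal", "60"))))
code = translate(tree)
```

Here `code` reproduces the four unoptimized instructions listed under intermediate code generation.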
Some Definitions • Lexeme: The character sequence forming a token. Examples: :=, *, +, rate, 60 • Syntax: What programs look like. • Semantics: What programs mean.
Context Free Grammar • Used to specify the syntax of a language. • Usually written in Backus-Naur Form (BNF). • A single production such as list → list + digit is not, by itself, a context-free grammar.
Context Free Grammar • Example: In C an if-else statement looks like: if (expression) statement else statement The statement is a concatenation of 7 elements: • the keyword if, • opening parenthesis, • an expression, • a closing parenthesis, • a statement, • the keyword else, • a statement.
CFG • We write this as a production: stmt → if (expr) stmt else stmt where • stmt denotes a statement, • expr denotes an expression, • the arrow “→” is read as “can have the form”. • The tokens in this production are: if, else, (, ) • The variables are stmt and expr. • Variables represent sequences of tokens and are called non-terminals.
CFG: Notation • A context free grammar has 4 components: • A set of tokens known as terminal symbols • A set of non terminals • A set of productions • A non terminal designated as a start symbol.
CFG: Example • The productions are: list → list + digit list → list – digit list → digit digit → 0|1|2|3|4|5|6|7|8|9
CFG: Example • The vertical lines in the last production mean “or”. A digit can have the form of 0 or 1 or 2, etc. The first three productions can be combined: • list → list + digit | list – digit | digit • The tokens (terminals) of this grammar are: + - 0 1 2 3 4 5 6 7 8 9 • The non-terminals are list and digit, with list being the starting non-terminal because its productions are written first. • What is 9-5+2?
This is the parse tree for 9-5+2:
              list
            /  |  \
        list   +   digit
       /  |  \       |
   list   -  digit   2
     |         |
   digit       5
     |
     9
CFG: Another example block → begin opt_stmts end opt_stmts → stmts_list | ε stmts_list → stmts_list ; stmt | stmt where ε = the empty string of symbols.
Ambiguity • Consider a grammar with a single production: string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9 • A string like 9-5+2 will have two parse trees:
9-5+2 will have two parse trees:
          string                         string
        /   |    \                     /   |    \
   string   +   string            string   -   string
  /   |   \        |                 |        /   |   \
string - string    2                 9   string  +  string
  |        |                               |          |
  9        5                               5          2
Ambiguity • The left parse tree parses the expression as though it were written (9-5) +2 which equals 6. • The right parse tree parses the expression as though it were written 9- (5+2) which equals 2. • It is important to have only one parse tree for any string of symbols. The grammar should be unambiguous.
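To make the two readings concrete, the sketch below enumerates every parse the ambiguous grammar allows by trying each operator as the root. The function name `all_parses` and the pre-split token list are assumptions; it only handles single digits with + and -.

```python
def all_parses(tokens):
    """Return (parenthesized form, value) for every parse of tokens under
    string -> string + string | string - string | digit."""
    if len(tokens) == 1:
        return [(tokens[0], int(tokens[0]))]
    results = []
    # Operators sit at the odd positions; each one may be the root of the tree.
    for i in range(1, len(tokens), 2):
        op = tokens[i]
        for ltext, lval in all_parses(tokens[:i]):
            for rtext, rval in all_parses(tokens[i + 1:]):
                value = lval + rval if op == "+" else lval - rval
                results.append((f"({ltext}{op}{rtext})", value))
    return results


parses = all_parses(["9", "-", "5", "+", "2"])
# parses == [("(9-(5+2))", 2), ("((9-5)+2)", 6)]
```

Two different values for the same token string is exactly what an unambiguous grammar rules out.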
Ambiguity Reduction • Associativity of operators • Precedence of operators • Syntax for arithmetic expressions: Assume the basic units are digits and parenthesized expressions. • factor → digit | (expr)
Associativity of operators: • In most languages addition, subtraction, multiplication and division are left associative. • Exponentiation is usually right associative. • In C the assignment operator, =, is right associative: a = b = c is treated like a = (b = c).
Precedence of operators: • Usually multiplication and division have higher precedence than addition and subtraction. • An expression like 9+5*2 is therefore treated as 9+(5*2), not (9+5)*2.
Syntax for arithmetic expressions • The binary operators * and / have the highest precedence. They are left associative. term → term * factor | term / factor | factor • Terms are combined with + and -. Therefore the resultant grammar is:
expr → expr + term | expr – term | term
term → term * factor | term / factor | factor
factor → digit | (expr)
digit → 0|1|2|3|4|5|6|7|8|9
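One way to see that this grammar encodes both precedence and left associativity is a small evaluator with one function per non-terminal. This is a sketch under assumptions: the function name `evaluate` is hypothetical, input is limited to single digits, and the left-recursive productions are realized as loops rather than recursion.

```python
def evaluate(text):
    """Evaluate a single-digit arithmetic expression following the grammar
    expr -> expr +|- term | term, term -> term *|/ factor | factor."""
    tokens = list(text.replace(" ", ""))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():  # factor -> digit | ( expr )
        tok = peek()
        if tok is not None and tok.isdigit():
            return int(advance())
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():  # the loop folds operands left to right: left associative
        value = factor()
        while peek() in ("*", "/"):
            op, rhs = advance(), factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():  # calls term first, so * and / bind tighter than + and -
        value = term()
        while peek() in ("+", "-"):
            op, rhs = advance(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

For instance, `evaluate("9+5*2")` groups the multiplication first and `evaluate("9-5+2")` groups the subtraction first.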
Syntax of our Source Language • program → program id (identifier_list); declarations subprogram_declarations compound_statement • identifier_list → id | identifier_list, id • declarations → declarations var identifier_list : type; | ε • type → standard_type | array [num .. num] of standard_type • standard_type → integer | real
• subprogram_declarations → subprogram_declarations subprogram_declaration; | ε • subprogram_declaration → subprogram_head declarations compound_statement • subprogram_head → function id arguments : standard_type; | procedure id arguments; • arguments → (parameter_list) | ε • parameter_list → identifier_list : type | parameter_list; identifier_list : type • compound_statement → begin optional_statements end • optional_statements → statement_list | ε • statement_list → statement | statement_list; statement
• statement → variable assignop expression | procedure_statement | compound_statement | if expression then statement else statement | while expression do statement • variable → id | id [expression] • procedure_statement → id | id (expression_list) • expression_list → expression | expression_list, expression • expression → simple_expression | simple_expression relop simple_expression
• simple_expression → term | sign term | simple_expression addop term • term → factor | term mulop factor • factor → id | id (expression_list) | num | (expression) | not factor • sign → + | -
Syntax-Directed Translation • Associate a set of attributes with each grammar symbol. With each production associate a set of semantic rules for computing values of the attributes. • Synthesized attribute: The value of the attribute at any node of a parse tree can be computed from the attribute values of the children of that node. • Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.
SDT (continued) • Example: Translating infix notation to postfix notation. If a node in the parse tree is labeled with X, then let X.t be a string-valued attribute associated with the node. X.t || Y.t means concatenate X.t with Y.t.
Syntax-Directed Definition
PRODUCTION               SEMANTIC RULE
expr → expr1 + term      expr.t := expr1.t || term.t || ‘+’
expr → expr1 – term      expr.t := expr1.t || term.t || ‘-’
expr → term              expr.t := term.t
term → 0                 term.t := ‘0’
term → 1                 term.t := ‘1’
……                       …..
term → 9                 term.t := ‘9’
Attribute Values at Nodes in Parse Tree
                      expr.t = 95-2+
                    /       |        \
          expr.t = 95-      +      term.t = 2
         /      |     \                |
  expr.t = 9    -   term.t = 5         2
      |                 |
  term.t = 9            5
      |
      9
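The bottom-up evaluation of the synthesized attribute t can be sketched in a few lines. The tuple encoding `(operator, left, right)` for interior nodes and the function name `t` are assumptions for illustration; the chain expr → term → digit is collapsed into a plain string leaf.

```python
def t(node):
    """Compute the synthesized attribute t (the postfix string) for a node."""
    if isinstance(node, str):      # term -> digit: term.t is the digit itself
        return node
    op, left, right = node         # expr -> expr1 op term
    return t(left) + t(right) + op  # expr.t := expr1.t || term.t || op


# Parse tree for 9-5+2, grouped as (9-5)+2
tree = ("+", ("-", "9", "5"), "2")
postfix = t(tree)
```

Each call uses only the values returned for its children, which is what makes a single bottom-up traversal sufficient.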
Example: Robot
PRODUCTION            SEMANTIC RULES
seq → begin           seq.x := 0; seq.y := 0
seq → seq1 instr      seq.x := seq1.x + instr.dx; seq.y := seq1.y + instr.dy
instr → east          instr.dx := 1; instr.dy := 0
instr → north         instr.dx := 0; instr.dy := 1
instr → west          instr.dx := -1; instr.dy := 0
instr → south         instr.dx := 0; instr.dy := -1
Annotated parse tree for the input begin west south:
                    seq.x = -1, seq.y = -1
                   /                      \
       seq.x = -1, seq.y = 0       instr.dx = 0, instr.dy = -1
        /               \                      |
seq.x = 0, seq.y = 0   instr.dx = -1, instr.dy = 0   south
        |                    |
      begin                west
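The robot's semantic rules translate almost directly into code. In this sketch the names `DELTAS` and `final_position` are assumptions, and the input is taken as an already-tokenized list.

```python
# instr.dx and instr.dy for each instruction, as in the semantic rules
DELTAS = {"east": (1, 0), "north": (0, 1), "west": (-1, 0), "south": (0, -1)}


def final_position(tokens):
    """Evaluate seq.x and seq.y for a token list such as ['begin', 'west', 'south']."""
    if not tokens or tokens[0] != "begin":
        raise SyntaxError("a sequence must start with 'begin'")
    x, y = 0, 0                     # seq -> begin
    for instr in tokens[1:]:        # seq -> seq1 instr, applied left to right
        if instr not in DELTAS:
            raise SyntaxError(f"unknown instruction {instr!r}")
        dx, dy = DELTAS[instr]
        x, y = x + dx, y + dy
    return x, y


pos = final_position(["begin", "west", "south"])
```

The loop mirrors the left-recursive production: each iteration corresponds to one seq → seq1 instr node, from the bottom of the tree upward.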
Translation Schemes • Translation scheme: A context-free grammar with semantic actions embedded within the right sides of the productions. • Example: rest → + term {print (‘+’)} rest1 • The semantic action is enclosed within braces. The production itself is: rest → + term rest1 • Parse tree: Do a post-order traversal of the tree. After the + and term children are traversed, the {print (‘+’)} leaf is traversed and the semantic action is performed, then the rest1 child is traversed, and finally the root, rest, is visited.
In a simple syntax-directed definition the translation order of the non-terminals on the right sides is the same as their order in the productions. These definitions can be implemented with translation schemes.
                 rest
        /     |        |         \
       +    term  {print (‘+’)}  rest1
Example: Translating into Postfix Form • expr → expr + term {print (‘+’)} • expr → expr - term {print (‘-’)} • expr → term • term → 0 {print (‘0’)} • term → 1 {print (‘1’)} • …… • term → 9 {print (‘9’)}
Parsing • Determines if a string of tokens can be generated by a grammar. • A parser can be constructed for any context-free grammar. • For any context-free grammar there is a parser that takes at most O(n^3) time to parse a string of n tokens. • Almost all programming languages that arise in practice can be parsed in O(n) time, making a single left-to-right scan of the input and looking ahead one token at a time. • Two classes of parsing methods: Top-down: construct the parse tree starting at the root and working down towards the leaves. Bottom-up: construct the parse tree starting at the leaves and working up towards the root. • Efficient top-down parsers are easier to construct by hand. • Bottom-up parsers handle a larger class of grammars and translation schemes.
Top-Down Parsing • Recursive-descent parsing is a top-down method in which we execute a set of recursive procedures to process the input. • Predictive parsing is a special case of recursive-descent parsing. It can be used if the current lookahead symbol unambiguously determines the production selected for each nonterminal. • Example grammar: type → simple | ↑ id | array [simple] of type simple → integer | char | num .. num
Pseudo Code for Predictive Parser
procedure match (t: token);
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end;

procedure type;
begin
    if lookahead is in {integer, char, num} then
        simple
    else if lookahead = ‘↑’ then begin
        match (‘↑’); match (id)
    end
    else if lookahead = array then begin
        match (array); match (‘[’); simple; match (‘]’); match (of); type
    end
    else error
end;

procedure simple;
begin
    if lookahead = integer then
        match (integer)
    else if lookahead = char then
        match (char)
    else if lookahead = num then begin
        match (num); match (..); match (num)
    end
    else error
end;
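The same predictive parser can be sketched in runnable form. Several things here are assumptions rather than part of the slides: the class and function names, the pre-tokenized input list, the `"$"` end marker, and writing the pointer alternative of type as the token `"^"` followed by `id`.

```python
class PredictiveParser:
    """One method per nonterminal; the lookahead token picks the production."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input (assumed)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        """Consume the lookahead if it equals t, otherwise report an error."""
        if self.lookahead != t:
            raise SyntaxError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1

    def parse_type(self):
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":          # pointer alternative (assumed spelling)
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.parse_type()
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")

    def simple(self):
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("..")
            self.match("num")
        else:
            raise SyntaxError(f"unexpected token {self.lookahead!r}")


def accepts(tokens):
    """Return True if tokens form exactly one type, False on any syntax error."""
    parser = PredictiveParser(tokens)
    try:
        parser.parse_type()
        parser.match("$")
    except SyntaxError:
        return False
    return True
```

For example, `accepts("array [ num .. num ] of integer".split())` succeeds, while `accepts("array [ num ] of char".split())` fails inside `simple` because `..` is missing.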
• There is no need to backtrack as long as the first tokens on the right sides of the productions for a nonterminal are disjoint. • ε-productions: If a nonterminal has an ε-production, treat the ε-production last. When the lookahead matches none of the other alternatives, the procedure simply returns without consuming input, so there is no “else error” at the end of the procedure. • Left recursion requires special handling. A production like expr → expr + term is left-recursive. If the expr procedure calls itself at the beginning, the parser will loop forever. Usually the production can be rewritten to make it right-recursive. • Example: expr → expr + term | term produces sequences like:
term
term + term
term + term + term
…..
• The same sequences can be produced with the following grammar: expr → term rest rest → + term rest | ε
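A sketch of a parser for the rewritten grammar, with the print actions of the postfix translation scheme placed after each term. The function name `to_postfix` is hypothetical, terms are limited to single digits, and a `-` alternative is added to rest by analogy with the `+` one; that addition is an assumption, not part of the grammar above.

```python
def to_postfix(text):
    """Parse expr -> term rest, rest -> + term rest | - term rest | epsilon,
    collecting the postfix translation instead of printing it."""
    tokens = list(text.replace(" ", ""))
    pos = 0
    out = []

    def term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            out.append(tokens[pos])  # action: print the digit
            pos += 1
        else:
            raise SyntaxError("digit expected")

    def rest():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            term()
            out.append(op)           # action: print the operator after its term
            rest()
        # epsilon-production: otherwise return without consuming input or raising

    term()
    rest()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return "".join(out)
```

Because `rest` calls itself only after consuming an operator and a term, every recursive call makes progress and the parser cannot loop forever, unlike a direct implementation of the left-recursive production.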