Winter 2012-2013 Compiler Principles Syntax Analysis (Parsing) – Part 2

Winter 2012-2013 Compiler Principles Syntax Analysis (Parsing) – Part 2 Mayer Goldberg and Roman Manevich, Ben-Gurion University

Today • Review top-down parsing • Recursive descent • LL(1) parsing • Start bottom-up parsing • LR(k) • SLR(k) • LALR(k)

Top-down parsing • Parser repeatedly faces the following problem • Given program P starting with terminal t • Non-terminal N with list of possible production rules: N → α1 … N → αk • Predict which rule should be used to derive P

Recursive descent parsing • Define a function for every nonterminal • Every function works as follows • Find applicable production rule • Terminal function checks match with next input token • Nonterminal function calls (recursively) other functions • If there are several applicable productions for a nonterminal, use lookahead

Boolean expressions example
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor
Input: not ( not true or false )
The production to apply is known from the next token:
E => not E
  => not ( E OP E )
  => not ( not E OP E )
  => not ( not LIT OP E )
  => not ( not true OP E )
  => not ( not true or E )
  => not ( not true or LIT )
  => not ( not true or false )
[Figure: parse tree of this derivation]

Flavors of top-down parsers • Manually constructed • Recursive descent (previous lecture, review now) • Generated (this lecture) • Based on pushdown automata • Does not use recursion

Matching tokens
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

match(token t) {
  if (current == t)
    current = next_token();
  else
    error;
}

Variable current holds the current input token

Functions for nonterminals
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

E() {
  if (current ∈ {TRUE, FALSE}) {      // E → LIT
    LIT();
  } else if (current == LPAREN) {     // E → ( E OP E )
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {        // E → not E
    match(NOT); E();
  } else {
    error;
  }
}

LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}

Implementation via recursion
E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

E() {
  if (current ∈ {TRUE, FALSE}) {
    LIT();
  } else if (current == LPAREN) {
    match(LPAREN); E(); OP(); E(); match(RPAREN);
  } else if (current == NOT) {
    match(NOT); E();
  } else {
    error;
  }
}

LIT() {
  if (current == TRUE) match(TRUE);
  else if (current == FALSE) match(FALSE);
  else error;
}

OP() {
  if (current == AND) match(AND);
  else if (current == OR) match(OR);
  else if (current == XOR) match(XOR);
  else error;
}
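The functions above can be sketched as runnable Python. This is a minimal sketch, not the course's implementation: the whitespace tokenizer, the `Parser` class, the `accepts` helper, and the `$` end marker are assumptions made for illustration.

```python
# Token spellings follow the slides; tokenizer and class layout are assumptions.
TOKENS = {"true", "false", "and", "or", "xor", "not", "(", ")"}


def tokenize(text):
    """Pad parentheses, split on whitespace, and append the end marker '$'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    for token in tokens:
        if token not in TOKENS:
            raise SyntaxError(f"unknown token {token!r}")
    return tokens + ["$"]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def current(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume the current token if it equals `expected`, otherwise fail."""
        if self.current != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.current!r}")
        self.pos += 1

    def E(self):
        if self.current in ("true", "false"):   # E -> LIT
            self.LIT()
        elif self.current == "(":               # E -> ( E OP E )
            self.match("(")
            self.E()
            self.OP()
            self.E()
            self.match(")")
        elif self.current == "not":             # E -> not E
            self.match("not")
            self.E()
        else:
            raise SyntaxError(f"unexpected token {self.current!r}")

    def LIT(self):
        if self.current in ("true", "false"):
            self.match(self.current)
        else:
            raise SyntaxError(f"expected a literal, found {self.current!r}")

    def OP(self):
        if self.current in ("and", "or", "xor"):
            self.match(self.current)
        else:
            raise SyntaxError(f"expected an operator, found {self.current!r}")


def accepts(text):
    """Return True when the whole input can be derived from E."""
    try:
        parser = Parser(text)
        parser.E()
        parser.match("$")   # nothing may follow the expression
        return True
    except SyntaxError:
        return False
```

For example, `accepts("not ( not true or false )")` follows the derivation shown earlier, while `accepts("( true and )")` fails inside `E` because `)` starts none of its alternatives.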

How is prediction done? p. 189 • For simplicity, let’s assume no null production rules • See book for general case • Find out the token that can appear first in a rule – FIRST sets

FIRST sets • For every production rule A → α • FIRST(α) = all terminals that α can start with • Every token that can appear as first in α under some derivation for α • In our Boolean expressions example • FIRST( LIT ) = { true, false } • FIRST( ( E OP E ) ) = { '(' } • FIRST( not E ) = { not } • No intersection between FIRST sets => can always pick a single rule • If the FIRST sets intersect, may need longer lookahead • LL(k) = class of grammars in which the production rule can be determined using a lookahead of k tokens • LL(1) is an important and useful class

Computing FIRST sets
Assume no null productions A → ε
Initially, for all nonterminals A, set FIRST(A) = { t | A → tω for some ω }
Repeat the following until no changes occur:
  for each nonterminal A
    for each production A → Bω
      set FIRST(A) = FIRST(A) ∪ FIRST(B)
This is known as a fixed-point computation
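The fixed-point loop can be sketched in Python as follows. It is a sketch under the slide's assumption of no null productions; the dictionary encoding of the grammar and the name `first_sets` are illustrative choices, not part of the lecture.

```python
def first_sets(grammar):
    """Compute FIRST(A) for every nonterminal A.

    `grammar` maps each nonterminal to a list of right-hand sides, each a
    non-empty list of symbols (assumption: no null productions).
    """
    # Initialization: terminals that directly start some alternative of A.
    first = {
        a: {rhs[0] for rhs in alternatives if rhs[0] not in grammar}
        for a, alternatives in grammar.items()
    }
    changed = True
    while changed:                      # repeat until no set grows
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                b = rhs[0]
                if b in grammar and not first[b] <= first[a]:
                    first[a] |= first[b]    # FIRST(A) = FIRST(A) ∪ FIRST(B)
                    changed = True
    return first


BOOLEAN_GRAMMAR = {
    "E": [["LIT"], ["(", "E", "OP", "E", ")"], ["not", "E"]],
    "LIT": [["true"], ["false"]],
    "OP": [["and"], ["or"], ["xor"]],
}
```

Calling `first_sets(BOOLEAN_GRAMMAR)` gives FIRST(E) the four tokens `true`, `false`, `(` and `not`. Because each pass only ever adds tokens to finite sets, the loop must terminate, even for left-recursive rules.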

LL(k) grammars • A grammar is in the class LL(k) when its sentences can be parsed via: • Top-down derivation • Scanning the input from left to right (L) • Producing the leftmost derivation (L) • With lookahead of k tokens (k) • For k = 1: for every two productions A → α and A → β we have FIRST(α) ∩ FIRST(β) = {} (and FIRST(A) ∩ FOLLOW(A) = {} for null productions) • A language is said to be LL(k) when it has an LL(k) grammar • What can we do if grammar is not LL(k)?
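The disjointness condition on FIRST sets can be checked mechanically. The sketch below assumes a grammar with no null productions and no left recursion (the recursive `first` helper would not terminate otherwise); the function names and the second grammar are hypothetical examples.

```python
from itertools import combinations


def first(symbol, grammar):
    """FIRST of a single symbol; assumes no null productions and no left recursion."""
    if symbol not in grammar:           # a terminal starts with itself
        return {symbol}
    result = set()
    for rhs in grammar[symbol]:
        result |= first(rhs[0], grammar)
    return result


def ll1_conflicts(grammar):
    """List (nonterminal, alternative, alternative, shared tokens) for overlapping FIRST sets."""
    conflicts = []
    for a, alternatives in grammar.items():
        for alpha, beta in combinations(alternatives, 2):
            shared = first(alpha[0], grammar) & first(beta[0], grammar)
            if shared:
                conflicts.append((a, alpha, beta, shared))
    return conflicts


BOOLEAN_GRAMMAR = {
    "E": [["LIT"], ["(", "E", "OP", "E", ")"], ["not", "E"]],
    "LIT": [["true"], ["false"]],
    "OP": [["and"], ["or"], ["xor"]],
}

# Hypothetical grammar whose two alternatives both start with 'a'.
COMMON_PREFIX_GRAMMAR = {"S": [["a", "b"], ["a", "c"]]}
```

The Boolean grammar yields no conflicts, while `COMMON_PREFIX_GRAMMAR` reports one for `S`, the kind of case that left-factoring is meant to remove.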

LL(k) parsing via pushdown automata • Pushdown automaton uses • Prediction stack • Input stream • Transition table • nonterminals × tokens → production alternative • Entry indexed by nonterminal N and token t contains the alternative of N that must be predicted when current input starts with t

LL(k) parsing via pushdown automata • Two possible moves • Prediction • When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error • Match • When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t • If (t ≠ T) syntax error • Parsing terminates when prediction stack is empty • If input is empty at that point, success. Otherwise, syntax error

Model of non-recursive predictive parser • [Figure: predictive parsing program connected to a stack, a parsing table, and the output]

Example transition table
(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor
[Table: rows are nonterminals, columns are input tokens; each entry says which rule should be used]
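The table entries were lost from this slide, but they follow from the FIRST sets of the eight rules. The sketch below rebuilds them under the assumption of no null productions; `RULES`, `first`, and `build_table` are illustrative names, and the recursive `first` relies on this grammar having no left recursion.

```python
# Rules numbered as on the slide: (1) .. (8).
RULES = [
    ("E", ["LIT"]),
    ("E", ["(", "E", "OP", "E", ")"]),
    ("E", ["not", "E"]),
    ("LIT", ["true"]),
    ("LIT", ["false"]),
    ("OP", ["and"]),
    ("OP", ["or"]),
    ("OP", ["xor"]),
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def first(symbol):
    """FIRST of one symbol (assumes no null productions, no left recursion)."""
    if symbol not in NONTERMINALS:
        return {symbol}
    result = set()
    for lhs, rhs in RULES:
        if lhs == symbol:
            result |= first(rhs[0])
    return result


def build_table():
    """Map (nonterminal, token) to the number of the rule to predict."""
    table = {}
    for number, (lhs, rhs) in enumerate(RULES, start=1):
        for token in first(rhs[0]):
            if (lhs, token) in table:       # two rules compete for one entry
                raise ValueError(f"conflict at table[{lhs}, {token}]")
            table[(lhs, token)] = number
    return table
```

Any pair missing from the returned dictionary corresponds to an empty table entry, which the parser treats as a syntax error.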

Running parser example • Input: aacbb$ • Grammar: A → aAb | c

Illegal input example • Input: abcbb$ • Grammar: A → aAb | c
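Both runs can be reproduced with a small table-driven parser. This is a sketch of the prediction/match loop described earlier, assuming single-character tokens and an input that ends with `$`; the table was written by hand for A → aAb | c, and `ll1_parse` is an illustrative name.

```python
# Prediction table for A -> aAb | c; missing entries mean "syntax error".
TABLE = {
    ("A", "a"): ["a", "A", "b"],
    ("A", "c"): ["c"],
}
NONTERMINALS = {"A"}


def ll1_parse(text, start="A"):
    """Return True if `text` (which must end with '$') is derivable from `start`."""
    if not text.endswith("$"):
        raise ValueError("input must end with the end marker '$'")
    tokens = list(text)
    stack = [start]                 # prediction stack, top is the last element
    pos = 0
    while stack:
        top = stack.pop()
        token = tokens[pos]
        if top in NONTERMINALS:     # prediction move
            alternative = TABLE.get((top, token))
            if alternative is None:
                return False
            stack.extend(reversed(alternative))
        elif top == token:          # match move
            pos += 1
        else:
            return False
    return tokens[pos] == "$"       # stack empty: succeed only if input is used up
```

On `aacbb$` the parser predicts A → aAb twice and then A → c before matching the two `b` tokens. On `abcbb$` it fails after the first `a`, because the entry for A and `b` is empty.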

Using top-down parsing approach • Compute parsing table • If table is conflict-free then we have an LL(k) parser • If table contains conflicts investigate • If grammar is ambiguous try to disambiguate • Try using left-factoring/substitution/left-recursion elimination to remove conflicts

Marking "end-of-file" • Sometimes it will be useful to transform a grammar G with start non-terminal S into a grammar G' with a new start non-terminal S' and a new production rule S' → S $, where $ is not part of the set of tokens • To parse an input P with G' we change it into P$ • Simplifies top-down parsing with null productions and LR parsing

Bottom-up parsing • No problem with left recursion • Widely used in practice • Shift-reduce parsing: LR(k), SLR, LALR • All follow the same pushdown-based algorithm • Read input left-to-right producing rightmost derivation • Differ on type of “LR Items” • Parser generator CUP implements LALR

Some terminology • The opposite of derivation is called reduction • Let A → α be a production rule • Let βAµ be a sentence • Replace left-hand side of rule in sentence: βAµ => βαµ • A handle is a substring that is reduced during a series of steps in a rightmost derivation

Rightmost derivation example
Grammar: E → E + ( E ), E → i
Input: 1 + (2) + (3)
Rightmost derivation in reverse:
1 + (2) + (3)
E + (2) + (3)
E + (E) + (3)
E + (3)
E + (E)
E
Each non-leaf node of the parse tree represents a handle
[Figure: parse tree for 1 + (2) + (3)]

LR item • N → α·β • Hypothesis about αβ being a possible handle: so far we've matched α, expecting to see β • [Figure: the item N → α·β over the input, with α marked as already matched and β as to be matched]

LR items • N → α·β: shift item • N → αβ·: reduce item

Example
Z → expr EOF
expr → term | expr + term
term → ID | ( expr )

Z → E $
E → T | E + T
T → i | ( E )
(just shorthand of the grammar on the top)

Example: parsing with LR items
Grammar: Z → E $, E → T | E + T, T → i | ( E )
Input: i + i $
Initial item: Z → ·E $
Additional items: E → ·T, E → ·E + T, T → ·i, T → ·( E )
Why do we need these additional LR items? Where do they come from? What do they mean?

ε-closure
Given a set S of LR(0) items: if P → α·Nβ is in S, then for each rule N → γ in the grammar, S must also contain N → ·γ
Example: ε-closure({ Z → ·E $ }) = { Z → ·E $, E → ·T, E → ·E + T, T → ·i, T → ·( E ) }
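The closure rule can be sketched with a worklist. The encoding of an item as a `(left-hand side, right-hand side, dot position)` triple and the names `closure` and `show` are assumptions made for this example.

```python
GRAMMAR = {
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}


def closure(items):
    """ε-closure of a set of LR(0) items (lhs, rhs, dot)."""
    result = set(items)
    worklist = list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        # If the dot stands before a nonterminal N, add N -> ·γ for every rule of N.
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return frozenset(result)


def show(item):
    """Render an item with '·' marking the dot position."""
    lhs, rhs, dot = item
    return f"{lhs} -> {' '.join(rhs[:dot] + ('·',) + rhs[dot:])}"
```

Starting from the single item Z → ·E $, `closure` returns the five items listed in the example. The extra items exist because expecting an E means expecting whatever an E can begin with.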

Example: parsing with LR items
Grammar: Z → E $, E → T | E + T, T → i | ( E )
Input: i + i $ (the dot · marks how much of an item has been matched)
Step 1: Input i + i $. Remember the position from which we're trying to reduce; items denote possible future handles. Items: Z → ·E $, E → ·T, E → ·E + T, T → ·i, T → ·( E )
Step 2: Match items with the current token i: T → ·i becomes T → i·. Reduce item! i is reduced to T.
Step 3: Input T + i $. From the initial items, T gives E → T·. Reduce item! T is reduced to E.
Step 4: Input E + i $. From the initial items, E gives Z → E·$ and E → E·+ T.
Step 5: Matching + gives E → E +·T, which adds T → ·i and T → ·( E ).
Step 6: Matching i gives T → i·, which is reduced to T. Input E + T $.
Step 7: T gives E → E + T·. Reduce item! E + T is reduced to E.
Step 8: Input E $. From the initial items, E gives Z → E·$ and E → E·+ T.
Step 9: Matching $ gives Z → E $·. Reduce item! E $ is reduced to Z.
Step 10: Input Z. Parsing is complete.

Computing item sets • Initial set • Z is the start symbol • ε-closure({ Z → ·α | Z → α is in the grammar }) • Next set from a set S and the next symbol X • step(S,X) = { N → αX·β | N → α·Xβ is in the item set S } • nextSet(S,X) = ε-closure(step(S,X))
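These definitions are enough to enumerate every item set of the example grammar. The sketch below assumes items are `(lhs, rhs, dot)` triples; `item_sets` and its discovery order are illustrative, so the state numbers it produces need not match any particular figure.

```python
GRAMMAR = {
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}


def closure(items):
    """ε-closure of a set of LR(0) items (lhs, rhs, dot)."""
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return frozenset(result)


def step(items, symbol):
    """Move the dot over `symbol` in every item that expects it next."""
    return {
        (lhs, rhs, dot + 1)
        for lhs, rhs, dot in items
        if dot < len(rhs) and rhs[dot] == symbol
    }


def next_set(items, symbol):
    return closure(step(items, symbol))


def item_sets(start="Z"):
    """Return (states, transitions) reachable from the initial item set."""
    states = [closure({(start, rhs, 0) for rhs in GRAMMAR[start]})]
    transitions = {}
    for index, state in enumerate(states):      # the list grows while we iterate
        symbols = {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}
        for symbol in sorted(symbols):
            target = next_set(state, symbol)
            if target not in states:
                states.append(target)
            transitions[(index, symbol)] = states.index(target)
    return states, transitions
```

For this grammar the loop discovers ten item sets and fifteen transitions.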

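LR(0) automaton example
[Figure: LR(0) automaton; the dot · marks the position reached in each item]
Shift states:
q0: Z → ·E $, E → ·T, E → ·E + T, T → ·i, T → ·( E )
q1: Z → E·$, E → E·+ T
q3: E → E +·T, T → ·i, T → ·( E )
q7: T → (·E ), E → ·T, E → ·E + T, T → ·i, T → ·( E )
q8: T → ( E·), E → E·+ T
Reduce states:
q2: Z → E $·
q4: E → E + T·
q5: T → i·
q6: E → T·
q9: T → ( E )·
Transitions:
q0: E → q1, T → q6, i → q5, ( → q7
q1: + → q3, $ → q2
q3: T → q4, i → q5, ( → q7
q7: E → q8, T → q6, i → q5, ( → q7
q8: + → q3, ) → q9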

GOTO/ACTION tables • [Tables: ACTION table and GOTO table; an empty entry means an error move]

LR(0) parser tables • Two types of rows: • Shift row – tells which state to GOTO for current token • Reduce row – tells which rule to reduce (independent of current token) • GOTO entries are blank
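The two kinds of rows can be seen at work in a small LR(0) driver. This is a sketch rather than a generated parser: it derives the states directly from the item sets instead of materializing ACTION/GOTO tables, assumes the grammar Z → E $, E → T | E + T, T → i | ( E ) with single-character tokens, and uses illustrative function names.

```python
GRAMMAR = {
    "Z": [("E", "$")],
    "E": [("T",), ("E", "+", "T")],
    "T": [("i",), ("(", "E", ")")],
}


def closure(items):
    """ε-closure of a set of LR(0) items (lhs, rhs, dot)."""
    result, worklist = set(items), list(items)
    while worklist:
        _, rhs, dot = worklist.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for production in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], production, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return frozenset(result)


def build_automaton():
    """Enumerate item sets and the transitions between them."""
    states = [closure({("Z", rhs, 0) for rhs in GRAMMAR["Z"]})]
    transitions = {}
    for index, state in enumerate(states):      # the list grows while we iterate
        for symbol in sorted({r[d] for _, r, d in state if d < len(r)}):
            target = closure(
                {(l, r, d + 1) for l, r, d in state if d < len(r) and r[d] == symbol}
            )
            if target not in states:
                states.append(target)
            transitions[(index, symbol)] = states.index(target)
    return states, transitions


def lr0_parse(tokens):
    """Return True if `tokens` (ending with '$') form a sentence of the grammar."""
    states, transitions = build_automaton()
    stack = [0]                                 # stack of state numbers
    pos = 0
    while True:
        state = states[stack[-1]]
        complete = [(lhs, rhs) for lhs, rhs, dot in state if dot == len(rhs)]
        if complete:                            # reduce row: ignores the current token
            lhs, rhs = complete[0]              # this grammar is LR(0): one candidate
            if lhs == "Z":
                return pos == len(tokens)       # accept once Z -> E $ is recognized
            del stack[len(stack) - len(rhs):]   # pop one state per right-hand-side symbol
            stack.append(transitions[(stack[-1], lhs)])     # GOTO on the nonterminal
        else:                                   # shift row: depends on the current token
            if pos >= len(tokens):
                return False
            target = transitions.get((stack[-1], tokens[pos]))
            if target is None:                  # empty entry: error move
                return False
            stack.append(target)
            pos += 1
```

For instance, `lr0_parse(list("i+i$"))` performs the same sequence of shifts and reductions as the worked example, and `lr0_parse(list("i+$"))` stops at an empty entry after shifting `+`.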
