Bottom-Up Parsing

Bottom-Up Parsing CS 471 September 19, 2007

Where Are We? • Finished Top-Down Parsing • Starting Bottom-Up Parsing • Compiler phases: Lexical Analysis → Syntactic Analysis → Semantic Analysis

Building a Parser • Have a complete recipe for building a parser: Language Grammar → LL(1) Grammar → Predictive Parse Table → Recursive-Descent Parser → Recursive-Descent Parser w/AST Gen

Bottom-Up Parsing • More general than top-down parsing • And just as efficient • Builds on ideas in top-down parsing • Preferred method in practice • Also called LR parsing • L means that tokens are read left to right • R means that it constructs a rightmost derivation

Top Down vs. Bottom Up Parsing • Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input • [Figure: known portion of the parse tree over scanned and unscanned input, Top Down compared with Bottom Up]

An Introductory Example • Consider the following grammar: E → E + ( E ) | int • Why is this not LL(1)? • LR parsers: • Can handle left-recursion • Don’t need left factoring

The Idea • LR parsing reduces a string to the start symbol by inverting productions: • str = input string of terminals • repeat • Identify β in str such that A → β is a production (i.e., str = α β γ) • Replace β by A in str (i.e., str becomes α A γ) • until str is the start symbol
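The loop above can be sketched in Python for the introductory grammar E → E + ( E ) | int. This is only an illustrative sketch: the names `PRODUCTIONS`, `reduce_once`, and `reduce_to_start` are made up here, and the rule for choosing β (try productions in listed order, take the leftmost match) is an assumption that happens to work for this grammar but not for grammars in general.

```python
# Productions of the introductory grammar, longest right-hand side first.
PRODUCTIONS = [
    ("E", ["E", "+", "(", "E", ")"]),
    ("E", ["int"]),
]

def reduce_once(symbols):
    """Try each production in order and replace its leftmost occurrence.

    Returns the new symbol list, or None if no right-hand side occurs.
    """
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        for i in range(len(symbols) - n + 1):
            if symbols[i:i + n] == rhs:
                return symbols[:i] + [lhs] + symbols[i + n:]
    return None

def reduce_to_start(tokens, start="E"):
    """Return every string visited on the way to the start symbol, or None if stuck."""
    current = list(tokens)
    trace = [current]
    while current != [start]:
        current = reduce_once(current)
        if current is None:
            return None
        trace.append(current)
    return trace

for step in reduce_to_start("int + ( int ) + ( int )".split()):
    print(" ".join(step))
```

Each printed line is one intermediate string; the last one is the start symbol E.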

A Bottom-up Parse in Detail (1) • int + (int) + (int)

A Bottom-up Parse in Detail (2) • int + (int) + (int) • E + (int) + (int)

A Bottom-up Parse in Detail (3) • int + (int) + (int) • E + (int) + (int) • E + (E) + (int)

A Bottom-up Parse in Detail (4) • int + (int) + (int) • E + (int) + (int) • E + (E) + (int) • E + (int)

A Bottom-up Parse in Detail (5) • int + (int) + (int) • E + (int) + (int) • E + (E) + (int) • E + (int) • E + (E)

A Bottom-up Parse in Detail (6) • int + (int) + (int) • E + (int) + (int) • E + (E) + (int) • E + (int) • E + (E) • E

Another example • Grammar: [figure not preserved] • Is “abbcde” in L(G)? • Yes, by a “reverse” derivation

Choosing reductions • Basic algorithm: • Search for right sides of productions, reduce • Does this work? • Not always: • Problem: “aAAcde” is not part of any sentential form

How do we choose? • Important Fact #1 about bottom-up parsing: • An LR parser traces a rightmost derivation in reverse

Why does this help? • Right-most derivation: start symbol → γ1 → γ2 → γ3 → γ4 → γ5 → input • A step that applies A → β rewrites α A w to α β w • A is the right-most non-terminal in γ3 • w contains only terminal symbols • Unambiguous grammar • Right-most derivation is unique • At each step, the reduction is unique

Notation • Split input into two substrings • Right substring (a string of terminals) is as yet unexamined by the parser • Left substring has terminals and non-terminals • The dividing point is marked by a | • The | is not part of the string (and is distinct from the | that separates grammar alternatives) • Initially, all input is unexamined: | x1 x2 . . . xn

Shift-Reduce Parsing • Bottom-up parsing uses only two kinds of actions: Shift and Reduce • Shift: Move | one place to the right • Shifts a terminal to the left string • E + ( | int ) ⇒ E + ( int | ) • Reduce: Apply an inverse production at the right end of the left string • If E → E + ( E ) is a production, then • E + ( E + ( E ) | ) ⇒ E + ( E | )

Shift-Reduce Example
• | int + (int) + (int) $     shift
• int | + (int) + (int) $     reduce E → int
• E | + (int) + (int) $       shift 3 times
• E + (int | ) + (int) $      reduce E → int
• E + (E | ) + (int) $        shift
• E + (E) | + (int) $         reduce E → E + (E)
• E | + (int) $               shift 3 times
• E + (int | ) $              reduce E → int
• E + (E | ) $                shift
• E + (E) | $                 reduce E → E + (E)
• E | $                       accept

How do we keep track? • The left substring is implemented as a stack • Top of the stack is the | • Shift: • Pushes a terminal on the stack • Reduce: • Pops 0 or more symbols off of the stack • Symbols are the right-hand side of a production • Pushes a non-terminal on the stack (production LHS) • Terminology • We refer to the symbols on top of the stack that are about to be reduced as a handle
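A minimal sketch of the stack discipline just described, for the grammar E → E + ( E ) | int. The function name `shift_reduce` and the policy "reduce whenever the top of the stack matches a right-hand side, otherwise shift" are assumptions made for illustration; that greedy policy is enough for this grammar only, and the slides that follow explain how a real LR parser decides.

```python
# Productions of E -> E + ( E ) | int, longest right-hand side first.
PRODUCTIONS = [
    ("E", ["E", "+", "(", "E", ")"]),
    ("E", ["int"]),
]

def shift_reduce(tokens):
    """Return (accepted, actions) for a greedy shift-reduce run over tokens."""
    stack = []            # left substring: terminals and non-terminals
    rest = list(tokens)   # right substring: unexamined terminals
    actions = []
    while True:
        for lhs, rhs in PRODUCTIONS:
            # Reduce: the handle must sit at the top of the stack.
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                actions.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not rest:
                break
            # Shift: move one terminal from the input onto the stack.
            stack.append(rest.pop(0))
            actions.append("shift")
    return stack == ["E"], actions

accepted, actions = shift_reduce("int + ( int ) + ( int )".split())
print(accepted)
print("\n".join(actions))
```

The action list mirrors the shift-reduce example: nine shifts interleaved with five reductions.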

Shift-Reduce Parsing
derivation           stack     input stream     action
(1+2+(3+4))+5 ←      (empty)   (1+2+(3+4))+5    shift
(1+2+(3+4))+5 ←      (         1+2+(3+4))+5     shift
(1+2+(3+4))+5 ←      (1        +2+(3+4))+5      reduce E → num
(E+2+(3+4))+5 ←      (E        +2+(3+4))+5      reduce S → E
(S+2+(3+4))+5 ←      (S        +2+(3+4))+5      shift
(S+2+(3+4))+5 ←      (S+       2+(3+4))+5       shift
(S+2+(3+4))+5 ←      (S+2      +(3+4))+5        reduce E → num
(S+E+(3+4))+5 ←      (S+E      +(3+4))+5        reduce S → S+E
(S+(3+4))+5 ←        (S        +(3+4))+5        shift
(S+(3+4))+5 ←        (S+       (3+4))+5         shift
(S+(3+4))+5 ←        (S+(      3+4))+5          shift
(S+(3+4))+5 ←        (S+(3     +4))+5           reduce E → num

Problem • How do we know which action to take: whether to shift or reduce, and which production? • Sometimes can reduce but shouldn’t • e.g., X → ε can always be reduced • Sometimes can reduce in different ways

Action Selection Problem • Given stack σ and look-ahead symbol b, should the parser: • shift b onto the stack (making it σ b) • reduce some production X → γ, assuming that the stack has the form α γ (making it α X) • If the stack has the form α γ, should apply reduction X → γ (or shift) depending on the stack prefix α • α is different for different possible reductions, since the γ’s have different lengths • How to keep track of possible reductions?

Parser States • Goal: know what reductions are legal at any given point • Idea: summarize all possible stack prefixes α as a finite parser state • Parser state is computed by a DFA that reads in the stack • Accept states of DFA: unique reduction! • Summarizing discards information • affects what grammars the parser handles • affects size of DFA (number of states)

LR(0) Parser • Left-to-right scanning, Right-most derivation, “zero” look-ahead characters • Too weak to handle most language grammars (e.g., “sum” grammar) • But will help us understand shift-reduce parsing

LR(0) States • A state is a set of items keeping track of progress on possible upcoming reductions • An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production • Stuff before “.” is already on the stack (beginnings of possible γ’s to be reduced) • Stuff after “.”: what we might see next • The prefixes α are represented by the state itself • Example: a state containing the two items E → num . and E → ( . S )

Start State & Closure • Grammar: S → ( L ) | id, L → S | L , S • Constructing a DFA to read the stack: • First step: augment grammar with production S' → S $ • Start state of DFA: empty stack = S' → . S $ • Closure of a state adds items for all productions whose LHS occurs in an item in the state, just after “.” • Set of possible productions to be reduced next • Added items have the “.” located at the beginning: no symbols for these items are on the stack yet • Example: closure({ S' → . S $ }) = { S' → . S $, S → . ( L ), S → . id }

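Applying Terminal Symbols • Grammar: S → ( L ) | id, L → S | L , S • In the new state, include all items that have the appropriate input symbol just after the dot, advance the dot in those items, and take the closure • Example: from { S' → . S $, S → . ( L ), S → . id }, ( leads to { S → ( . L ), L → . S, L → . L , S, S → . ( L ), S → . id } and id leads to { S → id . }; from the ( state, ( loops back to the same state and id again leads to { S → id . }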

Applying Nonterminal Symbols • Grammar: S → ( L ) | id, L → S | L , S • Non-terminals on the stack are treated just like terminals (except that they are added by reductions) • Example: from { S → ( . L ), L → . S, L → . L , S, S → . ( L ), S → . id }, L leads to { S → ( L . ), L → L . , S } and S leads to { L → S . }
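The closure and transition steps for the list grammar can be sketched as follows. The representation is an assumption made for illustration: an item is a pair (production index, dot position), and `closure`, `goto`, and `build_states` are hypothetical helper names. Transitions on $ are left out, so that reading $ in the state containing S' → S . $ is treated as acceptance rather than as a move to another state.

```python
# Augmented list grammar: production 0 is S' -> S $.
GRAMMAR = [
    ("S'", ("S", "$")),
    ("S", ("(", "L", ")")),
    ("S", ("id",)),
    ("L", ("S",)),
    ("L", ("L", ",", "S")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add X -> . gamma for every non-terminal X that appears right after a dot."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (i, 0) not in result:
                    result.add((i, 0))
                    work.append((i, 0))
    return frozenset(result)

def goto(state, symbol):
    """Advance the dot over symbol in every item that allows it, then close."""
    moved = {(p, d + 1) for p, d in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)

def build_states():
    """Collect all states reachable from the start state, plus the edges."""
    states = [closure({(0, 0)})]
    edges = {}
    for state in states:  # the list grows while we iterate over it
        symbols = {GRAMMAR[p][1][d] for p, d in state if d < len(GRAMMAR[p][1])}
        for sym in sorted(symbols - {"$"}):
            target = goto(state, sym)
            if target not in states:
                states.append(target)
            edges[(states.index(state), sym)] = states.index(target)
    return states, edges

states, edges = build_states()
print(len(states), "states,", len(edges), "transitions")
```

The state indices produced here follow discovery order and need not match the numbering used on the slides.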

Applying Reduce Actions • Pop the RHS γ off the stack and replace it with the LHS X (for X → γ) • States causing reductions are those that contain a completed item, for example { L → S . } and { S → id . }

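Full DFA (Appel p. 62)
• Grammar: S → ( L ) | id, L → S | L , S
• State 1: S' → . S $, S → . ( L ), S → . id
• State 2: S → id .
• State 3: S → ( . L ), L → . S, L → . L , S, S → . ( L ), S → . id
• State 4: S' → S . $
• State 5: S → ( L . ), L → L . , S
• State 6: S → ( L ) .
• State 7: L → S .
• State 8: L → L , . S, S → . ( L ), S → . id
• State 9: L → L , S .
• Transitions from 1: ( to 3, id to 2, S to 4
• Transitions from 3: ( to 3, id to 2, S to 7, L to 5
• Transitions from 5: ) to 6, "," to 8
• Transitions from 8: ( to 3, id to 2, S to 9
• Final state: reached from state 4 on $
• Reduce-only state: reduce
• If there is a shift transition for the look-ahead: shift; otherwise: syntax error
• Current state: push the stack through the DFA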

Parsing Example: ((x),y) • Grammar: S → ( L ) | id, L → S | L , S
derivation     stack                 input      action
((x),y) ←      1                     ((x),y)    shift, goto 3
((x),y) ←      1 (3                  (x),y)     shift, goto 3
((x),y) ←      1 (3 (3               x),y)      shift, goto 2
((x),y) ←      1 (3 (3 x2            ),y)       reduce S → id
((S),y) ←      1 (3 (3 S7            ),y)       reduce L → S
((L),y) ←      1 (3 (3 L5            ),y)       shift, goto 6
((L),y) ←      1 (3 (3 L5 )6         ,y)        reduce S → ( L )
(S,y) ←        1 (3 S7               ,y)        reduce L → S
(L,y) ←        1 (3 L5               ,y)        shift, goto 8
(L,y) ←        1 (3 L5 ,8            y)         shift, goto 2
(L,y) ←        1 (3 L5 ,8 y2         )          reduce S → id
(L,S) ←        1 (3 L5 ,8 S9         )          reduce L → L , S
(L) ←          1 (3 L5               )          shift, goto 6
(L) ←          1 (3 L5 )6            (empty)    reduce S → ( L )
S              1 S4                  $          done

Implementation: LR Parsing Table • Action table: rows are states, columns are input (terminal) symbols, entries give the next action • Used at every step to decide whether to shift or reduce • Goto table: rows are states, columns are non-terminal symbols, entries give the next state • Used only when reducing, to determine the next state

Shift-Reduce Parsing Table • Action table (columns: terminal symbols; entries: next actions) • 1. shift and goto state n • 2. reduce using X → γ • pop the symbols γ off the stack • using the state label at the top (end) of the stack, look up X in the goto table (columns: non-terminal symbols; entries: next state on reduction) and goto that state • DFA + stack = push-down automaton (PDA)

List Grammar Parsing Table
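One way to write down a parsing table for the list grammar S → ( L ) | id, L → S | L , S is as three small Python dictionaries, with a driver loop that keeps a stack of states. The dictionaries `SHIFT`, `REDUCE`, and `GOTO` and the function `parse` are an illustrative encoding assumed here, not the layout of the table on the slide; the state numbers are taken from the DFA and the ((x),y) trace, with x and y supplied as id tokens.

```python
# Action table, shift part: state -> {terminal: next state}.
SHIFT = {
    1: {"(": 3, "id": 2},
    3: {"(": 3, "id": 2},
    5: {")": 6, ",": 8},
    8: {"(": 3, "id": 2},
}
# Action table, reduce part: state -> (LHS, length of RHS, label).
REDUCE = {
    2: ("S", 1, "S -> id"),
    6: ("S", 3, "S -> ( L )"),
    7: ("L", 1, "L -> S"),
    9: ("L", 3, "L -> L , S"),
}
# Goto table: state -> {non-terminal: next state}.
GOTO = {1: {"S": 4}, 3: {"S": 7, "L": 5}, 8: {"S": 9}}
ACCEPT_STATE = 4

def parse(tokens):
    """Return the reductions performed, in order; raise SyntaxError on bad input."""
    tokens = list(tokens) + ["$"]
    states = [1]          # stack of DFA states; state 1 is the start state
    pos = 0
    reductions = []
    while True:
        state, look = states[-1], tokens[pos]
        if state in REDUCE:
            lhs, length, label = REDUCE[state]
            del states[-length:]                      # pop the handle
            states.append(GOTO[states[-1]][lhs])      # goto on the LHS
            reductions.append(label)
        elif state == ACCEPT_STATE and look == "$":
            return reductions
        elif look in SHIFT.get(state, {}):
            states.append(SHIFT[state][look])
            pos += 1
        else:
            raise SyntaxError(f"unexpected {look!r} in state {state}")

for step in parse(["(", "(", "id", ")", ",", "id", ")"]):
    print("reduce", step)
```

Because only states are kept on the stack, popping the handle automatically exposes the state whose goto entry is needed next.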

Shift-Reduce Parsing • Grammars can be parsed bottom-up using a DFA + stack • The DFA processes the stack σ to decide what reductions might be possible • The result is a shift-reduce parser, i.e. a push-down automaton (PDA) • Compactly represented as an LR parsing table • State construction converts the grammar into states that decide which action to take

Checkpoint • Limitations of LR(0) grammars • SLR, LR(1), LALR parsers • Automatic parser generators

LR(0) Limitations • An LR(0) machine only works if states with reduce actions have a single reduce action; in those states, always reduce, ignoring look-ahead • With a more complex grammar, the construction gives states with shift/reduce or reduce/reduce conflicts • Need to use look-ahead to choose • Examples: { L → L , S . } is ok; { L → L , S . ; S → S . , L } has a shift/reduce conflict; { L → L , S . ; L → S . } has a reduce/reduce conflict

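LR(0) Construction • Grammar: S → E + S | E, E → num | ( S ) • State 1: { S' → . S $, S → . E + S, S → . E, E → . num, E → . ( S ) } • State 2 (from state 1 on E): { S → E . + S, S → E . } • State 3 (from state 2 on +): { S → E + . S, ... } • What do we do in state 2?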

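SLR grammars • Idea: Only add a reduce action to the table if the look-ahead symbol is in the FOLLOW set of the non-terminal being reduced • Eliminates some conflicts • For the grammar S → E + S | E, E → num | ( S ): FOLLOW(S) = { $, ) }, so the state { S → E . + S, S → E . } reduces by S → E only on $ or ) and shifts on + • Many language grammars are SLR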
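The FOLLOW sets used by the SLR rule can be computed with a small fixed-point iteration. The sketch below assumes the augmented grammar S' → S $, S → E + S | E, E → num | ( S ) and relies on the fact that this grammar has no empty productions, which keeps FIRST simple; `first_sets` and `follow_sets` are names chosen here for illustration.

```python
GRAMMAR = [
    ("S'", ("S", "$")),
    ("S", ("E", "+", "S")),
    ("S", ("E",)),
    ("E", ("num",)),
    ("E", ("(", "S", ")")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first_sets():
    """FIRST for each non-terminal, assuming no production is empty."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

def follow_sets():
    """FOLLOW for each non-terminal, iterated until nothing changes."""
    first = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    nxt = rhs[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    new = follow[lhs]   # sym ends the production
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow

follow = follow_sets()
print(sorted(follow["S"]))
print(sorted(follow["E"]))
```

With these sets, a reduce action for S → E is entered only under the symbols in FOLLOW(S), which leaves + free for the shift.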

LR(1) Parsing • Gets as much power as possible out of a parsing table with 1 look-ahead symbol • LR(1) grammar = recognizable by a shift/reduce parser with 1 look-ahead symbol • LR(1) item = LR(0) item + look-ahead symbols possibly following the production • LR(0) item: S → . S + E • LR(1) item: S → . S + E with look-ahead +

LR(1) State • LR(1) state = set of LR(1) items • LR(1) item = LR(0) item + set of look-ahead symbols • No two items in a state have the same production + dot configuration • Example: the items S → S . + E [+] and S → S . + E [$], together with S → S + . E [num], are written as the state { S → S . + E [+, $], S → S + . E [num] }

LR(1) Closure • Consider an item A → β . C δ with look-ahead λ • Closure is formed just as for LR(0), except: • Look-ahead symbols of the added items include the symbols that can follow the non-terminal to the right of the dot: FIRST(δ) • If δ is nullable (so C may produce the last symbols of the production), the look-ahead symbols also include the look-ahead symbols of the item (λ) • Example for S → E + S | E, E → num | ( S ): state 1 = { S' → . S $, S → . E + S [$], S → . E [$], E → . num [+, $], E → . ( S ) [+, $] }
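The LR(1) closure rule can be sketched for the grammar S → E + S | E, E → num | ( S ). Several things here are assumptions for illustration: an item is a triple (production index, dot position, look-ahead), the start item carries a placeholder look-ahead "?", and because this grammar has no empty productions the nullable case reduces to "δ is empty". The names `first`, `closure1`, and `grouped` are hypothetical.

```python
GRAMMAR = [
    ("S'", ("S", "$")),
    ("S", ("E", "+", "S")),
    ("S", ("E",)),
    ("E", ("num",)),
    ("E", ("(", "S", ")")),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first(symbol, seen=frozenset()):
    """FIRST of a single symbol; valid here because no production is empty."""
    if symbol not in NONTERMINALS:
        return {symbol}
    if symbol in seen:
        return set()
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol:
            result |= first(rhs[0], seen | {symbol})
    return result

def closure1(items):
    """LR(1) closure over items of the form (production, dot, look-ahead)."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot, look = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot >= len(rhs) or rhs[dot] not in NONTERMINALS:
            continue
        delta = rhs[dot + 1:]
        # FIRST(delta) if delta is non-empty, otherwise the item's own look-ahead.
        lookaheads = first(delta[0]) if delta else {look}
        for i, (lhs, _) in enumerate(GRAMMAR):
            if lhs != rhs[dot]:
                continue
            for la in lookaheads:
                item = (i, 0, la)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return result

def grouped(items):
    """Combine items that share production and dot into one look-ahead set."""
    out = {}
    for prod, dot, la in items:
        out.setdefault((prod, dot), set()).add(la)
    return out

for (prod, dot), las in sorted(grouped(closure1({(0, 0, "?")})).items()):
    lhs, rhs = GRAMMAR[prod]
    print(lhs, "->", " ".join(rhs[:dot] + (".",) + rhs[dot:]), sorted(las))
```

Grouping by production and dot gives the compact form used on the slides, with one look-ahead set per item.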

LR(1) DFA construction • Given an LR(1) state, for each symbol (terminal or non-terminal) following a dot, construct a state with the dot shifted across that symbol, then perform closure • Example for S → E + S | E, E → num | ( S ): state 1 = { S' → . S $, S → . E + S [$], S → . E [$], E → . num [+, $], E → . ( S ) [+, $] } goes on E to state 2 = { S → E . + S [$], S → E . [$] }

LR(1) example • Reductions are unambiguous if the look-aheads of the completed items are disjoint from each other and do not appear to the right of any dot in the state • For S → E + S | E, E → num | ( S ), state 2 = { S → E . + S [$], S → E . [$] }: shift on +, reduce by S → E on $

LALR Grammars • Problem with LR(1): too many states • LALR(1) (Look-Ahead LR) • Merge any two LR(1) states whose items are identical except for look-ahead • Results in smaller parser tables; works extremely well in practice • Usual technology for automatic parser generators • Exercise: what is { S → id . [+], S → E . [$] } merged with { S → id . [$], S → E . [+] }?
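Merging by core can be sketched with plain dictionaries. In this illustration a state maps each item, written as a string, to its set of look-aheads; `merge_by_core` and `reduce_reduce_conflicts` are hypothetical helpers, and the two input states are the pair of completed-item states from the slide, with look-aheads + and $ swapped between them.

```python
from collections import defaultdict

def merge_by_core(states):
    """Merge states whose items agree once look-aheads are ignored."""
    merged = defaultdict(lambda: defaultdict(set))
    for state in states:
        core = frozenset(state)            # the items without look-aheads
        for item, lookaheads in state.items():
            merged[core][item] |= lookaheads
    return [dict(items) for items in merged.values()]

def reduce_reduce_conflicts(state):
    """Look-aheads shared by two different completed items of a state."""
    complete = [las for item, las in state.items() if item.endswith(".")]
    conflicts = set()
    for i, first_set in enumerate(complete):
        for second_set in complete[i + 1:]:
            conflicts |= first_set & second_set
    return conflicts

state_a = {"S -> id .": {"+"}, "S -> E .": {"$"}}
state_b = {"S -> id .": {"$"}, "S -> E .": {"+"}}

merged = merge_by_core([state_a, state_b])
print(merged)
print(reduce_reduce_conflicts(merged[0]))
```

Neither input state has a conflict on its own, but the merged state lets both completed items reduce on + and on $, so merging can introduce a reduce/reduce conflict that LR(1) did not have.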

How are Parsers Written? • Automatic parser generators: yacc, bison • Accept LALR(1) grammar specification • plus: declarations of precedence, associativity • output: LR parser code (inc. parsing table) • Some parser generators accept LL(1) • less powerful

Associativity • Left-associative grammar: S → S + E | E, E → num | ( S ) • Ambiguous alternative: E → E + E | num | ( E ) • What happens if we run the second grammar through LALR construction?

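Conflict! • Grammar: E → E + E | num | ( E ) • The state { E → E + E . [+], E → E . + E [+, $] } has a shift/reduce conflict on + • Input 1+2+3, after reading 1+2 with the second + as look-ahead: • Shift gives 1+(2+3) • Reduce gives (1+2)+3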
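The effect of the two choices can be seen with a toy parser for sums of the form num + num + ... that takes the conflict resolution as a parameter. This is only a sketch: `parse_sum` and its `prefer` argument are invented for illustration, parentheses are not handled, and trees are represented as nested tuples.

```python
def parse_sum(tokens, prefer):
    """Parse a flat sum, resolving the shift/reduce conflict as requested.

    prefer is "shift" or "reduce"; it only matters when E + E is on top of
    the stack and more input remains.
    """
    stack = []
    rest = list(tokens)
    while True:
        can_reduce = len(stack) >= 3 and stack[-2] == "+"
        if can_reduce and (prefer == "reduce" or not rest):
            right = stack.pop()
            stack.pop()                    # the "+"
            left = stack.pop()
            stack.append((left, "+", right))
        elif rest:
            stack.append(rest.pop(0))      # shift
        else:
            break
    return stack[0]

print(parse_sum([1, "+", 2, "+", 3], prefer="shift"))
print(parse_sum([1, "+", 2, "+", 3], prefer="reduce"))
```

Preferring reduce groups to the left and preferring shift groups to the right, which is the choice that precedence and associativity declarations make for a parser generator.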
