
CPSC 388 – Compiler Design and Construction

Presentation Transcript


  1. CPSC 388 – Compiler Design and Construction Implementing a Parser LL(1) and LALR Grammars FBI Noon Dining Hall Vicki Anderson Recruiter

  2. Announcements • PROG 3 out, due Oct 9th • Get started NOW! • HW due Friday • HW6 posted, due next Friday

  3. Parsing using CFGs • General algorithms can parse with an arbitrary CFG in O(n³) time (n is the length of the input) – TOO SLOW • Subclasses of grammars can be parsed in O(n) time • LL(1): one token of look-ahead, does a leftmost derivation, scans the input left to right • LALR(1): one token of look-ahead, does a rightmost derivation in reverse, scans the input left to right; "LA" means "look-ahead" (nothing to do with the number of tokens)

  4. LALR(1) • More general than LL(1) grammars (every LL(1) grammar is an LALR(1) grammar, but not vice versa) • The class of grammars used by java_cup, Bison, and YACC • Parsed bottom-up (start with the tokens at the leaves and build the tree up to the root) • Covered in text sections 4.6-4.7 • For this class you only need to understand the details of LL(1) grammars

  5. LL(1) Grammars – Predictive Parsers • "Build" the parse tree top-down (actually just discover the tree top-down; it is never explicitly built) • Keep track of the work still to be done using a stack • The tokens scanned so far, together with the stack, correspond to the leaves of the incomplete tree • Use a parse table to decide how to parse the input • Rows are non-terminals • Columns are tokens (plus the EOF token) • Cells are the bodies of production rules

  6. Predictive Parser Algorithm
      s.push(EOF)                  // special EOF terminal
      s.push(start)                // start is the start non-terminal
      x = s.peek()
      t = scanner.next_token()
      while (x != EOF):
          if x == t:
              s.pop()
              t = scanner.next_token()
          else if x is a terminal:
              error
          else if table[x][t] == empty:
              error
          else:
              let body = table[x][t]   // body of a production
              output x → body
              s.pop()
              s.push(…)                // push body symbols from right to left
          x = s.peek()
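
The pseudocode above maps almost line-for-line onto real code. Below is a minimal, runnable Python sketch of the same loop, assuming a hard-coded parse table for the balanced-parentheses grammar used on the next slide; the names (parse, table, nonterminals) are illustrative, not from the slides.

      # A minimal runnable sketch (not course code) of the loop above, with a
      # hard-coded parse table for the balanced-parentheses grammar on the next
      # slide: S -> epsilon | ( S ) | [ S ]. All names are illustrative.
      EOF = "EOF"

      table = {
          ("S", "("): ["(", "S", ")"],
          ("S", "["): ["[", "S", "]"],
          ("S", ")"): [],              # S -> epsilon
          ("S", "]"): [],
          ("S", EOF): [],
      }
      nonterminals = {"S"}

      def parse(tokens, start="S"):
          toks = iter(tokens + [EOF])
          stack = [EOF, start]             # s.push(EOF); s.push(start)
          t = next(toks)
          x = stack[-1]
          while x != EOF:
              if x == t:                   # top of stack matches the current token
                  stack.pop()
                  t = next(toks)
              elif x not in nonterminals:  # unmatched terminal on the stack: error
                  raise SyntaxError(f"expected {x}, got {t}")
              elif (x, t) not in table:    # empty table cell: error
                  raise SyntaxError(f"no rule for {x} on {t}")
              else:
                  body = table[(x, t)]
                  print(x, "->", " ".join(body) or "epsilon")  # output the production used
                  stack.pop()
                  stack.extend(reversed(body))   # push body symbols right to left
              x = stack[-1]

      parse(list("([])"))   # prints the productions used to derive "([])"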

  7. Example Parse using algorithm • Consider the language of balanced parentheses and brackets, e.g. ([]) • Input String is “([])EOF” • Grammar: S →ε | ( S ) | [ S ] • Parse Table:

  8. Not All Grammars Are LL(1) • Not all grammars are LL(1): S → ( S ) | [ S ] | ( ) | [ ] • If the input is ( we don't know which rule to use! • Try the input "[[]]" against the LL(1) grammar using the predictive parser; at each step record • the input seen so far • the stack • the action taken

  9. Is a Grammar LL(1)? • Given a grammar, how do you tell whether it is LL(1)? • How do you build its parse table? • If the parse table can be built with at most one entry per cell, the grammar is LL(1)

  10. Non-LL(1) Grammars • A grammar is not LL(1) if it is left-recursive • …or if it is not left-factored • It is sometimes possible to transform a grammar to remove left recursion and to make it left-factored

  11. Left-Recursion • A grammar is recursive if some non-terminal X can derive a string containing itself: X ⇒+ α X β • It is left-recursive if X ⇒+ X β (X can reappear at the left end) • It is right-recursive if X ⇒+ α X (X can reappear at the right end)

  12. Removing Immediate Left-Recursion • Consider the grammar A → A α | β • A is a non-terminal • α is a sequence of terminals and/or non-terminals • β is a sequence of terminals and/or non-terminals not starting with A • Replace the productions with A → β A' and A' → α A' | ε • The two grammars are equivalent (they recognize the same set of input strings); a sketch of this transformation follows below
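
As a rough illustration of this transformation (not code from the course), here is a small Python sketch that removes immediate left recursion from a grammar given as a dict mapping each non-terminal to a list of bodies; the representation and function name are assumptions.

      # Hedged sketch (not course code): remove immediate left recursion
      # A -> A alpha | beta, for a grammar given as a dict from each
      # non-terminal to a list of bodies (a body is a list of symbols).
      def remove_immediate_left_recursion(grammar):
          new_grammar = {}
          for A, bodies in grammar.items():
              alphas = [b[1:] for b in bodies if b and b[0] == A]      # A -> A alpha
              betas  = [b for b in bodies if not b or b[0] != A]       # A -> beta
              if not alphas:
                  new_grammar[A] = bodies                              # nothing to do
                  continue
              A2 = A + "'"                                             # fresh name A'
              new_grammar[A]  = [beta + [A2] for beta in betas]        # A  -> beta A'
              new_grammar[A2] = [alpha + [A2] for alpha in alphas]     # A' -> alpha A'
              new_grammar[A2].append([])                               # A' -> epsilon
          return new_grammar

      # A -> A x | y   becomes   A -> y A',  A' -> x A' | epsilon
      print(remove_immediate_left_recursion({"A": [["A", "x"], ["y"]]}))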

  13. You Try It • Remove the left recursion from the grammar: exp → exp - factor | factor, factor → INTLITERAL | ( exp ) • Construct the parse tree for the input "5-3-2" using both the original grammar and the new grammar • In general, removing left recursion is harder than this; see text section 4.3.3

  14. Left Factored • A grammar is NOT left-factored if some non-terminal has two productions whose bodies share a common prefix, e.g. exp → ( exp ) | ( ) • A top-down predictive parser would not know which production rule to use when it sees the input token "("

  15. Left Factoring • Given a pair of productions A → α β1 | α β2 • α is a sequence of terminals and non-terminals • β1 and β2 are sequences of terminals and non-terminals that do not share a common prefix (either may be ε) • Change the productions to: A → α A' and A' → β1 | β2

  16. Left Factoring Example • So for grammar exp → ( exp ) | ( ) • It becomes exp → ( exp’ exp’ → exp ) | )
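
A similarly rough Python sketch of left factoring, using the same dict-of-bodies representation as the left-recursion sketch above (the function name and representation are assumptions); the usage line reproduces the slide 16 example.

      # Hedged sketch (not course code): left-factor A -> alpha beta1 | alpha beta2,
      # assuming all given bodies share the common prefix.
      def left_factor(A, bodies):
          prefix = []
          for symbols in zip(*bodies):             # walk the bodies column by column
              if len(set(symbols)) == 1:
                  prefix.append(symbols[0])
              else:
                  break
          if not prefix:
              return {A: bodies}                   # nothing to factor
          A2 = A + "'"
          return {
              A:  [prefix + [A2]],                      # A  -> alpha A'
              A2: [b[len(prefix):] for b in bodies],    # A' -> beta1 | beta2 (possibly empty)
          }

      # Slide 16: exp -> ( exp ) | ( )   becomes   exp -> ( exp',  exp' -> exp ) | )
      print(left_factor("exp", [["(", "exp", ")"], ["(", ")"]]))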

  17. You Try It • Remove left recursion and do left factoring for grammar exp → ( exp ) | exp exp | ( )

  18. Building Parse Tables • Recall a parse table • Every row is a non-terminal • Every column is an input token • Every cell contains a production body • If any cell contains more than one production body then grammar is not LL(1) • To build parse table need to have FIRST set and FOLLOW set

  19. FIRST Sets • FIRST(α), where α is some sequence of terminals and non-terminals, is the set of terminals that can begin the strings derivable from α • If α can derive ε, then ε is also in FIRST(α)

  20. FIRST(X) • X is a single terminal, non-terminal, or ε • If X is a terminal: FIRST(X) = {X} • If X is ε: FIRST(X) = {ε} • If X is a non-terminal: look at all production rules with X as the head • For each production rule X → Y1 Y2 … Yn • Put FIRST(Y1) - {ε} into FIRST(X) • If ε is in FIRST(Y1), then put FIRST(Y2) - {ε} into FIRST(X) • If ε is in FIRST(Y2), then put FIRST(Y3) - {ε} into FIRST(X) • etc. • If ε is in FIRST(Yi) for every 1 <= i <= n (the entire right-hand side can derive ε), put ε into FIRST(X)

  21. Example FIRST Sets • Compute FIRST sets for each non-terminal:
      exp    → term exp'               FIRST(exp)    = { INTLITERAL, ( }
      exp'   → - term exp' | ε         FIRST(exp')   = { -, ε }
      term   → factor term'            FIRST(term)   = { INTLITERAL, ( }
      term'  → / factor term' | ε      FIRST(term')  = { /, ε }
      factor → INTLITERAL | ( exp )    FIRST(factor) = { INTLITERAL, ( }

  22. FIRST(α) for any α • α is of the form X1 X2 … Xn, where each Xi is a terminal, non-terminal, or ε • Put FIRST(X1) - {ε} into FIRST(α) • If ε is in FIRST(X1), put FIRST(X2) - {ε} into FIRST(α) • etc. • If ε is in the FIRST set of every Xi, put ε into FIRST(α)

  23. Example FIRST Sets for Rule Bodies
      FIRST( term exp' )      = { INTLITERAL, ( }
      FIRST( - term exp' )    = { - }
      FIRST( ε )              = { ε }
      FIRST( factor term' )   = { INTLITERAL, ( }
      FIRST( / factor term' ) = { / }
      FIRST( ε )              = { ε }
      FIRST( INTLITERAL )     = { INTLITERAL }
      FIRST( ( exp ) )        = { ( }
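
The FIRST rules on slides 20 and 22 can be run as a fixed-point computation. Below is a hedged Python sketch over the slide 21 grammar, using the same dict-of-bodies representation as the earlier sketches ("eps" stands for ε; all names are assumptions).

      # Hedged sketch (not course code): fixed-point computation of FIRST sets
      # for the slide 21 grammar. An empty body [] is an epsilon-production.
      EPS = "eps"

      grammar = {
          "exp":    [["term", "exp'"]],
          "exp'":   [["-", "term", "exp'"], []],
          "term":   [["factor", "term'"]],
          "term'":  [["/", "factor", "term'"], []],
          "factor": [["INTLITERAL"], ["(", "exp", ")"]],
      }

      def first_of_seq(seq, first):
          # FIRST of a sequence of symbols (slide 22): scan left to right,
          # stopping at the first symbol that cannot derive eps.
          out = set()
          for sym in seq:
              sym_first = first[sym] if sym in first else {sym}  # terminal: FIRST(t) = {t}
              out |= sym_first - {EPS}
              if EPS not in sym_first:
                  return out
          out.add(EPS)              # every symbol in the sequence can derive eps
          return out

      def compute_first(grammar):
          # FIRST for every non-terminal (slide 20), iterated until no set grows.
          first = {nt: set() for nt in grammar}
          changed = True
          while changed:
              changed = False
              for nt, bodies in grammar.items():
                  for body in bodies:
                      add = first_of_seq(body, first)
                      if not add <= first[nt]:
                          first[nt] |= add
                          changed = True
          return first

      print(compute_first(grammar))   # e.g. FIRST(exp') == {"-", "eps"}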

  24. Why Do We Care about FIRST(α)? • During parsing, suppose the top-of-stack symbol is nonterminal A, that there are two productions: • A →α • A →β • And that the current token is x • If x is in FIRST(α) then use first production • If x is in FIRST(β) then use second production

  25. FOLLOW(A) Sets • Only defined for single non-terminals A • FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some derivation (it may include EOF but never ε)

  26. Calculating FOLLOW(A) • If A is start non-terminal put EOF in FOLLOW(A) • Find productions with A in body: • For each production X →α A β • put FIRST(β) – {ε} in FOLLOW(A) • If ε in FIRST(β) put FOLLOW(X) into FOLLOW(A) • For each production X →α A • put FOLLOW(X) into FOLLOW(A)
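
A matching Python sketch for FOLLOW, reusing EPS and first_of_seq from the FIRST sketch above (again, names and representation are assumptions, not course code).

      # Hedged sketch (not course code): FOLLOW sets per slide 26.
      EOF = "EOF"

      def compute_follow(grammar, first, start):
          follow = {nt: set() for nt in grammar}
          follow[start].add(EOF)                   # EOF follows the start non-terminal
          changed = True
          while changed:                           # iterate until no FOLLOW set grows
              changed = False
              for X, bodies in grammar.items():
                  for body in bodies:
                      for i, sym in enumerate(body):
                          if sym not in grammar:   # FOLLOW is only defined for non-terminals
                              continue
                          rest = first_of_seq(body[i + 1:], first)  # what can follow sym here
                          add = rest - {EPS}
                          if EPS in rest:          # the tail can vanish, so FOLLOW(X) applies
                              add |= follow[X]
                          if not add <= follow[sym]:
                              follow[sym] |= add
                              changed = True
          return follow

      # e.g. compute_follow(grammar, compute_first(grammar), "exp")["exp"] == {")", "EOF"}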

  27. FIRST and FOLLOW sets • To compute FIRST(A) you must look for A on a production's left-hand side. • To compute FOLLOW(A) you must look for A on a production's right-hand side. • FIRST and FOLLOW sets are always sets of terminals (plus, perhaps, ε for FIRST sets, and EOF for follow sets). • Nonterminals are never in a FIRST or a FOLLOW set.

  28. Example FOLLOW Sets (CAPS are non-terminals, lower-case are terminals)
      S → B c | D B
      B → a b | c S
      D → d | ε

      X   FIRST(X)       FOLLOW(X)
      -------------------------------
      D   { d, ε }       { a, c }
      B   { a, c }       { c, EOF }
      S   { a, c, d }    { EOF, c }

      Note: FOLLOW of S always includes EOF

  29. You Try It • Compute FIRST and FOLLOW sets for:
      methodHeader → VOID ID LPAREN paramList RPAREN
      paramList → ε
      paramList → nonEmptyParamList
      nonEmptyParamList → ID ID
      nonEmptyParamList → ID ID COMMA nonEmptyParamList
      • Remember: you need FIRST and FOLLOW sets for all non-terminals and FIRST sets for all rule bodies

  30. Parse Table (diagram): rows are labeled with non-terminals, columns with the current token, and the cells contain rule bodies

  31. Parse Table Construction Algorithm
      for each production X → α:
          for each terminal t in FIRST(α):
              put α in Table[X, t]
          if ε is in FIRST(α):
              for each terminal t in FOLLOW(X):
                  put α in Table[X, t]
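
Putting the pieces together, here is a hedged Python sketch of this construction, reusing compute_first, compute_follow, first_of_seq, and EPS from the earlier sketches; the duplicate-cell check mirrors slide 18's rule that more than one body in a cell means the grammar is not LL(1).

      # Hedged sketch (not course code): build the LL(1) parse table.
      def build_table(grammar, start):
          first = compute_first(grammar)
          follow = compute_follow(grammar, first, start)
          table = {}
          for X, bodies in grammar.items():
              for body in bodies:
                  f = first_of_seq(body, first)
                  targets = (f - {EPS}) | (follow[X] if EPS in f else set())
                  for t in targets:
                      if (X, t) in table:          # two bodies in one cell => not LL(1)
                          raise ValueError(f"not LL(1): conflict in cell [{X}, {t}]")
                      table[(X, t)] = body
          return table

      # e.g. build_table(grammar, "exp")[("exp'", "-")] == ["-", "term", "exp'"]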

  32. Example Parse Table Construction S → B c | D B B → a b | c S D → d | ε For this grammar: • Construct FIRST and FOLLOW Sets • Apply algorithm to calculate parse table

  33. Example Parse Table Construction
      X   FIRST(X)       FOLLOW(X)
      -------------------------------
      D   { d, ε }       { a, c }
      B   { a, c }       { c, EOF }
      S   { a, c, d }    { EOF, c }

      Rule body   FIRST(body)
      -----------------------
      B c         { a, c }
      D B         { d, a, c }
      a b         { a }
      c S         { c }
      d           { d }
      ε           { ε }

  34. Parse Table Finish Filling In Table
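
As a cross-check, one could feed the slide 32 grammar to the build_table sketch above (the dict encoding below is the same hypothetical representation used throughout). Given the FIRST sets on slide 33, the bodies B c and D B both belong in row S under the a and c columns, so the construction reports a conflict.

      # Slide 32 grammar in the hypothetical dict format used by the sketches above.
      g = {"S": [["B", "c"], ["D", "B"]],
           "B": [["a", "b"], ["c", "S"]],
           "D": [["d"], []]}
      build_table(g, "S")   # raises: cell [S, a] (and [S, c]) would hold both "B c" and "D B"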

  35. Predictive Parser Algorithm
      s.push(EOF)                  // special EOF terminal
      s.push(start)                // start is the start non-terminal
      x = s.peek()
      t = scanner.next_token()
      while (x != EOF):
          if x == t:
              s.pop()
              t = scanner.next_token()
          else if x is a terminal:
              error
          else if table[x][t] == empty:
              error
          else:
              let body = table[x][t]   // body of a production
              output x → body
              s.pop()
              s.push(…)                // push body symbols from right to left
          x = s.peek()
