
Discussion #3 Grammar Formalization & Parse-Tree Construction

Presentation Transcript


  1. Discussion #3 Grammar Formalization & Parse-Tree Construction

  2. Topics • Grammar Definitions • Parse Trees • Constructing Parse Trees

  3. Formal Definition of a Grammar A grammar G is a 4-tuple: G = (VN, VT, S, Φ), where • VN, VT, sets of non-terminal and terminal symbols • S ∈ VN, a start symbol • Φ = a finite set of relations from (VT ∪ VN)+ to (VT ∪ VN)* • an element of Φ, (α, β), is written as α → β and is called a production rule or a rewriting rule
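
To make the 4-tuple concrete, here is a minimal Python sketch (not part of the original slides); the Grammar class, its field names, and the use of single-character symbols are illustrative choices, not a prescribed representation. It encodes the identifier grammar that appears in the later examples.

    # Illustrative sketch: encoding G = (VN, VT, S, Phi) as a Python value,
    # using single-character symbols and (lhs, rhs) string pairs for Phi.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Grammar:
        vn: frozenset    # non-terminal symbols
        vt: frozenset    # terminal symbols
        s: str           # start symbol (must be a member of vn)
        phi: tuple       # production rules as (lhs, rhs) pairs

    letters = "abcdefghijklmnopqrstuvwxyz"
    digits = "0123456789"

    # The identifier grammar from the later examples:
    #   I -> L | ID | IL,   L -> a | ... | z,   D -> 0 | ... | 9
    G = Grammar(
        vn=frozenset("ILD"),
        vt=frozenset(letters + digits),
        s="I",
        phi=tuple([("I", "L"), ("I", "ID"), ("I", "IL")]
                  + [("L", c) for c in letters]
                  + [("D", c) for c in digits]),
    )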

  4. Examples of Grammars

  5. Definition of a Context-Free Grammar • A context-free grammar is a grammar with the following restriction: • The relation Φ is a finite set of relations from VN to (VT ∪ VN)+ • i.e. the left-hand side of a production is a single non-terminal • i.e. the right-hand side of any production cannot be empty • Context-free grammars generate context-free languages. With slight variations, essentially all programming languages are context-free languages.
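
The restriction is easy to state as a check over the productions; the small sketch below is an illustration (not from the slides) and assumes the (lhs, rhs) pair encoding used in the earlier sketch.

    # Illustrative sketch: the context-free restriction as a predicate.
    def is_context_free(vn, phi):
        # every production must have a single non-terminal on the left
        # and a non-empty right-hand side
        return all(len(lhs) == 1 and lhs in vn and len(rhs) >= 1
                   for lhs, rhs in phi)

    phi = [("I", "L"), ("I", "ID"), ("I", "IL"), ("L", "a"), ("D", "0")]
    print(is_context_free({"I", "L", "D"}, phi))                     # True
    print(is_context_free({"I", "L", "D"}, phi + [("aIb", "aLb")]))  # False: left side is not a single non-terminal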

  6. Examples of Grammars (again) Which are context-free grammars?

  7. Backus-Naur Form (BNF) • A traditional meta-language to represent grammars for programming languages • Every non-terminal is enclosed in < and > • Instead of the symbol → we use ::= • Example • I → L | ID | IL • L → a | b | … | z • D → 0 | 1 | … | 9 • BNF: • <I> ::= <L> | <I><D> | <I><L> • <L> ::= a | b | … | z • <D> ::= 0 | 1 | … | 9
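
Since BNF is only a different surface notation for the same productions, a short sketch (an illustration, not from the slides) can print the pair encoding used above as BNF text:

    # Illustrative sketch: printing productions in BNF, with non-terminals
    # wrapped in <...> and alternatives for the same left side joined by |.
    def to_bnf(vn, phi):
        rules = {}
        for lhs, rhs in phi:
            body = "".join(f"<{s}>" if s in vn else s for s in rhs)
            rules.setdefault(lhs, []).append(body)
        return "\n".join(f"<{lhs}> ::= " + " | ".join(bodies)
                         for lhs, bodies in rules.items())

    vn = {"I", "L", "D"}
    phi = [("I", "L"), ("I", "ID"), ("I", "IL"),
           ("L", "a"), ("L", "b"), ("D", "0"), ("D", "1")]
    print(to_bnf(vn, phi))
    # <I> ::= <L> | <I><D> | <I><L>
    # <L> ::= a | b
    # <D> ::= 0 | 1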

  8. Definition: Direct Derivative Let G = (VN, VT, S, Φ) be a grammar and α, β ∈ (VN ∪ VT)*. β is said to be a direct derivative of α (written α ⇒ β) if there are strings φ1 and φ2 (including possibly empty strings) such that α = φ1Bφ2, β = φ1ωφ2, B ∈ VN, and B → ω is a production of G.
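
The definition translates almost directly into code; the sketch below (an illustration using the same single-character encoding as above) lists every direct derivative of a sentential form by rewriting one non-terminal occurrence with one production.

    # Illustrative sketch: all beta with alpha => beta, i.e. alpha = phi1 B phi2
    # and beta = phi1 omega phi2 for some production B -> omega.
    def direct_derivatives(alpha, phi):
        result = []
        for i, symbol in enumerate(alpha):
            for lhs, rhs in phi:
                if symbol == lhs:                         # found B at position i
                    result.append(alpha[:i] + rhs + alpha[i + 1:])
        return result

    phi = [("I", "L"), ("I", "ID"), ("I", "IL"), ("L", "a"), ("D", "1")]
    print(direct_derivatives("ID", phi))
    # ['LD', 'IDD', 'ILD', 'I1']  e.g. ID => IDD by I -> ID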

  9. Example: Direct Derivatives G = (VN, VT, S, Φ), where: VN = {I, L, D} VT = {a, b, …, z, 0, 1, …, 9} S = I Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 }

  10. Definition: Derivation Let G = (VN, VT, S, Φ) be a grammar. A string α produces β (β reduces to α, or β is the derivation of α, written α ⇒+ β) if there are strings φ0, φ1, …, φn (n > 0) such that α = φ0 ⇒ φ1, φ1 ⇒ φ2, …, φn−1 ⇒ φn, φn = β.

  11. Example: Derivation • Let G = (VN, VT, S, Φ), where: VN = {I, L, D} VT = {a, b, …, z, 0, 1, …, 9} S = I Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 } • I produces abc12: I ⇒ ID ⇒ IDD ⇒ ILDD ⇒ ILLDD ⇒ LLLDD ⇒ aLLDD ⇒ abLDD ⇒ abcDD ⇒ abc1D ⇒ abc12
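
As a check (an illustration, not part of the slides), the derivation above can be verified mechanically: each step must be a direct derivative of the previous string under the identifier grammar.

    # Illustrative sketch: verify that each consecutive pair in the derivation
    # I => ID => ... => abc12 is related by exactly one production application.
    letters, digits = "abcdefghijklmnopqrstuvwxyz", "0123456789"
    phi = ([("I", "L"), ("I", "ID"), ("I", "IL")]
           + [("L", c) for c in letters] + [("D", c) for c in digits])

    def is_direct_derivative(alpha, beta, phi):
        return any(alpha[:i] + rhs + alpha[i + 1:] == beta
                   for i in range(len(alpha))
                   for lhs, rhs in phi if alpha[i] == lhs)

    steps = ["I", "ID", "IDD", "ILDD", "ILLDD", "LLLDD",
             "aLLDD", "abLDD", "abcDD", "abc1D", "abc12"]
    print(all(is_direct_derivative(a, b, phi)
              for a, b in zip(steps, steps[1:])))   # True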

  12. Definition: Language • A sentential form is any derivative of the start symbol S. • A language L generated by a grammar G is the set of all sentential forms whose symbols are all terminals; that is, L(G) = {ω | S ⇒+ ω and ω ∈ VT*}

  13. Example: Language • Let G = (VN, VT, S, Φ), where: VN = {I, L, D} VT = {a, b, …, z, 0, 1, …, 9} S = I Φ = { I → L | ID | IL, L → a | b | … | z, D → 0 | 1 | … | 9 } • I produces abc12: I ⇒ ID ⇒ IDD ⇒ ILDD ⇒ ILLDD ⇒ LLLDD ⇒ aLLDD ⇒ abLDD ⇒ abcDD ⇒ abc1D ⇒ abc12 • L(G) = {abc12, x, m934897773645, a1b2c3, …}
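
The set notation can also be read operationally; the sketch below (an illustration with a deliberately trimmed alphabet, not from the slides) enumerates the short members of L(G) by expanding sentential forms and keeping only the all-terminal ones.

    # Illustrative sketch: breadth-first enumeration of sentential forms,
    # collecting those made entirely of terminals (members of L(G)).
    from collections import deque

    vn = {"I", "L", "D"}
    phi = [("I", "L"), ("I", "ID"), ("I", "IL"),
           ("L", "a"), ("L", "b"), ("D", "0"), ("D", "1")]  # trimmed alphabet

    def some_sentences(start, phi, vn, max_len=3):
        seen, queue, sentences = {start}, deque([start]), set()
        while queue:
            alpha = queue.popleft()
            if len(alpha) > max_len:
                continue                       # bound the search
            if all(s not in vn for s in alpha):
                sentences.add(alpha)           # all terminals: alpha is in L(G)
                continue
            for i, symbol in enumerate(alpha):
                for lhs, rhs in phi:
                    if symbol == lhs:
                        beta = alpha[:i] + rhs + alpha[i + 1:]
                        if beta not in seen:
                            seen.add(beta)
                            queue.append(beta)
        return sorted(sentences)

    print(some_sentences("I", phi, vn))
    # ['a', 'a0', 'a00', ...]  identifiers of length <= 3 over {a, b, 0, 1}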

  14. Syntax Analysis: Parsing • The parse of a sentence is the construction of a derivation for that sentence • The parsing of a sentence results in • acceptance or rejection • and, if acceptance, then also a parse tree • We are looking for an algorithm to parse a sentence (i.e. to parse a program) and produce a parse tree.

  15. Parse Trees • A parse tree is composed of • interior nodes representing syntactic categories (non-terminal symbols) • leaf nodes representing terminal symbols • For each interior node N, the transition from N to its children represents the application of a production.
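
A parse tree is easy to represent directly; the sketch below is an illustration (not from the slides) using a Node class with the syntactic category at interior nodes and terminal symbols at the leaves, shown on the parse of "a1" under the identifier grammar.

    # Illustrative sketch: a parse-tree node; interior nodes carry
    # non-terminals, leaves carry terminal symbols.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Node:
        symbol: str                              # syntactic category or terminal
        children: Optional[List["Node"]] = None  # None for a leaf

        def leaves(self):
            if not self.children:
                return self.symbol
            return "".join(child.leaves() for child in self.children)

    # Parse tree for "a1": I -> ID, I -> L, L -> a, D -> 1
    tree = Node("I", [Node("I", [Node("L", [Node("a")])]),
                      Node("D", [Node("1")])])
    print(tree.leaves())   # a1 -- the leaves, read left to right, spell the sentence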

  16. Parse Tree Construction • Top-down • Starts with the root (starting symbol) • Proceeds downward to leaves using productions • Bottom-up • Starts from leaves • Proceeds upward to the root • Although these seem like reasonable approaches to develop a parsing algorithm, we’ll see that neither works well, so we’ll need to find a better way.

  17. Example: Top-Down Parse for 4 * 2 + 3 • VN = {E, D} • VT = {0, 1, …, 9, +, –, *, /, (, )} • S = E • Φ = { E → D | ( E ) | E + E | E – E | E * E | E / E, D → 0 | 1 | … | 9 } • [Parse tree: the root E is expanded by E → E * E; the left E derives 4, and the right E is expanded by E → E + E, deriving 2 + 3] • Problems: • How do we guess which rule applies? • Note that we produced the wrong parse tree (precedence is wrong: 2 + 3 is grouped under the right operand of *)

  18. Ambiguous Grammar: Two Different Parse Trees for 4 * 2 + 3 • Φ = { E → D | ( E ) | E + E | E – E | E * E | E / E, D → 0 | 1 | … | 9 } • [Two parse trees: one expands the root by E → E + E, grouping the input as (4 * 2) + 3; the other expands the root by E → E * E, grouping it as 4 * (2 + 3)]
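
A quick way to see why the ambiguity matters (an illustration, not from the slides) is to evaluate both trees: they give different answers for 4 * 2 + 3, so the grammar by itself does not pin down operator precedence.

    # Illustrative sketch: the two parse trees as nested tuples, evaluated.
    def ev(tree):
        if isinstance(tree, int):
            return tree
        left, op, right = tree
        return ev(left) * ev(right) if op == "*" else ev(left) + ev(right)

    tree1 = ((4, "*", 2), "+", 3)   # grouped as (4 * 2) + 3
    tree2 = (4, "*", (2, "+", 3))   # grouped as 4 * (2 + 3)
    print(ev(tree1), ev(tree2))     # 11 20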

  19. Example: Bottom-Up Parse • A → V | I | (A + A) | (A * A) • V → L | VL | VD • I → D | ID • D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • L → x | y | z • Input: ( ( z * ( x + y ) ) + 1 2 ) • Reducing upward toward the start symbol: ( ( L * ( L + L ) ) + D D ), then ( ( V * ( V + V ) ) + I D ), then ( ( A * ( A + A ) ) + I ), then ( ( A * A ) + A ), then ( A + A ), then A • Problem (I ?? D): should the I and the D be combined (via I → ID) or reduced separately? • Problem: scanning the entire program repeatedly
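
The sketch below (an illustration of the brute-force idea, not the course's algorithm) makes both problems visible: it rescans the whole string on every pass, and with a naive, greedy choice of which reduction to apply it runs straight into the I ?? D problem and gets stuck.

    # Illustrative sketch: naive bottom-up reduction -- repeatedly scan the
    # string for any right-hand side and replace it with its left-hand side.
    phi = [("A", "V"), ("A", "I"), ("A", "(A+A)"), ("A", "(A*A)"),
           ("V", "L"), ("V", "VL"), ("V", "VD"), ("I", "D"), ("I", "ID"),
           ("L", "x"), ("L", "y"), ("L", "z")] + [("D", d) for d in "0123456789"]

    def reduce_once(s):
        for lhs, rhs in phi:            # rescans the entire string every pass
            i = s.find(rhs)
            if i >= 0:
                return s[:i] + lhs + s[i + len(rhs):]
        return None                     # no reduction applies anywhere

    s = "((z*(x+y))+12)"
    while s not in (None, "A"):
        print(s)
        s = reduce_once(s)
    print(s)   # None: the greedy choice reduces the 1 all the way to A before
               # attaching the 2 (the I ?? D problem) and dead-ends at (A+AA)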

  20. So, how do we develop a parsing algorithm? • “Fix” the grammar • So that we can go top down, left to right, with no backup • LL(1) grammar: Left-to-right scan, Left-most non-terminal expanded, one symbol of look-ahead • “Fix” (How?) • Observe grammar properties: determine what’s needed to make them LL(1) • Transform grammars to make them LL(1) • Note: works for many grammars, but not all
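
To make the "fix" concrete, here is a hedged sketch (an illustration, not the course's algorithm) of the usual layered rewrite of the expression grammar, limited to +, * and single digits, together with a top-down, left-to-right parser that needs only one symbol of look-ahead and no backup.

    # Illustrative sketch: a layered (LL(1)-style) expression grammar
    #   E -> T { + T },  T -> F { * F },  F -> digit | ( E )
    # and a recursive-descent parser for it; * now binds tighter than +.
    def parse(tokens):
        pos = 0
        def peek():
            return tokens[pos] if pos < len(tokens) else None
        def eat(expected):
            nonlocal pos
            assert peek() == expected, f"expected {expected!r}, got {peek()!r}"
            pos += 1
        def E():                      # E -> T { + T }
            node = T()
            while peek() == "+":      # one symbol of look-ahead picks the rule
                eat("+")
                node = (node, "+", T())
            return node
        def T():                      # T -> F { * F }
            node = F()
            while peek() == "*":
                eat("*")
                node = (node, "*", F())
            return node
        def F():                      # F -> digit | ( E )
            if peek() == "(":
                eat("(")
                node = E()
                eat(")")
                return node
            digit = peek()
            assert digit is not None and digit.isdigit(), f"unexpected {digit!r}"
            eat(digit)
            return int(digit)
        tree = E()
        assert peek() is None, "trailing input"
        return tree

    print(parse(list("4*2+3")))   # ((4, '*', 2), '+', 3): correct precedence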
