Understanding Syntax and Parsing in Programming Languages

Syntax • Juan Carlos Guzmán • CS 3123 Programming Languages Concepts • Southern Polytechnic State University

What does your DOS computer do when …? • > copy a.txt b.txt • > copy a.txt a.txt • > del *.* • > del *01.* • > type a.txt > null: • > type a.txt > nul:

How do we know the meaning of our commands?

Semiotic • Synthesized from Merriam-Webster (m-w.com) • a general philosophical theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages and comprises: • syntactics • the formal relations between signs or expressions in abstraction from their signification and their interpreters • semantics • the relations between signs and what they refer to • pragmatics • the relation between signs or linguistic expressions and their users

Syntax • Two levels: • The language level, properly known as parsing • The lexeme level, known as lexing • More information about this topic can be found in • Aho, Sethi, Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1988. (on reserve, The Dragon book)

Lexing • Specification of the lexemes of the language • A class of lexemes is known as a token • Tokens are specified in regular expressions: • letter, empty string • concatenation • choice • closure • Many convenient extensions • Recognized by Finite Automata • Limited in power: cannot count, cannot recognize a^n b^n

Sample Regular Expressions • digit ::= (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • ldigit ::= (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • natural ::= ldigit digit* • integer ::= (+ | - | ε) (natural | 0) • How about floating points? • Without exponents • Then add the exponents
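The definitions above can be tried out directly. The sketch below transliterates them into Python's `re` syntax; the two floating-point patterns are only one possible answer to the exercise (an assumption, not the lecture's solution), and the names `FLOAT_NO_EXP`, `FLOAT` and `matches` are invented for this illustration.

```python
import re

# Direct transliteration of the slide's definitions into Python regex syntax.
DIGIT = r"[0-9]"
LDIGIT = r"[1-9]"                      # leading digit: rules out leading zeros
NATURAL = rf"{LDIGIT}{DIGIT}*"
INTEGER = rf"[+-]?(?:{NATURAL}|0)"

# Assumed answer to the exercise: a mandatory fraction, then an optional exponent.
FLOAT_NO_EXP = rf"{INTEGER}\.{DIGIT}+"
FLOAT = rf"{FLOAT_NO_EXP}(?:[eE]{INTEGER})?"


def matches(pattern: str, text: str) -> bool:
    """True when the whole of `text` is a lexeme of the token `pattern`."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for text in ["42", "007", "-0", "3.14", "3.14e-2"]:
        print(text, matches(INTEGER, text), matches(FLOAT, text))
```

Building each pattern from the previous one mirrors how the slide builds `natural` from `ldigit` and `digit`.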

Parsing • Specification of the language structure • The parser • recognizes the phrase, and • reconstructs its structure (parse tree)

Context-Free Grammars • Generate Context-Free Languages • Allow recursion • Are specified as G=(N,T,P,S) where • N is the set of “non-terminals”, or variables • T is the alphabet (the set of terminals) • P is the “production set” • S is the starting symbol for every phrase

CFG (Example) • G1 = ({S,A,B}, {a,b}, P, S) where P = {S→ASB, S→BSA, S→ε, A→a, B→b} • G2 = ({E}, {a,+,*,(,)}, P, E) where P = {E→E+E, E→E*E, E→a, E→(E)}

Grammars (conventions) • The empty string: ε • First uppercase letters of the alphabet (A, B, C, …) => Non-terminal • First lowercase letters of the alphabet (a, b, c, …), or numbers (1, 2, …) => Terminal • First lowercase Greek letters (α, β, γ, …) => string of terminals and non-terminals • Last lowercase letters of the alphabet (t, u, v, …) => string of terminals

Derivation • How do we generate phrases in the language? • By using a derivation: βAγ => βαγ iff A→α ∈ P • E => E+E => E+E*E => a+E*E => a+E*a => a+a*a

The Language Generated • The language generated by the grammar is composed of all strings of terminals that can be derived from S by applying production rules one or more times • Anything derived from S is called a sentential form

Derivations • Leftmost derivation: the leftmost non-terminal is always the one rewritten: E => E*E => E+E*E => a+E*E => a+a*E => a+a*a • Rightmost derivation: the rightmost non-terminal is always the one rewritten: E => E+E => E+E*E => E+E*a => E+a*a => a+a*a

Parse Tree • A structured sequence of derivations • Visually appealing • From previous example (two trees for a+a*a, written with brackets): E[ E[a] + E[ E[a] * E[a] ] ] and E[ E[ E[a] + E[a] ] * E[a] ]

Ambiguous Grammar • Two different parse trees for a single phrase • Just one phrase with two trees is proof of ambiguity • Not ambiguous? All phrases must have only one parse tree! • An ambiguous grammar is quite different from an inherently ambiguous language
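One way to see the ambiguity of G2 concretely is to count parse trees. The sketch below is an illustration only: the function name `count_parse_trees` is invented here, and the counting scheme (try every production on every substring) is a brute-force check that fits this small grammar, not a general parsing method.

```python
from functools import lru_cache


def count_parse_trees(phrase: str) -> int:
    """Count the parse trees of `phrase` under E -> E+E | E*E | a | (E)."""
    toks = tuple(phrase.replace(" ", ""))

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Number of distinct trees deriving toks[i:j] from E.
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and toks[i] == "a":              # E -> a
            total += 1
        if toks[i] == "(" and toks[j - 1] == ")":      # E -> (E)
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                  # E -> E+E | E*E, operator at k
            if toks[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))


if __name__ == "__main__":
    for phrase in ["a", "a+a*a", "(a+a)*a"]:
        print(phrase, count_parse_trees(phrase))
```

Any phrase with a count above one is enough to prove the grammar ambiguous; a count of one for a single phrase proves nothing about the grammar as a whole.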

Grammars vs. Languages • A language is a set • A grammar is a medium by which the set can be formally specified • Many grammars specify the same set

An Expression Grammar • The grammar for expressions presented before was ambiguous • Non-ambiguous, with correct precedence (relative priority given to + and *): E → E + T | T, T → T * F | F, F → a | ( E ) • Parse tree for a+a*a: E[ E[ T[ F[a] ] ] + T[ T[ F[a] ] * F[a] ] ]

Parsing Styles • Top-down: to derive w from S, start from S, derive until w is obtained • Bottom-up: to derive w from S, try doing ‘reverse derivations’ from w until S is obtained

Parsing Styles • Top-down: LL(k) • Easy to implement and understand • hand-coded • table-driven • Limited use, many problems • Bottom-up: LR(k) • More difficult to understand • table driven • A nice trade-off between complexity and generality

An Expression Grammar • G = ({E,T,F},{a,+,*,(,)},P,E) where P = {E→T+E | T, T→F*T | F, F→a | (E)} • Is a+a*a in L(G)? • Parse tree: E[ T[ F[a] ] + E[ T[ F[a] * T[ F[a] ] ] ] ]

A Grammar for a Small Language • program → begin stmt_list end • stmt_list → stmt | stmt ; stmt_list • stmt → var = expression • var → A | B | C • expression → var + var | var - var | var

Predictive Parsing • How many characters of look-ahead are needed to predict the next production to take? • Is this a finite number? • Is it 1?
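For the small language these questions can be explored with a hand-coded predictive parser. The sketch below assumes space-separated tokens and factors the common prefixes of `stmt_list` and `expression` by hand, so that a single token of look-ahead picks each production; the function names are invented for this illustration.

```python
def parse_program(source: str) -> bool:
    """Return True when `source` is a phrase of the small language."""
    tokens = source.split() + ["$"]      # assumption: tokens are space-separated
    pos = 0

    def peek() -> str:
        return tokens[pos]

    def expect(tok: str) -> None:
        nonlocal pos
        if tokens[pos] != tok:
            raise SyntaxError(f"expected {tok!r}, found {tokens[pos]!r}")
        pos += 1

    def var() -> None:                   # var -> A | B | C
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected a variable, found {peek()!r}")
        expect(peek())

    def expression() -> None:            # expression -> var, then optionally (+|-) var
        var()
        if peek() in ("+", "-"):
            expect(peek())
            var()

    def stmt() -> None:                  # stmt -> var = expression
        var()
        expect("=")
        expression()

    def stmt_list() -> None:             # stmt_list -> stmt, then optionally ; stmt_list
        stmt()
        if peek() == ";":
            expect(";")
            stmt_list()

    try:
        expect("begin")
        stmt_list()
        expect("end")
        expect("$")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print(parse_program("begin A = B + C ; B = C end"))
```

Each function looks only at `peek()` before deciding what to do, which is the sense in which one token of look-ahead is enough here.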

Another Expression Grammar • G’ = ({E,E’,T,T’,F},{a,+,*,(,)},P,E) where P = {E→TE’, E’→+TE’ | ε, T→FT’, T’→*FT’ | ε, F→a | (E)} • Is a+a*a in L(G’)? • Parse tree: E[ T[ F[a] T’[ε] ] E’[ + T[ F[a] T’[ * F[a] T’[ε] ] ] E’[ε] ] ]

LL(1) Parsing Table

LL(1) Algorithm
Parse(a1 … an, X1 … Xm) {    // first argument: input; second argument: stack
  if (a1 = $) & (X1 = $)
    accept
  else if X1 is a terminal and (X1 = a1)
    Parse(a2 … an, X2 … Xm)              // match
  else if Table[X1,a1] = X1 → Y1 … Yk
    Parse(a1 … an, Y1 … Yk X2 … Xm)      // derive
  else
    fail
}
• Call initially with Parse(w$, S$), where w is the phrase to parse and S is the starting symbol of the grammar
• ai is a terminal
• Xj, Yk are terminals or nonterminals
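The algorithm can be sketched as a loop in Python. The table contents are not part of the slide text, so the `TABLE` below is an assumption: it was worked out by hand from the first and follow sets of G’. The names `TABLE` and `ll1_parse` are likewise invented for this illustration.

```python
# TABLE[nonterminal][look-ahead] = right-hand side to push (empty list = epsilon).
# Assumption: entries derived by hand for G', not copied from the lecture's table.
TABLE = {
    "E":  {"a": ["T", "E'"], "(": ["T", "E'"]},
    "E'": {"+": ["+", "T", "E'"], ")": [], "$": []},
    "T":  {"a": ["F", "T'"], "(": ["F", "T'"]},
    "T'": {"+": [], "*": ["*", "F", "T'"], ")": [], "$": []},
    "F":  {"a": ["a"], "(": ["(", "E", ")"]},
}


def ll1_parse(phrase: str, start: str = "E") -> list[str]:
    """Return the operations performed, ending in 'accept' or 'fail'."""
    tokens = list(phrase.replace(" ", "")) + ["$"]
    stack = [start, "$"]                      # stack[0] is the top of the stack
    ops = []
    while True:
        a, x = tokens[0], stack[0]
        if a == "$" and x == "$":
            ops.append("accept")
        elif x not in TABLE and x == a:       # terminal on top: match
            tokens, stack = tokens[1:], stack[1:]
            ops.append("match")
            continue
        elif x in TABLE and a in TABLE[x]:    # nonterminal on top: derive
            stack = TABLE[x][a] + stack[1:]
            ops.append("derive")
            continue
        else:
            ops.append("fail")
        return ops


if __name__ == "__main__":
    print(ll1_parse("a+a*a"))
```

The recursion of the slide's `Parse` is a tail call, so replacing it with a loop over the pair (input, stack) does not change the behaviour.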

Parser Operation on a+a*a
#   INPUT         STACK           OPERATION  SENTENTIAL FORM
1   a + a * a $   E $             derive     E $
2   a + a * a $   T E’ $          derive     T E’ $
3   a + a * a $   F T’ E’ $       derive     F T’ E’ $
4   a + a * a $   a T’ E’ $       match      a T’ E’ $
5   + a * a $     T’ E’ $         derive     a T’ E’ $
6   + a * a $     E’ $            derive     a E’ $
7   + a * a $     + T E’ $        match      a + T E’ $
8   a * a $       T E’ $          derive     a + T E’ $
9   a * a $       F T’ E’ $       derive     a + F T’ E’ $
10  a * a $       a T’ E’ $       match      a + a T’ E’ $
11  * a $         T’ E’ $         derive     a + a T’ E’ $
12  * a $         * F T’ E’ $     match      a + a * F T’ E’ $
13  a $           F T’ E’ $       derive     a + a * F T’ E’ $
14  a $           a T’ E’ $       match      a + a * a T’ E’ $
15  $             T’ E’ $         derive     a + a * a T’ E’ $
16  $             E’ $            derive     a + a * a E’ $
17  $             $               accept     a + a * a $

Note how the leftmost derivation of a+a*a is done • Sentential forms in order (repeats from match steps omitted): E $ => T E’ $ => F T’ E’ $ => a T’ E’ $ => a E’ $ => a + T E’ $ => a + F T’ E’ $ => a + a T’ E’ $ => a + a * F T’ E’ $ => a + a * a T’ E’ $ => a + a * a E’ $ => a + a * a $ • Parse tree: E[ T[ F[a] T’[ε] ] E’[ + T[ F[a] T’[ * F[a] T’[ε] ] ] E’[ε] ] ]

What’s the Table Lookup • Note that the predictive nature of the parser guarantees the uniqueness of the entry for Table[A,b] (or no entry at all) • When attempting to derive nonterminal A, the look-ahead b must give the correct rule to apply • This b can be • the initial character of the derivation of A, i.e., A =>* bβ, • or, it can be the initial character of the derivation of what follows A! (A =>* ε)

First Sets • first(α) is the set of one-character prefixes of strings of terminals that can be derived from α • If the empty string can be derived from α, then it will also be in the set • if α =>* aw then a ∈ first(α) • if α =>* ε then ε ∈ first(α)

First Sets (II) • first(ε) = {ε} • first(a) = {a} • first(A) = first(α1) ∪ … ∪ first(αn) if A→α1 ∈ P, …, A→αn ∈ P • first(Xα) = first(X) ⊕ first(α), where X is either terminal or nonterminal and ⊕ is bounded concatenation

Bounded Concatenation • In computing first(Xα), our interest is to obtain one-character prefixes (or ε) • Consider the operation at the char level • ε ⊕ x = x, where x is either ε or a terminal • a ⊕ x = a • Generalize it to work on sets • A ⊕ B = {v ⊕ w | v ∈ A, w ∈ B}, where A & B are sets
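Bounded concatenation and the first-set equations can be sketched in Python as follows. This is an illustration under assumptions: ε is represented by the empty string, the grammar is G’ from the earlier slide encoded as a dictionary, and the names `bconcat`, `first_of_string` and `first_sets` are invented here. The recursive equations are solved by iterating until no set grows.

```python
EPS = ""   # the empty string stands for epsilon

GRAMMAR = {   # G' from the earlier slide; each right-hand side is a tuple of symbols
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("a",), ("(", "E", ")")],
}


def bconcat(left: set, right: set) -> set:
    """Bounded concatenation of two sets of one-character prefixes (or EPS)."""
    # Character level: EPS (+) x = x, and a (+) x = a.
    return {w if v == EPS else v for v in left for w in right}


def first_of_string(symbols, first) -> set:
    """first(X1 ... Xn) as a chain of bounded concatenations; first(epsilon) = {EPS}."""
    result = {EPS}
    for x in symbols:
        result = bconcat(result, first[x] if x in GRAMMAR else {x})
    return result


def first_sets() -> dict:
    """Iterate the recursive equations until no first set grows."""
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


if __name__ == "__main__":
    for nt, symbols in first_sets().items():
        print(nt, sorted(symbols))
```

Starting every set empty and only ever adding elements guarantees that the loop stops, since each set is bounded by the terminals plus ε.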

Computation of First Sets

Follow Sets • Follow(A) is the set of one-character prefixes of strings of terminals that can follow any derivation of A in G • $ ∈ follow(S) • if (B→αAβ) ∈ P, then • first(β) ⊕ follow(B) ⊆ follow(A), where ⊕ is bounded concatenation • The definition of follow usually results in recursive set definitions. In order to solve them, you need to do several iterations on the equations • ε never appears in any follow set • Note: I had promised a closed definition of follow, but it will be unnecessarily complex. JCG.

Computation of Follow Sets

How to Fill In the Table (Predict) • For each production (A→α) ∈ P let X = first(α) ⊕ follow(A), where ⊕ is bounded concatenation; then for all x ∈ X, add A→α to Table[A,x] • After processing all productions, each cell of the table must have, at most, one production • if not, your grammar is not LL(1) (nice try!)
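The fill-in rule, together with the iterative solution of the first and follow equations, can be sketched for G’ as below. Everything here is an illustrative assumption rather than the lecture's code: ε is the empty string, the grammar is encoded as a dictionary, and the function names are invented.

```python
EPS, END = "", "$"
START = "E"
GRAMMAR = {   # G' from the earlier slide; () is the epsilon production
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("a",), ("(", "E", ")")],
}


def bconcat(left, right):
    """Bounded concatenation: EPS (+) x = x, a (+) x = a, lifted to sets."""
    return {w if v == EPS else v for v in left for w in right}


def first_of(symbols, first):
    """first of a string of symbols, as a chain of bounded concatenations."""
    result = {EPS}
    for x in symbols:
        result = bconcat(result, first[x] if x in GRAMMAR else {x})
    return result


def first_and_follow():
    """Solve both sets of recursive equations by iterating until nothing grows."""
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add(END)                            # $ is in follow(S)
    changed = True
    while changed:
        changed = False
        for b, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                updates = [(first[b], first_of(rhs, first))]
                for i, a in enumerate(rhs):           # production B -> alpha A beta
                    if a in GRAMMAR:
                        beta = first_of(rhs[i + 1:], first)
                        updates.append((follow[a], bconcat(beta, follow[b])))
                for target, new in updates:
                    if not new <= target:
                        target |= new                 # in-place update of the set
                        changed = True
    return first, follow


def predict_table():
    """Table[A, x] collects every production A -> alpha with x in first(alpha) (+) follow(A)."""
    first, follow = first_and_follow()
    table = {}
    for a, alternatives in GRAMMAR.items():
        for rhs in alternatives:
            for x in bconcat(first_of(rhs, first), follow[a]):
                table.setdefault((a, x), []).append(rhs)
    return table


def is_ll1(table):
    """The grammar is LL(1) when no cell holds more than one production."""
    return all(len(entries) == 1 for entries in table.values())


if __name__ == "__main__":
    for cell, entries in sorted(predict_table().items()):
        print(cell, entries)
```

Storing a list per cell, instead of overwriting, is what makes a conflict visible: a cell with two productions is the evidence that the grammar is not LL(1).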

First & Follow Sets

Predict

Yet Another Expression Grammar (it’s in the book!) • G = ({E,T,F},{a,+,*,(,)},P,E) where P = { (1) E→E+T, (2) E→T, (3) T→T*F, (4) T→F, (5) F→(E), (6) F→a } • Is a+a*a in L(G)? • Parse tree: E[ E[ T[ F[a] ] ] + T[ T[ F[a] ] * F[a] ] ]

LR(1) Parsing Table • Sn: shift to state n • Rn: reduce according to production n

LR(1) Algorithm
Parse(S0 X1 S1 X2 S2 … Xr Sr … Xm Sm, a1 … an) {    // first argument: stack; second argument: input
  if Action[Sm,a1] == Shift S
    Parse(S0 X1 S1 X2 S2 … Xm Sm a1 S, a2 … an)
  else if Action[Sm,a1] == Reduce A → Xr+1 … Xm and GOTO[Sr,A] == S
    Parse(S0 X1 S1 X2 S2 … Xr Sr A S, a1 … an)
  else if Action[Sm,a1] == Accept
    accept
  else if Action[Sm,a1] == Error
    error
}
• Call initially with Parse(S0, w$), where w is the phrase to parse and S0 is the initial state of the table
• ai is a terminal
• Xj, Yk are terminals or nonterminals
• Si is a “state”
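The driver can be sketched as a loop in Python. The parsing table itself is not part of the slide text, so the `ACTION` and `GOTO` entries below are an assumption: they follow the SLR table given for this grammar in Aho, Sethi and Ullman, with the terminal `id` written as `a`. The names `PRODUCTIONS`, `ACTION`, `GOTO` and `lr_parse` are invented for this illustration.

```python
# production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# Assumed table (textbook SLR table for this grammar); missing entries mean error.
ACTION = {
    0: {"a": "S5", "(": "S4"},
    1: {"+": "S6", "$": "accept"},
    2: {"+": "R2", "*": "S7", ")": "R2", "$": "R2"},
    3: {"+": "R4", "*": "R4", ")": "R4", "$": "R4"},
    4: {"a": "S5", "(": "S4"},
    5: {"+": "R6", "*": "R6", ")": "R6", "$": "R6"},
    6: {"a": "S5", "(": "S4"},
    7: {"a": "S5", "(": "S4"},
    8: {"+": "S6", ")": "S11"},
    9: {"+": "R1", "*": "S7", ")": "R1", "$": "R1"},
    10: {"+": "R3", "*": "R3", ")": "R3", "$": "R3"},
    11: {"+": "R5", "*": "R5", ")": "R5", "$": "R5"},
}
GOTO = {
    (0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
    (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
    (6, "T"): 9, (6, "F"): 3,
    (7, "F"): 10,
}


def lr_parse(phrase: str) -> list[str]:
    """Return the actions taken, ending in 'accept' or 'error'."""
    tokens = list(phrase.replace(" ", "")) + ["$"]
    stack = [0]                           # S0 X1 S1 ... Xm Sm, with the top at the end
    ops = []
    while True:
        action = ACTION[stack[-1]].get(tokens[0], "error")
        ops.append(action)
        if action in ("accept", "error"):
            return ops
        n = int(action[1:])
        if action[0] == "S":              # shift: push the terminal and state n
            stack += [tokens.pop(0), n]
        else:                             # reduce by production n
            lhs, length = PRODUCTIONS[n]
            del stack[len(stack) - 2 * length:]       # pop `length` symbol/state pairs
            stack += [lhs, GOTO[(stack[-1], lhs)]]    # goto from the exposed state


if __name__ == "__main__":
    print(lr_parse("a+a*a"))
```

On a reduction the goto is taken from the state exposed after popping, which is why the stack keeps states interleaved with symbols.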

Parser Operation on a+a*a
#   STACK                     INPUT         OPERATION     SENTENTIAL FORM
1   0                         a + a * a $   S 5           a + a * a $
2   0 a 5                     + a * a $     R 6, G[0,F]   a + a * a $
3   0 F 3                     + a * a $     R 4, G[0,T]   F + a * a $
4   0 T 2                     + a * a $     R 2, G[0,E]   T + a * a $
5   0 E 1                     + a * a $     S 6           E + a * a $
6   0 E 1 + 6                 a * a $       S 5           E + a * a $
7   0 E 1 + 6 a 5             * a $         R 6, G[6,F]   E + a * a $
8   0 E 1 + 6 F 3             * a $         R 4, G[6,T]   E + F * a $
9   0 E 1 + 6 T 9             * a $         S 7           E + T * a $
10  0 E 1 + 6 T 9 * 7         a $           S 5           E + T * a $
11  0 E 1 + 6 T 9 * 7 a 5     $             R 6, G[7,F]   E + T * a $
12  0 E 1 + 6 T 9 * 7 F 10    $             R 3, G[6,T]   E + T * F $
13  0 E 1 + 6 T 9             $             R 1, G[0,E]   E + T $
14  0 E 1                     $             accept        E $

Note how the rightmost derivation of a+a*a is done • Sentential forms (the parser trace read bottom-up, repeats from shift steps omitted): E $ => E + T $ => E + T * F $ => E + T * a $ => E + F * a $ => E + a * a $ => T + a * a $ => F + a * a $ => a + a * a $ • Parse tree: E[ E[ T[ F[a] ] ] + T[ T[ F[a] ] * F[a] ] ]
