Understanding Syntax and Parsing in Programming Languages
This text explores the fundamental concepts of syntax, semantics, and pragmatics in programming languages, drawing from semiotics and compiler design principles. It discusses the two levels of syntax: parsing and lexing, alongside the role of context-free grammars in language generation. Moreover, it delves into the significance of derivation techniques (leftmost and rightmost) and parsing styles (top-down and bottom-up). By examining various grammar examples and parsing strategies, it provides insights into the structure and behavior of programming languages.
Syntax Juan Carlos Guzmán CS 3123 Programming Languages Concepts Southern Polytechnic State University
What does your DOS computer do when …? • > copy a.txt b.txt • > copy a.txt a.txt • > del *.* • > del *01.* • > type a.txt > null: • > type a.txt > nul:
Semiotic • Synthesized from Merriam-Webster (m-w.com) • a general philosophical theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages and comprises: • syntactics • the formal relations between signs or expressions in abstraction from their signification and their interpreters • semantics • the relations between signs and what they refer to • pragmatics • the relation between signs or linguistic expressions and their users
Syntax • Two levels: • The language level, properly known as parsing • The lexeme level, known as lexing • More information about this topic can be found in • Aho, Sethi, Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1988. (on reserve, The Dragon book)
Lexing • Specification of the lexemes of the language • A class of lexemes is known as a token • Tokens are specified in regular expressions: • letter, empty string • concatenation • choice • closure • Many convenient extensions • Recognized by Finite Automata • Limited in Power: cannot count, cannot recognize aⁿbⁿ
Sample Regular Expressions • digit ::= (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • ldigit ::= (1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) • natural ::= ldigit digit* • integer ::= (+ | - | ε) (natural | 0) • How about floating points? • Without exponents • Then add the exponents
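The definitions on this slide translate almost directly into Python's `re` syntax. The sketch below is only an illustration: the constant names and the helper `is_integer` are made up here, and the optional sign is written `[+-]?` in place of the explicit empty alternative.

```python
import re

# Each constant mirrors one definition from the slide.
DIGIT = r"[0-9]"                       # digit   ::= (0 | 1 | ... | 9)
LDIGIT = r"[1-9]"                      # ldigit  ::= (1 | 2 | ... | 9)
NATURAL = rf"{LDIGIT}{DIGIT}*"         # natural ::= ldigit digit*
INTEGER = rf"[+-]?(?:{NATURAL}|0)"     # integer ::= (+ | - | empty) (natural | 0)


def is_integer(text: str) -> bool:
    """Return True when the whole string is a lexeme of the integer token."""
    return re.fullmatch(INTEGER, text) is not None
```

Under these definitions `is_integer("+42")` and `is_integer("0")` hold, while `is_integer("007")` does not, because a natural must start with a non-zero digit.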
Parsing • Specification of the language structure • The parser • recognizes the phrase, and • reconstructs its structure (parse tree)
Context-Free Grammars • Generate Context-Free Languages • Allow recursion • Are specified as G=(N,T,P,S) where • N is the set of “non-terminals”, or variables • T is the alphabet • P the “production set” • S the starting symbol for every phrase
CFG (Example) • G1 = ({S,A,B}, {a,b}, P, S) where P = {S → ASB, S → BSA, S → ε, A → a, B → b} • G2 = ({E}, {a,+,*,(,)}, P, E) where P = {E → E+E, E → E*E, E → a, E → (E)}
Grammars (conventions) • The empty string: ε • First uppercase letters of the alphabet (A, B, C, …) => Non-terminal • First lowercase letters of the alphabet (a, b, c, …), or numbers (1, 2, …) => Terminal • First lowercase Greek letters (α, β, γ, …) => string of terminals and non-terminals • Last lowercase letters of the alphabet (t, u, v, …) => string of terminals
Derivation • How do we generate phrases in the language? • By using a derivation: αAβ => αγβ iff (A → γ) ∈ P • E => E+E => E+E*E => a+E*E => a+E*a => a+a*a
The Language Generated • The language generated by the grammar is composed of all strings of terminals that can be derived from S by applying production rules one or more times • Anything derived from S is called a sentential form
Derivations • Leftmost derivation: the leftmost non-terminal is always the one expanded: E => E*E => E+E*E => a+E*E => a+a*E => a+a*a • Rightmost derivation: the rightmost non-terminal is always the one expanded: E => E+E => E+E*E => E+E*a => E+a*a => a+a*a
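The two derivations on this slide can be replayed mechanically. The sketch below assumes the single-nonterminal grammar G2 and represents a sentential form as a plain string; the function name `derive` and its parameters are hypothetical, chosen only for this illustration.

```python
def derive(rhs_sequence, leftmost=True, nonterminal="E"):
    """Replay a derivation: each step rewrites the leftmost (or rightmost)
    occurrence of the nonterminal with the next right-hand side."""
    form = nonterminal
    forms = [form]
    for rhs in rhs_sequence:
        index = form.find(nonterminal) if leftmost else form.rfind(nonterminal)
        if index < 0:
            raise ValueError("no nonterminal left to rewrite")
        form = form[:index] + rhs + form[index + len(nonterminal):]
        forms.append(form)
    return forms


# Right-hand sides applied, in order, in the two derivations from the slide.
LEFTMOST = derive(["E*E", "E+E", "a", "a", "a"], leftmost=True)
RIGHTMOST = derive(["E+E", "E*E", "a", "a", "a"], leftmost=False)
```

Both lists end in `a+a*a`; only the order in which nonterminals are rewritten differs.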
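Parse Tree • A structured sequence of derivations • Visually appealing • From previous example: [Figure: two parse trees for a+a*a, one rooted at E → E+E whose right operand expands to E*E, the other rooted at E → E*E whose left operand expands to E+E]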
Ambiguous Grammar • Two different parse trees for a single phrase • Just one phrase with two trees is proof of ambiguity • Not ambiguous? All phrases must have only one parse tree! • An ambiguous grammar is quite different from an inherently ambiguous language
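One way to see the ambiguity of the earlier grammar G2 (E → E+E, E → E*E, E → a, E → (E)) is to count the parse trees of a phrase. The sketch below is written specifically for G2 and is not a general algorithm; the function name `count_parse_trees` is an assumption of this example.

```python
from functools import lru_cache


def count_parse_trees(phrase: str) -> int:
    """Count the parse trees of `phrase` under E -> E+E | E*E | a | (E)."""

    @lru_cache(maxsize=None)
    def count(start: int, end: int) -> int:
        total = 0
        # E -> a
        if end - start == 1 and phrase[start] == "a":
            total += 1
        # E -> ( E )
        if end - start >= 3 and phrase[start] == "(" and phrase[end - 1] == ")":
            total += count(start + 1, end - 1)
        # E -> E + E and E -> E * E, trying every operator as the root
        for k in range(start + 1, end - 1):
            if phrase[k] in "+*":
                total += count(start, k) * count(k + 1, end)
        return total

    return count(0, len(phrase))
```

Here `count_parse_trees("a+a*a")` gives 2, which is enough to show that G2 is ambiguous; a count of 0 means the phrase is not in the language.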
Grammars vs. Languages • A language is a set • A grammar is a medium by which the set can be formally specified • Many grammars specify the same set
An Expression Grammar • The grammar for expressions presented before was ambiguous • Non-ambiguous, with correct precedence (relative priority given to + and *): • E → E + T | T • T → T * F | F • F → a | ( E ) • [Figure: parse tree for a + a * a, rooted at E → E + T, with the right-hand T expanding to T * F]
Parsing Styles • Top-down: to derive w from S, start from S, derive until w is obtained • Bottom-up: to derive w from S, try doing ‘reverse derivations’ from w until S is obtained
Parsing Styles • Top-down: LL(k) • Easy to implement and understand • hand-coded • table-driven • Limited use, many problems • Bottom-up: LR(k) • More difficult to understand • table driven • A nice trade-off between complexity and generality
An Expression Grammar G = ({E,T,F},{a,+,*,(,)},P,E) where P = {E → T+E | T, T → F*T | F, F → a | (E)} • Is a+a*a in L(G)? • [Figure: parse tree for a+a*a under G]
A Grammar for a Small Language • program → begin stmt_list end • stmt_list → stmt | stmt ; stmt_list • stmt → var = expression • var → A | B | C • expression → var + var | var - var | var
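A hand-coded top-down recognizer for this small language can be sketched with one function per nonterminal. The code below assumes tokens are separated by whitespace, and it decides between alternatives that share a prefix (for example stmt versus stmt ; stmt_list) by peeking at the next token after the shared part. All function names are hypothetical.

```python
def parse_program(source: str) -> bool:
    """Recognize: program -> begin stmt_list end (tokens split on whitespace)."""
    tokens = source.split() + ["$"]   # "$" marks the end of the input
    pos = 0

    def peek() -> str:
        return tokens[pos]

    def expect(token: str) -> None:
        nonlocal pos
        if tokens[pos] != token:
            raise SyntaxError(f"expected {token!r}, found {tokens[pos]!r}")
        pos += 1

    def var() -> None:                # var -> A | B | C
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected a variable, found {peek()!r}")
        expect(peek())

    def expression() -> None:         # expression -> var + var | var - var | var
        var()
        if peek() in ("+", "-"):
            expect(peek())
            var()

    def stmt() -> None:               # stmt -> var = expression
        var()
        expect("=")
        expression()

    def stmt_list() -> None:          # stmt_list -> stmt | stmt ; stmt_list
        stmt()
        if peek() == ";":
            expect(";")
            stmt_list()

    try:
        expect("begin")
        stmt_list()
        expect("end")
        expect("$")
    except SyntaxError:
        return False
    return True
```

For instance, `parse_program("begin A = B + C ; B = C end")` is accepted, while a trailing `;` before `end` is rejected because stmt_list requires another statement after the separator.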
Predictive Parsing • How many characters of look-ahead are needed to predict the next production to take? • Is this a finite number? • Is it 1?
Another Expression Grammar G’ = ({E,E’,T,T’,F},{a,+,*,(,)},P,E) where P = {E → T E’, E’ → + T E’ | ε, T → F T’, T’ → * F T’ | ε, F → a | (E)} • Is a+a*a in L(G’)? • [Figure: parse tree for a+a*a under G’]
LL(1) Algorithm (first argument: input; second argument: stack)
Parse(a1 … an, X1 … Xm) {
  if (a1 = $) & (X1 = $) accept
  else if X1 is a terminal and (X1 = a1) Parse(a2 … an, X2 … Xm) // match
  else if Table[X1,a1] = X1 → Y1 … Yk Parse(a1 … an, Y1 … Yk X2 … Xm) // derive
  else fail
}
• Call initially with Parse(w$, S$), where w is the phrase to parse and S is the starting symbol of the grammar • ai is a terminal • Xj, Yk are terminals or nonterminals
Parser Operation on a+a*a
#   INPUT        STACK         OPERATION  Sentential Form
1   a + a * a $  E $           derive     E $
2   a + a * a $  T E’ $        derive     T E’ $
3   a + a * a $  F T’ E’ $     derive     F T’ E’ $
4   a + a * a $  a T’ E’ $     match      a T’ E’ $
5   + a * a $    T’ E’ $       derive     a T’ E’ $
6   + a * a $    E’ $          derive     a E’ $
7   + a * a $    + T E’ $      match      a + T E’ $
8   a * a $      T E’ $        derive     a + T E’ $
9   a * a $      F T’ E’ $     derive     a + F T’ E’ $
10  a * a $      a T’ E’ $     match      a + a T’ E’ $
11  * a $        T’ E’ $       derive     a + a T’ E’ $
12  * a $        * F T’ E’ $   match      a + a * F T’ E’ $
13  a $          F T’ E’ $     derive     a + a * F T’ E’ $
14  a $          a T’ E’ $     match      a + a * a T’ E’ $
15  $            T’ E’ $       derive     a + a * a T’ E’ $
16  $            E’ $          derive     a + a * a E’ $
17  $            $             accept     a + a * a $
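The recursive Parse routine can also be written as a loop over an explicit stack. The sketch below hard-codes a prediction table for G’ that was worked out by hand for this example (the slides do not list it), treats every character of the input as one terminal, and returns the sequence of operations performed. The names `TABLE`, `NONTERMINALS`, and `ll1_parse` are assumptions.

```python
# Prediction table for G', derived by hand for this sketch.
# An empty tuple stands for the empty right-hand side.
TABLE = {
    ("E", "a"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "a"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "*"): ("*", "F", "T'"),
    ("T'", "+"): (), ("T'", ")"): (), ("T'", "$"): (),
    ("F", "a"): ("a",), ("F", "("): ("(", "E", ")"),
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def ll1_parse(word: str, start: str = "E") -> list:
    """Parse `word` and return the list of operations performed."""
    tokens = list(word) + ["$"]
    stack = ["$", start]              # top of the stack is the last element
    pos = 0
    operations = []
    while True:
        top, look = stack[-1], tokens[pos]
        if top == "$" and look == "$":
            operations.append("accept")
            return operations
        if top not in NONTERMINALS and top == look:
            stack.pop()               # match: consume one terminal
            pos += 1
            operations.append("match")
        elif (top, look) in TABLE:
            stack.pop()               # derive: replace top by the right-hand side
            stack.extend(reversed(TABLE[(top, look)]))
            operations.append("derive")
        else:
            raise SyntaxError(f"unexpected {look!r} with {top!r} on the stack")
```

Calling `ll1_parse("a+a*a")` yields 17 operations, matching the OPERATION column of the trace; an input such as `a+` raises `SyntaxError`.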
Note how the leftmost derivation of a+a*a is done • Sentential forms, in order: E $ => T E’ $ => F T’ E’ $ => a T’ E’ $ => a E’ $ => a + T E’ $ => a + F T’ E’ $ => a + a T’ E’ $ => a + a * F T’ E’ $ => a + a * a T’ E’ $ => a + a * a E’ $ => a + a * a $ • [Figure: parse tree for a+a*a under G’]
What’s the Table Lookup • Note that the predictive nature of the parser guarantees the uniqueness of the entry for Table[A,b] (or no entry at all) • When attempting to derive nonterminal A, the look-ahead b must give the correct rule to apply • This b can be • the initial character of the derivation of A, i.e., A ⇒* bβ, • or, it can be the initial character of the derivation of what follows A! (A ⇒* ε)
First Sets • first(α) is the set of one-character prefixes of strings of terminals that can be derived from α • If the empty string can be derived from α, then it will also be in the set • if α ⇒* aw then a ∈ first(α) • if α ⇒* ε then ε ∈ first(α)
First Sets (II) • first(ε) = {ε} • first(a) = {a} • first(A) = first(α1) ∪ … ∪ first(αn) if (A → α1) ∈ P, …, (A → αn) ∈ P • first(Xα) = first(X) ⊕ first(α), where X is either terminal or nonterminal and ⊕ is bounded concatenation
Bounded Concatenation • In computing first(Xα), our interest is to obtain one-character prefixes (or ε) • Consider the operation ⊕ at the char level • ε ⊕ x = x, where x is either ε or a terminal • a ⊕ x = a • Generalize it to work on sets • A ⊕ B = {v ⊕ w | v ∈ A, w ∈ B}, where A & B are sets
Follow Sets • follow(A) is the set of one-character prefixes of strings of terminals that can follow any derivation of A in G • $ ∈ follow(S) • if (B → αAβ) ∈ P, then • first(β) ⊕ follow(B) ⊆ follow(A), where ⊕ is bounded concatenation • The definition of follow usually results in recursive set definitions. In order to solve them, you need to do several iterations on the equations • ε never appears in any follow set • Note: I had promised a closed definition of follow, but it will be unnecessarily complex. JCG.
How to Fill In the Table (Predict) • For each production (A → α) ∈ P, let X = first(α) ⊕ follow(A), where ⊕ is bounded concatenation; then for all x ∈ X, (A → α) ∈ Table[A,x] • After processing all productions, each cell of the table must have, at most, one production • if not, your grammar is not LL(1) (nice try!)
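The first, follow, and table-filling rules can be computed by iterating until nothing changes. The sketch below does this for the grammar G’ from the earlier slide; the dictionary encoding of the grammar and all function names are assumptions of this example, and bounded concatenation is folded into the helper `first_of`.

```python
EPS = "ε"
# G' with each right-hand side as a list of symbols; [] is the empty string.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["a"], ["(", "E", ")"]],
}


def first_of(symbols, first):
    """first() of a symbol string: keep one-character prefixes, moving on to
    the next symbol only while the current one can derive the empty string."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in GRAMMAR else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                    # iterate the equations to a fixed point
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first, start="E"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in GRAMMAR:
                        continue
                    tail = first_of(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:   # what follows lhs can also follow symbol
                        new |= follow[lhs]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


def build_table(first, follow):
    table = {}
    for lhs, alternatives in GRAMMAR.items():
        for rhs in alternatives:
            head = first_of(rhs, first)
            predict = head - {EPS}
            if EPS in head:
                predict |= follow[lhs]
            for terminal in predict:
                if (lhs, terminal) in table:
                    raise ValueError("grammar is not LL(1)")
                table[(lhs, terminal)] = rhs
    return table
```

Running `build_table(compute_first(), compute_follow(compute_first()))` on G’ fills 13 cells without a conflict, which is consistent with G’ being LL(1).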
Yet Another Expression Grammar (it’s in the book!) G = ({E,T,F},{a,+,*,(,)},P,E) where P = {E → E+T, E → T, T → T*F, T → F, F → (E), F → a} • Is a+a*a in L(G)? • [Figure: parse tree for a+a*a under G]
LR(1) Parsing Table • Sn: shift to state n • Rn: reduce according to production n
LR(1) Algorithm (first argument: stack; second argument: input)
Parse(S0 X1 S1 X2 S2 … Xr Sr … Xm Sm, a1 … an) {
  if Action[Sm,a1] == Shift S Parse(S0 X1 S1 X2 S2 … Xm Sm a1 S, a2 … an)
  else if Action[Sm,a1] == Reduce A → Xr+1 … Xm and GOTO[Sr,A] == S Parse(S0 X1 S1 X2 S2 … Xr Sr A S, a1 … an)
  else if Action[Sm,a1] == Accept accept
  else if Action[Sm,a1] == Error error
}
• Call initially with Parse(S0, w$), where w is the phrase to parse and S0 is the initial state of the table • ai is a terminal • Xj is a terminal or nonterminal • Si is a “state”
Parser Operation on a+a*a
#   STACK                   INPUT        OPERATION     Sentential Form
1   0                       a + a * a $  S 5           a + a * a $
2   0 a 5                   + a * a $    R 6, G[0,F]   a + a * a $
3   0 F 3                   + a * a $    R 4, G[0,T]   F + a * a $
4   0 T 2                   + a * a $    R 2, G[0,E]   T + a * a $
5   0 E 1                   + a * a $    S 6           E + a * a $
6   0 E 1 + 6               a * a $      S 5           E + a * a $
7   0 E 1 + 6 a 5           * a $        R 6, G[6,F]   E + a * a $
8   0 E 1 + 6 F 3           * a $        R 4, G[6,T]   E + F * a $
9   0 E 1 + 6 T 9           * a $        S 7           E + T * a $
10  0 E 1 + 6 T 9 * 7       a $          S 5           E + T * a $
11  0 E 1 + 6 T 9 * 7 a 5   $            R 6, G[7,F]   E + T * a $
12  0 E 1 + 6 T 9 * 7 F 10  $            R 3, G[6,T]   E + T * F $
13  0 E 1 + 6 T 9           $            R 1, G[0,E]   E + T $
14  0 E 1                   $            accept        E $
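A table-driven driver for the shift/reduce algorithm can be sketched as follows. The parsing table itself did not survive in this transcript, so the ACTION and GOTO entries below are an assumption: they are the 12-state table commonly given for this grammar in compiler textbooks, written out by hand, with productions numbered in the order they appear in P. Only the state stack is kept, since the grammar symbols are implied by the states. All names are hypothetical.

```python
# Left-hand side and right-hand-side length of each numbered production.
PRODUCTIONS = {
    1: ("E", 3),  # E -> E + T
    2: ("E", 1),  # E -> T
    3: ("T", 3),  # T -> T * F
    4: ("T", 1),  # T -> F
    5: ("F", 3),  # F -> ( E )
    6: ("F", 1),  # F -> a
}


def reduce_row(production, lookaheads="+*)$"):
    """Table row that reduces by `production` on each listed look-ahead."""
    return {terminal: ("r", production) for terminal in lookaheads}


SHIFT_OPERAND = {"a": ("s", 5), "(": ("s", 4)}

# Assumed table (not taken from the slides); see the note above.
ACTION = {
    0: SHIFT_OPERAND,
    1: {"+": ("s", 6), "$": ("acc", 0)},
    2: {**reduce_row(2, "+)$"), "*": ("s", 7)},
    3: reduce_row(4),
    4: SHIFT_OPERAND,
    5: reduce_row(6),
    6: SHIFT_OPERAND,
    7: SHIFT_OPERAND,
    8: {"+": ("s", 6), ")": ("s", 11)},
    9: {**reduce_row(1, "+)$"), "*": ("s", 7)},
    10: reduce_row(3),
    11: reduce_row(5),
}
GOTO = {
    (0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
    (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
    (6, "T"): 9, (6, "F"): 3,
    (7, "F"): 10,
}


def lr_parse(word: str) -> list:
    """Parse `word`, logging operations as 'S n', 'R n, G[state,A]', 'accept'."""
    tokens = list(word) + ["$"]
    states = [0]
    pos = 0
    operations = []
    while True:
        action = ACTION[states[-1]].get(tokens[pos])
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {states[-1]}")
        kind, number = action
        if kind == "s":                       # shift: push state, consume input
            states.append(number)
            pos += 1
            operations.append(f"S {number}")
        elif kind == "r":                     # reduce: pop |rhs| states, then goto
            lhs, size = PRODUCTIONS[number]
            del states[-size:]
            operations.append(f"R {number}, G[{states[-1]},{lhs}]")
            states.append(GOTO[(states[-1], lhs)])
        else:
            operations.append("accept")
            return operations
```

`lr_parse("a+a*a")` logs 14 operations in the notation of the trace above, ending in `accept`; an incomplete input such as `a+` raises `SyntaxError`.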
Note how the rightmost derivation of a+a*a is done • Reading the parser's sentential forms in reverse order: E $ => E + T $ => E + T * F $ => E + T * a $ => E + F * a $ => E + a * a $ => T + a * a $ => F + a * a $ => a + a * a $ • [Figure: parse tree for a+a*a under G]