
Chapter 4 Syntactic Analysis



Presentation Transcript


  1. Chapter 4: Syntactic Analysis • Fall 2013

  2. Syntactic Analysis • Sub-phases of Syntactic Analysis • Grammars Revisited • Parsing • Abstract Syntax Trees • Scanning • Case Study: Syntactic Analysis in the Triangle Compiler

  3. Structure of a Compiler • Source code → Lexical Analyzer → tokens → Parser & Semantic Analyzer → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code • Symbol Table (shown beside the pipeline)

  4. Syntactic Analysis • Main function • Parse source program to discover its phrase structure • Recursive-descent parsing • Constructing an AST • Scanning to group characters into tokens

  5. Sub-phases of Syntactic Analysis • Scanning (or lexical analysis) • Source program transformed to a stream of tokens • Identifiers • Literals • Operators • Keywords • Punctuation • Comments and blank spaces discarded • Parsing • Determines the source program's phrase structure • Source program is input as a stream of tokens (from the Scanner) • Treats each token as a terminal symbol • Representation of phrase structure • AST

  6. Lexical Analysis – A Simple Example • Scan the file character by character and group characters into words and punctuation (tokens); remove white space and comments • Source: let var y: Integer in !new year y := y+1 • Tokens for this example: let var y : Integer in y := y + 1 • Note: !new year does not appear in the list of tokens. Comments are removed along with white space.

  7. Creating Tokens – Mini-Triangle Example • Input: let var y: Integer in !new year y := y+1 • [Figure: the Input Converter turns the input into a character string held in a buffer (S = space); the Scanner reads the buffer and emits the tokens let (let), var (var), Ident. (y), colon (:), Ident. (Integer), in (in), Ident. (y), becomes (:=), Ident. (y), op. (+), Intlit. (1), eot]
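The token stream on the slide above can be reproduced with a small scanner. The following is only a minimal sketch in Python, not the Triangle compiler's scanner: the token kind names (`ident`, `colon`, `becomes`, `op`, `intlit`, `eot`) are taken from the slide's labels, while the regular expression, the `KEYWORDS` set, and the `scan` function are assumptions made for illustration. It assumes a comment runs from `!` to the end of the line.

```python
import re

# Hypothetical keyword set: only the reserved words used in the example.
KEYWORDS = {"let", "var", "in"}

# One alternative per token kind; ':=' is listed before ':' so that the
# longer token wins.
TOKEN_RE = re.compile(r"""
    (?P<comment>![^\n]*)              # comment: from '!' to end of line
  | (?P<space>\s+)                    # blanks, tabs, newlines
  | (?P<intlit>[0-9]+)                # integer literal
  | (?P<ident>[A-Za-z][A-Za-z0-9]*)   # identifier or keyword
  | (?P<becomes>:=)
  | (?P<colon>:)
  | (?P<op>[+\-*/<>=])
""", re.VERBOSE)


def scan(source):
    """Group the characters of source into (kind, spelling) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at {pos}")
        kind, spelling = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("comment", "space"):
            continue                  # discarded, never reaches the parser
        if kind == "ident" and spelling in KEYWORDS:
            kind = spelling           # a keyword is its own token kind
        tokens.append((kind, spelling))
    tokens.append(("eot", ""))        # end-of-text marker
    return tokens


SOURCE = "let var y: Integer\nin !new year\n  y := y+1"
for token in scan(SOURCE):
    print(token)
```

The comment `!new year` and all white space are dropped, so the parser only ever sees the twelve tokens listed on the slide.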

  8. Tokens in Triangle // literals, identifiers, operators... INTLITERAL = 0, "<int>", CHARLITERAL = 1, "<char>", IDENTIFIER = 2, "<identifier>", OPERATOR = 3, "<operator>", // reserved words - must be in alphabetical order... ARRAY = 4, "array", BEGIN = 5, "begin", CONST = 6, "const", DO = 7, "do", ELSE = 8, "else", END = 9, "end", FUNC = 10, "func", IF = 11, "if", IN = 12, "in", LET = 13, "let", OF = 14, "of", PROC = 15, "proc", RECORD = 16, "record", THEN = 17, "then", TYPE = 18, "type", VAR = 19, "var", WHILE = 20, "while", // punctuation... DOT = 21, ".", COLON = 22, ":", SEMICOLON = 23, ";", COMMA = 24, ",", BECOMES = 25, ":=", IS = 26, "~", // brackets... LPAREN = 27, "(", RPAREN = 28, ")", LBRACKET = 29, "[", RBRACKET = 30, "]", LCURLY = 31, "{", RCURLY = 32, "}", // special tokens... EOT = 33, "", ERROR = 34, "<error>"

  9. Grammars Revisited • Context-free grammars • A grammar generates a set of sentences • Each sentence is a string of terminal symbols • An unambiguous sentence has a unique phrase structure embodied in its syntax tree • Parsers are developed from context-free grammars

  10. Regular Expressions • A regular expression (RE) is a convenient notation for expressing a set of strings of terminal symbols • Main features • ‘|’ separates alternatives • ‘*’ indicates that the previous item may be repeated zero or more times • ‘(‘ and ‘)’ are grouping parentheses • ε is the empty string, a special string of length 0

  11. Regular Expression Basics • Algebraic Properties • | is commutative and associative • r|s = s|r • r|(s|t) = (r|s)|t • Concatenation is associative • (rs)t = r(st) • Concatenation distributes over | • r(s|t) = rs|rt • (s|t)r = sr|tr • ε is the identity for concatenation • εr = r • rε = r • * is idempotent • r** = r* • r* = (r|ε)*

  12. Regular Expression Basics • Common Extensions • r+ one or more of expression r, same as rr* • r^k k repetitions of r • r^3 = rrr • ~r the characters not in the expression r • ~[\t\n] • r-z range of characters • [0-9a-z] • r? zero or one copy of expression r (used for optional fields of an expression)

  13. Regular Expression Example • Regular Expression for Representing Months • Examples of legal inputs • January represented as 1 or 01 • October represented as 10 • First Try: [0|1|ε][0-9] • 0, 1, or ε followed by a digit between 0 and 9 • Matches all legal inputs? Yes: 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09 • Matches any illegal inputs? Yes: 0, 00, 18

  14. Regular Expression Example • Regular Expression for Representing Months • Examples of legal inputs • January represented as 1 or 01 • October represented as 10 • Second Try: [1-9]|(0[1-9])|(1[0-2]) • Any digit between 1 and 9, or 0 followed by any digit between 1 and 9, or 1 followed by any digit between 0 and 2 • Matches all legal inputs? Yes: 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09 • Matches any illegal inputs? No
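The two attempts at a month expression can be checked mechanically. This is a small sketch using Python's `re` module; the names `FIRST_TRY`, `SECOND_TRY`, and `accepts` are made up for illustration, and the first try is written as `[01]?[0-9]`, which is how "0, 1, or ε followed by a digit" translates into Python's regex syntax.

```python
import re

# First try from the slides: (0 | 1 | empty) followed by any digit.
FIRST_TRY = re.compile(r"[01]?[0-9]")
# Second try: 1-9, or 0 followed by 1-9, or 1 followed by 0-2.
SECOND_TRY = re.compile(r"[1-9]|0[1-9]|1[0-2]")


def accepts(pattern, text):
    """True if the whole string text is matched by pattern."""
    return pattern.fullmatch(text) is not None


LEGAL = [str(m) for m in range(1, 13)] + [f"0{m}" for m in range(1, 10)]
ILLEGAL = ["0", "00", "18"]

for text in LEGAL + ILLEGAL:
    print(text, accepts(FIRST_TRY, text), accepts(SECOND_TRY, text))
```

Both expressions accept every legal month, but only the second rejects 0, 00, and 18.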

  15. Regular Expression Example • Regular Expression for Floating Point Numbers • Examples of legal inputs • 1.0, 0.2, 3.14159, -1.0, 2.7e8, 1.0E-6, -2.5e+5 • Assume that a 0 is required before the decimal point for numbers less than 1, and that extra leading zeros are not prevented, so numbers such as 0011 or 0003.14159 are legal • Building the regular expression • Assume digit → 0|1|2|3|4|5|6|7|8|9 • Handle simple decimals such as 1.0, 0.2, 3.14159: digit+.digit+ (1 or more digits, followed by ., followed by 1 or more digits) • Add an optional sign (only minus, no plus): (-|ε)digit+.digit+ or -?digit+.digit+

  16. Regular Expression Example • Regular Expression for Floating Point Numbers (cont.) • Building the regular expression (cont.) • Format for the exponent: (E|e)(+|-)?(digit+) • Adding it as an optional expression to the decimal part: (-|ε)digit+.digit+((E|e)(+|-)?(digit+))?
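The finished floating-point expression can be tried out directly. The sketch below transcribes it into Python's `re` syntax (`digit` becomes `[0-9]`, the literal `.` is escaped, and `(-|ε)` becomes `-?`); the names `FLOAT_RE` and `is_float` are illustrative only.

```python
import re

# -? digit+ . digit+ ( (E|e) (+|-)? digit+ )?
FLOAT_RE = re.compile(r"-?[0-9]+\.[0-9]+([Ee][+-]?[0-9]+)?")


def is_float(text):
    """True if text as a whole matches the floating-point expression."""
    return FLOAT_RE.fullmatch(text) is not None


for text in ["1.0", "0.2", "3.14159", "-1.0", "2.7e8", "1.0E-6", "-2.5e+5",
             "0003.14159", ".5", "1.", "+1.0", "2.7e"]:
    print(text, is_float(text))
```

Because the expression as written requires a decimal point, a bare integer such as 0011 is not matched by it, even though the leading zeros themselves are allowed.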

  17. Extended BNF • Extended BNF (EBNF) • Combination of BNF and RE • N ::= X, where N is a nonterminal symbol and X is an extended RE, i.e., an RE constructed from both terminal and nonterminal symbols • EBNF • Right-hand side may use |, *, (, ) • Right-hand side may contain both terminal and nonterminal symbols

  18. Example EBNF • Expression ::= primary-Expression (Operator primary-Expression)* • primary-Expression ::= Identifier | ( Expression ) • Identifier ::= a|b|c|d|e • Operator ::= +|-|*|/ • Generates • e • a + b • a - b - c • a + (b * c) • a + (b + c) / d • a - (b - (c - (d - e)))
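An EBNF grammar of this shape maps almost one-to-one onto a recursive-descent recognizer: one function per nonterminal, a loop for `(...)*`, and a branch for `|`. The following Python sketch is an illustration under assumptions, not code from the Triangle compiler; the name `recognizes` and the single-character tokens are invented for the example.

```python
IDENTIFIERS = set("abcde")
OPERATORS = set("+-*/")


def recognizes(text):
    """True if text is a sentence generated by Expression."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_expression():
        # Expression ::= primary-Expression (Operator primary-Expression)*
        nonlocal pos
        parse_primary()
        while peek() in OPERATORS:
            pos += 1
            parse_primary()

    def parse_primary():
        # primary-Expression ::= Identifier | ( Expression )
        nonlocal pos
        if peek() in IDENTIFIERS:
            pos += 1
        elif peek() == "(":
            pos += 1
            parse_expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    try:
        parse_expression()
    except SyntaxError:
        return False
    return pos == len(tokens)      # all input must be consumed


for sentence in ["e", "a + b", "a - b - c", "a + (b * c)",
                 "a + (b + c) / d", "a - (b - (c - (d - e)))", "a +", "(a"]:
    print(repr(sentence), recognizes(sentence))
```

A recognizer only answers yes or no; a parser would build a tree node in each function instead of merely advancing `pos`.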

  19. Grammar Transformations • Left Factorization • XY | XZ is equivalent to X(Y | Z) • Before: single-Command ::= V-name := Expression | if Expression then single-Command | if Expression then single-Command else single-Command • After: single-Command ::= V-name := Expression | if Expression then single-Command (ε | else single-Command)

  20. Grammar Transformations • Elimination of left recursion • N ::= X | NY is equivalent to N ::= X(Y)* • Original: Identifier ::= Letter | Identifier Letter | Identifier Digit • Left-factored: Identifier ::= Letter | Identifier (Letter | Digit) • Left recursion eliminated: Identifier ::= Letter (Letter | Digit)*

  21. Grammar Transformations • Substitution of nonterminal symbols • Given N ::= X, we can substitute each occurrence of N with X iff N ::= X is nonrecursive and is the only production rule for N • Before: single-Command ::= for Control-Variable := Expression To-or-Downto Expression do single-Command | … • Control-Variable ::= Identifier • To-or-Downto ::= to | downto • After: single-Command ::= for Identifier := Expression (to | downto) Expression do single-Command | …

  22. Starter Sets • Starter set of an RE X • starters[[X]] • Set of terminal symbols that can start a string generated by X • Examples • starters[[his | her | its]] = {h, i} • starters[[(re)* set]] = {r, s}

  23. Starter Sets • Precise and complete definition of starters: • starters[[ε]] = {} • starters[[t]] = {t}, where t is a terminal symbol • starters[[X Y]] = starters[[X]] ∪ starters[[Y]] if X generates ε • starters[[X Y]] = starters[[X]] if X does not generate ε • starters[[X | Y]] = starters[[X]] ∪ starters[[Y]] • starters[[X*]] = starters[[X]] • To generalize for a starter set of an extended RE, add • starters[[N]] = starters[[X]], where N is a nonterminal symbol defined by the production rule N ::= X

  24. Example Starter Set • Expression ::= primary-Expression (Operator primary-Expression)* • primary-Expression ::= Identifier | ( Expression ) • Identifier ::= a|b|c|d|e • Operator ::= +|-|*|/ • starters[[Expression]] = starters[[primary-Expression (Operator primary-Expression)*]] = starters[[primary-Expression]] = starters[[Identifier]] ∪ starters[[ ( Expression ) ]] = starters[[a | b | c | d | e]] ∪ { ( } = {a, b, c, d, e, (}
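The definition of starters is directly computable. The sketch below encodes extended REs as nested Python tuples, which is an encoding invented for this example (`"eps"`, `"t"`, `"n"`, `"seq"`, `"alt"`, `"star"`), and follows the rules for starters case by case. It assumes the grammar has no left recursion; on a left-recursive rule the recursion would not terminate.

```python
def nullable(x, rules):
    """True if the extended RE x can generate the empty string."""
    kind = x[0]
    if kind in ("eps", "star"):
        return True
    if kind == "t":
        return False
    if kind == "n":
        return nullable(rules[x[1]], rules)
    if kind == "seq":
        return all(nullable(part, rules) for part in x[1:])
    if kind == "alt":
        return any(nullable(part, rules) for part in x[1:])
    raise ValueError(kind)


def starters(x, rules):
    """Set of terminal symbols that can start a string generated by x."""
    kind = x[0]
    if kind == "eps":
        return set()
    if kind == "t":
        return {x[1]}
    if kind == "n":                     # starters[[N]] = starters[[X]]
        return starters(rules[x[1]], rules)
    if kind == "star":                  # starters[[X*]] = starters[[X]]
        return starters(x[1], rules)
    if kind == "alt":                   # union over the alternatives
        return set().union(*(starters(part, rules) for part in x[1:]))
    if kind == "seq":                   # keep going while parts generate eps
        result = set()
        for part in x[1:]:
            result |= starters(part, rules)
            if not nullable(part, rules):
                break
        return result
    raise ValueError(kind)


def word(spelling):
    return ("seq", *[("t", c) for c in spelling])


def one_of(chars):
    return ("alt", *[("t", c) for c in chars])


RULES = {
    "Expression": ("seq", ("n", "primary-Expression"),
                   ("star", ("seq", ("n", "Operator"),
                             ("n", "primary-Expression")))),
    "primary-Expression": ("alt", ("n", "Identifier"),
                           ("seq", ("t", "("), ("n", "Expression"), ("t", ")"))),
    "Identifier": one_of("abcde"),
    "Operator": one_of("+-*/"),
}

print(sorted(starters(("n", "Expression"), RULES)))
print(sorted(starters(("alt", word("his"), word("her"), word("its")), {})))
print(sorted(starters(("seq", ("star", word("re")), word("set")), {})))
```

The three calls correspond to the Expression example and to the two examples his | her | its and (re)* set from the starter-set slides.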

  25. Scanning (Lexical Analysis) • The purpose of scanning is to recognize tokens in the source program, that is, to group input characters (the source program text) into tokens. • Difference between parsing and scanning: • Parsing groups terminal symbols, which are tokens, into larger phrases such as expressions and commands and analyzes the tokens for correctness and structure • Scanning groups individual characters into tokens

  26. Structure of a Compiler • Source code → Lexical Analyzer → tokens → Parser & Semantic Analyzer → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code • Symbol Table (shown beside the pipeline)

  27. Creating Tokens – Mini-Triangle Example • Input: let var y: Integer in !new year y := y+1 • [Figure: the Input Converter turns the input into a character string held in a buffer (S = space); the Scanner reads the buffer and emits the tokens let (let), var (var), Ident. (y), colon (:), Ident. (Integer), in (in), Ident. (y), becomes (:=), Ident. (y), op. (+), Intlit. (1), eot]

  28. What Does a Scanner Do? • Handle keywords (reserved words) • Recognizes identifiers and keywords • Option 1: Match explicitly • Write a regular expression for each keyword • An identifier is any alphanumeric string that is not a keyword • Option 2: Match as an identifier, then perform a lookup • No special regular expressions for keywords • When an identifier is found, perform a lookup in a preloaded keyword table • How does Triangle handle keywords? Discuss in terms of efficiency and ease of coding.
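The "match as an identifier, then look up" strategy is short enough to show in full. This Python sketch uses the reserved words from the Triangle token list; the names `KEYWORD_TABLE` and `classify` are hypothetical, and the use of a hash set is an assumption made for illustration, not a statement about how the Triangle compiler performs its lookup.

```python
# Reserved words from the Triangle token list, preloaded into a table.
KEYWORD_TABLE = frozenset([
    "array", "begin", "const", "do", "else", "end", "func", "if", "in",
    "let", "of", "proc", "record", "then", "type", "var", "while",
])


def classify(spelling):
    """Return the token kind for a word already scanned as an identifier."""
    # A single table lookup decides between keyword and identifier.
    return spelling if spelling in KEYWORD_TABLE else "identifier"


for word in ["let", "lets", "Integer", "while"]:
    print(word, "->", classify(word))
```

With this approach the scanner needs only one rule for alphanumeric words, and adding a keyword means adding one table entry rather than one more regular expression.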

  29. What Does a Scanner Do? • Remove white space • Tabs, spaces, new lines • Remove comments • Single line -- Ada comment • Multi-line, start and end delimiters { Pascal comment } /* c comment */ • Nested • Runaway comments • Nonterminated comments can’t be detected till end of file

  30. What Does a Scanner Do? • Perform look ahead • Multi-character tokens 1..10 vs. 1.10 &, && <, <= etc • Challenging input languages • FORTRAN • Keywords not reserved • Blanks are not a delimiter • Example (comma vs. decimal) DO10I=1,5 start of a do loop (equivalent to a C for loop) DO10I=1.5 an assignment statement, assignment to variable DO10I
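The 1..10 versus 1.10 case shows why a scanner sometimes has to look one character past the current one. In the sketch below, written in Python with invented token kinds (`int`, `real`, `dotdot`) and an invented function name, a `.` is treated as part of a number only if the character after it is a digit.

```python
def scan_numbers(text):
    """Split text into int, real, and '..' tokens using one extra character
    of look-ahead after a '.'."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isdigit():
            j = i
            while j < n and text[j].isdigit():
                j += 1
            # Look ahead: the '.' belongs to the number only if a digit follows.
            if j + 1 < n and text[j] == "." and text[j + 1].isdigit():
                j += 1
                while j < n and text[j].isdigit():
                    j += 1
                tokens.append(("real", text[i:j]))
            else:
                tokens.append(("int", text[i:j]))
            i = j
        elif text.startswith("..", i):
            tokens.append(("dotdot", ".."))
            i += 2
        elif ch.isspace():
            i += 1
        else:
            raise ValueError(f"illegal character {ch!r} at {i}")
    return tokens


print(scan_numbers("1..10"))
print(scan_numbers("1.10"))
```

Without the look-ahead, the scanner would commit to a real literal as soon as it saw `1.` and then fail on the second dot of `1..10`.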

  31. What Does a Scanner Do? • Challenging input languages (cont.) • PL/I, keywords not reserved IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

  32. What Does a Scanner Do? • Error Handling • Error token passed to parser, which reports the error • Recovery • Delete the characters of the current token that have been read so far and restart scanning at the next unread character • Delete the first character of the current lexeme and resume scanning from the next character • Examples of lexical errors: • 3.25e (bad format for a constant) • Var#1 (illegal character) • Some errors that are not lexical errors • Mistyped keywords, e.g., Begim • Mismatched parentheses • Undeclared variables

  33. Scanner Implementation • Issues • Simpler design – parser doesn’t have to worry about white space, etc. • Improve compiler efficiency – allows the construction of a specialized and potentially more efficient processor • Compiler portability is enhanced – input alphabet peculiarities and other device-specific anomalies can be restricted to the scanner

  34. Scanner Implementation • What are the keywords in Triangle? • How are keywords and identifiers implemented in Triangle? • Is look ahead implemented in Triangle? • If so, how?

  35. Structure of a Compiler • Source code → Lexical Analyzer → tokens → Parser and Semantic Analyzer (drawn as separate boxes) → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code • Symbol Table (shown beside the pipeline)

  36. Parsing • Given an unambiguous, context-free grammar, parsing is • Recognition of an input string, i.e., deciding whether or not the input string is a sentence of the grammar • Parsing of an input string, i.e., recognition of the input string plus determination of its phrase structure. The phrase structure can be represented by a syntax tree, or otherwise. • Unambiguity is necessary so that every sentence of the grammar has exactly one syntax tree.

  37. Parsing • The syntax of programming language constructs is described by context-free grammars. • Advantages of unambiguous, context-free grammars • A precise, yet easy-to-understand, syntactic specification of the programming language • For certain classes of grammars we can automatically construct an efficient parser that determines whether a source program is syntactically well formed. • Imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors. • Easier to add new constructs to the language if the implementation is based on a grammatical description of the language

  38. Parsing • [Diagram: sequence of tokens → parser → syntax tree] • Check the syntax (structure) of a program and create a tree representation of the program • Programming languages have non-regular constructs • Nesting • Recursion • Context-free grammars are used to express the syntax for programming languages

  39. Context-Free Grammars • Comprised of • A set of tokens or terminal symbols • A set of non-terminal symbols • A set of rules or productions which express the legal relationships between symbols • A start or goal symbol • Example: • expr → expr - digit • expr → expr + digit • expr → digit • digit → 0|1|2|…|9 • Tokens: -, +, 0, 1, 2, …, 9 • Non-terminals: expr, digit • Start symbol: expr

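  40. Context-Free Grammars • expr → expr - digit • expr → expr + digit • expr → digit • digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • [Figure: syntax tree for 3 + 8 - 2. The root expr has children expr, -, and digit (2); that child expr has children expr, +, and digit (8); the innermost expr has the single child digit (3)]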

  41. Checking for Correct Syntax • Given a grammar for a language and a program, how do you know if the syntax of the program is legal? • A legal program can be derived from the start symbol of the grammar Grammar must be unambiguous and context-free

  42. Deriving a String • Grammar: (1) expr → expr - digit • (2) expr → expr + digit • (3) expr → digit • (4) digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • The derivation begins with the start symbol • At each step of a derivation the right-hand side of a grammar rule is used to replace a non-terminal symbol • Continue replacing non-terminals until only terminal symbols remain • expr ⇒ expr - digit (Rule 1) ⇒ expr - 2 (Rule 4) ⇒ expr + digit - 2 (Rule 2) ⇒ expr + 8 - 2 (Rule 4) ⇒ digit + 8 - 2 (Rule 3) ⇒ 3 + 8 - 2 (Rule 4)

  43. Rightmost Derivation • Grammar: (1) expr → expr - digit • (2) expr → expr + digit • (3) expr → digit • (4) digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • The rightmost non-terminal is replaced in each step • Rule 1: expr ⇒ expr - digit • Rule 4: expr - digit ⇒ expr - 2 • Rule 2: expr - 2 ⇒ expr + digit - 2 • Rule 4: expr + digit - 2 ⇒ expr + 8 - 2 • Rule 3: expr + 8 - 2 ⇒ digit + 8 - 2 • Rule 4: digit + 8 - 2 ⇒ 3 + 8 - 2

  44. Leftmost Derivation • Grammar: (1) expr → expr - digit • (2) expr → expr + digit • (3) expr → digit • (4) digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • The leftmost non-terminal is replaced in each step • Rule 1: expr ⇒ expr - digit • Rule 2: expr - digit ⇒ expr + digit - digit • Rule 3: expr + digit - digit ⇒ digit + digit - digit • Rule 4: digit + digit - digit ⇒ 3 + digit - digit • Rule 4: 3 + digit - digit ⇒ 3 + 8 - digit • Rule 4: 3 + 8 - digit ⇒ 3 + 8 - 2
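The leftmost and rightmost derivations of 3 + 8 - 2 differ only in which non-terminal is replaced at each step, which a few lines of code make concrete. This is a sketch in Python; the rule numbers follow the slides, but the `derive` function and the way Rule 4 is written as a chosen digit string are conventions invented for the example.

```python
# Rules 1-3 from the slides; Rule 4 (digit -> 0|1|...|9) is given in a
# derivation as the chosen digit itself, e.g. "8".
RULES = {
    1: ("expr", ["expr", "-", "digit"]),
    2: ("expr", ["expr", "+", "digit"]),
    3: ("expr", ["digit"]),
}
NONTERMINALS = {"expr", "digit"}


def derive(steps, leftmost):
    """Apply steps to the start symbol, always replacing the leftmost
    (or rightmost) non-terminal, and return every sentential form."""
    form = ["expr"]
    history = [" ".join(form)]
    for step in steps:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        lhs, rhs = RULES[step] if isinstance(step, int) else ("digit", [step])
        if form[i] != lhs:
            raise ValueError(f"rule for {lhs} cannot be applied to {form[i]}")
        form[i:i + 1] = rhs
        history.append(" ".join(form))
    return history


LEFTMOST = derive([1, 2, 3, "3", "8", "2"], leftmost=True)
RIGHTMOST = derive([1, "2", 2, "8", 3, "3"], leftmost=False)
print(" => ".join(LEFTMOST))
print(" => ".join(RIGHTMOST))
```

Both derivations end in the same sentence and correspond to the same syntax tree; only the order in which the tree's nodes are expanded differs.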

  45. Leftmost Derivation • The leftmost non-terminal is replaced in each step • Step 1 (Rule 1): expr ⇒ expr - digit • Step 2 (Rule 2): expr - digit ⇒ expr + digit - digit • Step 3 (Rule 3): expr + digit - digit ⇒ digit + digit - digit • Step 4 (Rule 4): digit + digit - digit ⇒ 3 + digit - digit • Step 5 (Rule 4): 3 + digit - digit ⇒ 3 + 8 - digit • Step 6 (Rule 4): 3 + 8 - digit ⇒ 3 + 8 - 2 • [Figure: syntax tree for 3 + 8 - 2 with its nodes numbered 1 to 6 in the order in which the derivation steps expand them]

  46. Bottom-Up Parsing • Parser examines terminal symbols of the input string, in order from left to right • Reconstructs the syntax tree from the bottom (terminal nodes) up (toward the root node) • Bottom-up parsing reduces a string w to the start symbol of the grammar. • At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.

  47. Bottom-Up Parsing • Types of bottom-up parsing algorithms • Shift-reduce parsing • At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse. • LR(k) parsing • L is for left-to-right scanning of the input, the R is for constructing a right-most derivation in reverse, and the k is for the number of input symbols of look-ahead that are used in making parsing decisions.

  48. Bottom-Up Parsing Example: 3+8-2 • Grammar: expr → expr - digit • expr → expr + digit • expr → digit • digit → 0|1|2|…|9 • Example input: 3 + 8 - 2 • [Figure: successive snapshots of the syntax tree being built bottom-up over the input 3 + 8 - 2, starting with digit and expr nodes above the terminals]

  49. Bottom-Up Parsing Example: 3+8-2 (cont.) • [Figure: later snapshots of the same tree, in which digit and expr nodes are combined until a single expr covers the whole input 3 + 8 - 2]
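The 3+8-2 example can be replayed with a tiny shift-reduce loop. The Python sketch below is hand-written for this one grammar and is not a general LR parser; the function name and the textual form of the reductions are made up for illustration. It shifts one symbol at a time and reduces whenever the top of the stack matches the right side of a production.

```python
def shift_reduce(text):
    """Return (accepted, reductions) for the grammar
    expr -> expr - digit | expr + digit | digit,  digit -> 0|1|...|9."""
    tokens = [c for c in text if not c.isspace()]
    stack = []
    reductions = []

    def reduce_once():
        # digit -> 0|1|...|9
        if stack and stack[-1].isdigit():
            reductions.append("digit -> " + stack[-1])
            stack[-1] = "digit"
            return True
        # expr -> expr + digit  or  expr -> expr - digit
        if (len(stack) >= 3 and stack[-3] == "expr"
                and stack[-2] in ("+", "-") and stack[-1] == "digit"):
            reductions.append(f"expr -> expr {stack[-2]} digit")
            stack[-3:] = ["expr"]
            return True
        # expr -> digit, only for the very first digit of the input
        if stack == ["digit"]:
            reductions.append("expr -> digit")
            stack[:] = ["expr"]
            return True
        return False

    for token in tokens:
        stack.append(token)        # shift
        while reduce_once():       # reduce while the stack top matches a rule
            pass
    return stack == ["expr"], reductions


accepted, reductions = shift_reduce("3+8-2")
print(accepted)
for r in reductions:
    print(r)
```

Read from last to first, the reductions are Rules 1, 4, 2, 4, 3, 4, which is the rightmost derivation of 3 + 8 - 2 traced out in reverse.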

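  50. Bottom-Up Parsing Example: abbcde • Grammar: S → aABe • A → Abc | b • B → d • Example input: abbcde • First reduction: abbcde is reduced to aAbcde (the first b is replaced by A, using A → b) • [Figure: partial syntax trees over a b b c d e before and after this reduction]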
