UNIT-III

UNIT-III Compiler: phases of a compiler, difference between a phase and a pass, lexical analyzer. Top-down parsing: ambiguity, LL(1) grammars, LL(1) parsing.

Logical phases of compiler
The compiler implementation process is divided into two parts:
• Analysis of a source program
• Synthesis of a target program
Analysis involves analyzing the different constructs of the program. It consists of three phases:
1. Lexical analysis (linear analysis or scanning)
2. Syntax analysis (hierarchical analysis)
3. Semantic analysis
Synthesis of a target program includes three phases:
1. Intermediate code generator
2. Code optimizer
3. Code generator

Terms used in lexical analysis
• Lexeme: Lexemes are the smallest logical units of a program. A lexeme is a sequence of characters in the source program for which a token is produced.
• Token: A class of similar lexemes is identified by the same token.
• Pattern: A pattern is a rule that describes a token.
Ex: The pattern for an identifier is that it should consist of letters and digits, but the first character should be a letter.
Ex: int x = 10;
  int is a lexeme for the token keyword
  x is a lexeme for the token identifier
  The lexemes in this statement include int, x, = and 10.

The lexical analyzer must also pass additional information along with the token when a pattern matches more than one lexeme. These items of information are called attributes of tokens. Generally a token of type identifier has a single attribute, i.e. a pointer to the symbol table entry.
Ex: x = y + 2
Lexeme    <token, token attribute>
x         <identifier, pointer to the symbol table entry for x>
=         <assign-op, >
y         <identifier, pointer to the symbol table entry for y>
+         <add-op, >
2         <const, integer value 2>

Lexical Analyzer
The lexical analyzer is the first phase of a compiler.
• Basic functions of the lexical analyzer:
1. It reads the characters of the source program and produces a sequence of tokens as output.
2. It removes comments and white space (blank, tab and newline characters) from the source program.
3. When the lexical analyzer identifies a token of type identifier, it places the identifier in the symbol table.
4. It may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
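
The four functions listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the token class names, the two-word keyword list and the `//` comment syntax are assumptions made for the example, and the "pointer to the symbol table entry" is modelled as an index into a list.

```python
import re

# Token classes and their patterns, tried in this order. The class names, the
# keyword list and the // comment syntax are assumptions for this sketch.
TOKEN_SPEC = [
    ("comment",    r"//[^\n]*"),
    ("newline",    r"\n"),
    ("whitespace", r"[ \t]+"),
    ("keyword",    r"\b(?:int|float)\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),   # letter, then letters or digits
    ("const",      r"[0-9]+"),
    ("assign_op",  r"="),
    ("add_op",     r"\+"),
    ("semicolon",  r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); tokens are (token, attribute) pairs."""
    symbol_table = []      # identifiers in order of first appearance
    tokens = []
    line = 1               # counted so that errors can report a line number
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"line {line}: unexpected character {source[pos]!r}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "newline":
            line += 1
        elif kind in ("whitespace", "comment"):
            continue                                   # removed from the output
        elif kind == "identifier":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)            # enter into symbol table
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "const":
            tokens.append((kind, int(lexeme)))
        elif kind == "keyword":
            tokens.append((kind, lexeme))
        else:
            tokens.append((kind, None))                # operators carry no attribute
    return tokens, symbol_table
```

For the input `x = y + 2` this yields the same five pairs as the attribute example, with `x` and `y` stored at positions 0 and 1 of the symbol table.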

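Phases of compiler
Source program
  → Lexical analyzer → tokens
  → Syntax analyzer → parse tree
  → Semantic analyzer → syntax tree
  → Intermediate code generator → intermediate code
  → Code optimizer → optimized code
  → Code generator → target program
The symbol table and the error handler are connected to the phases. The lexical, syntax and semantic analyzers report lexical, syntax and semantic errors to the error handler.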

• position := initial + rate * 10
Lexical analyzer:
  id1 := id2 + id3 * 10
Syntax analyzer (tree, written in prefix form):
  :=(id1, +(id2, *(id3, 10)))
Semantic analyzer (tree after inserting the type conversion):
  :=(id1, +(id2, *(id3, inttoreal(10))))
Intermediate code generator:
  temp1 := inttoreal(10)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
Code optimizer:
  temp1 := id3 * 10.0
  id1 := id2 + temp1
Code generator:
  MOVF id3, R2
  MULF #10.0, R2
  MOVF id2, R1
  ADDF R2, R1
  MOVF R1, id1

Phases and Passes
Phase
1) A phase is a logically cohesive operation that takes input in one form and produces output in another form.
2) No intermediate files are needed between phases.
3) Splitting the compiler into more phases reduces the complexity of the program.
4) Reducing the number of phases increases the execution speed.
Pass
1) A pass is a physical scan over the source program. Portions of one or more phases are combined into a module called a pass.
2) An intermediate file is required between two passes.
3) Splitting the compiler into more passes reduces the memory required.
4) A single-pass compiler is faster than a two-pass compiler.

Two-pass assembly
• In the first pass, all the identifiers that denote storage locations are found and stored in a symbol table. Identifiers are assigned storage locations as they are encountered for the first time.
• In the second pass, the assembler scans the input again. This time, it translates each operation code into the sequence of bits representing that operation in machine language. It also translates each identifier representing a location into the address given for that identifier in the symbol table. The output of the second pass is relocatable machine code, meaning that it can be loaded starting at any location in memory.
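
A minimal sketch of the two passes, under loud assumptions: the three mnemonics, their 4-bit encodings, the one-operand instruction format and the 4-byte word size are all invented for illustration and do not describe any real assembler.

```python
# Hypothetical opcode encodings, chosen only for this example.
OPCODES = {"LOAD": "0001", "STORE": "0010", "ADD": "0011"}
WORD_SIZE = 4  # assumed size of one storage location in bytes


def first_pass(lines):
    """Assign a storage location to each identifier on its first occurrence."""
    symbol_table = {}
    for line in lines:
        _mnemonic, operand = line.split()
        if operand not in symbol_table:
            symbol_table[operand] = len(symbol_table) * WORD_SIZE
    return symbol_table


def second_pass(lines, symbol_table):
    """Translate each opcode to bits and each identifier to its address."""
    code = []
    for line in lines:
        mnemonic, operand = line.split()
        code.append(f"{OPCODES[mnemonic]} {symbol_table[operand]:08b}")
    return code


program = ["LOAD a", "ADD b", "STORE a"]
table = first_pass(program)          # {'a': 0, 'b': 4}
machine_code = second_pass(program, table)
```

The addresses produced here are offsets from 0, which is what makes the output relocatable: a loader could add any start address to them.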

Lexical Analyzer in Perspective
source program → lexical analyzer → (token) → parser
parser → (get next token) → lexical analyzer
Both the lexical analyzer and the parser use the symbol table.

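Ambiguity
• A grammar that produces more than one leftmost derivation or more than one rightmost derivation (that is, more than one parse tree) for a sentence is called an ambiguous grammar.
Ex: with the productions E → E+E | E*E | id, the sentence id+id*id has two leftmost derivations:
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
The first corresponds to the parse tree with + at the root, id+(id*id); the second to the parse tree with * at the root, (id+id)*id.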
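
One way to see the ambiguity concretely is to count parse trees. The sketch below assumes the grammar E → E+E | E*E | id and counts, for a token list, how many distinct trees derive it; the function name and the token representation are choices made for this example.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees that derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        for k in range(i + 1, j - 1):          # k is the root operator position
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


trees = count_parse_trees(["id", "+", "id", "*", "id"])   # 2
```

Any sentence for which this count exceeds 1 is a witness that the grammar is ambiguous.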

Ambiguity (cont.)
• For most parsers, the grammar must be unambiguous.
• unambiguous grammar ⇒ unique selection of the parse tree for a sentence
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.
• An unambiguous grammar should be written to eliminate the ambiguity.
• To disambiguate a grammar, we choose one of the parse trees of a sentence (generated by the ambiguous grammar) and restrict the grammar to that choice.

Elimination of ambiguity in grammars
If the grammar is in the form
  S → SβS | α1 | α2 | … | αn
it is replaced by
  S → SβS1 | S1
  S1 → α1 | α2 | … | αn
Ex: E → E+E | E*E | id | (E) is replaced by
  E → E+T | T
  T → T*F | F
  F → id | (E)

Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α (⇒+ means one or more derivation steps).
• Top-down parsing techniques cannot handle left-recursive grammars.
• So, we have to convert our left-recursive grammar into an equivalent grammar which is not left-recursive.
• The left recursion may appear in a single step of the derivation (immediate left recursion), or may appear in more than one step of the derivation.

Immediate Left-Recursion
A → Aα | β    where β does not start with A
⇒ eliminate immediate left recursion
A → βA'
A' → αA' | ε    (an equivalent grammar)
In general,
A → Aα1 | ... | Aαm | β1 | ... | βn    where β1 ... βn do not start with A
⇒ eliminate immediate left recursion
A → β1A' | ... | βnA'
A' → α1A' | ... | αmA' | ε    (an equivalent grammar)

Left-Recursion -- Problem
• A grammar may not be immediately left-recursive and still be left-recursive.
• By just eliminating the immediate left recursion, we may not get a grammar which is not left-recursive.
S → Aa | b
A → Sc | d
This grammar is not immediately left-recursive, but it is still left-recursive:
S ⇒ Aa ⇒ Sca or A ⇒ Sc ⇒ Aac causes a left recursion.
• So, we have to eliminate all left recursion from our grammar.

Eliminating left recursion: example
Rule: A → Aα1 | ... | Aαm | β1 | ... | βn, where β1 ... βn do not start with A, becomes
A → β1A' | ... | βnA'
A' → α1A' | ... | αmA' | ε
Grammar:
S → Aa | b
A → Ac | Sd | ε
Substitute S in A → Sd:
A → Ac | Aad | bd | ε
Eliminate the immediate left recursion in A:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
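
The substitute-then-eliminate procedure can be sketched as follows. It is a sketch under assumptions: a grammar is a dict from non-terminal to a list of alternatives, each alternative is a list of symbols, `[]` stands for ε, and the new non-terminal is named by appending a prime. The textbook version of this procedure is only guaranteed for grammars without cycles or ε-productions, although it happens to work on the example here.

```python
def eliminate_left_recursion(grammar):
    """Return an equivalent grammar without left recursion.

    Non-terminals are processed in dict order; [] stands for epsilon.
    """
    result = {}
    names = list(grammar)
    for i, a in enumerate(names):
        alts = [list(alt) for alt in grammar[a]]
        # Step 1: replace a leading earlier non-terminal B by B's alternatives.
        for b in names[:i]:
            expanded = []
            for alt in alts:
                if alt[:1] == [b]:
                    expanded += [gamma + alt[1:] for gamma in result[b]]
                else:
                    expanded.append(alt)
            alts = expanded
        # Step 2: remove immediate left recursion A -> A alpha | beta.
        alphas = [alt[1:] for alt in alts if alt[:1] == [a]]
        betas = [alt for alt in alts if alt[:1] != [a]]
        if alphas:
            a_prime = a + "'"
            result[a] = [beta + [a_prime] for beta in betas]
            result[a_prime] = [alpha + [a_prime] for alpha in alphas] + [[]]
        else:
            result[a] = alts
    return result


g = eliminate_left_recursion({
    "S": [["A", "a"], ["b"]],
    "A": [["A", "c"], ["S", "d"], []],
})
# g["A"]  is [['b', 'd', "A'"], ["A'"]]             i.e. A  -> bdA' | A'
# g["A'"] is [['c', "A'"], ['a', 'd', "A'"], []]    i.e. A' -> cA' | adA' | ε
```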

Left-Factoring (eliminating common prefixes from the productions)
• In general, A → αβ1 | αβ2, where α is non-empty and the first symbols of β1 and β2 are different.
• It is not clear which of the two alternative productions to use to expand a non-terminal A, i.e. A to αβ1 or A to αβ2.
• But if we rewrite the grammar as follows
A → αA'
A' → β1 | β2
we can immediately expand A to αA'.

Left-Factoring – Example
A → ad | a | ab | abc | b
⇒ (rule: A → αA', A' → β1 | β2)
A → aA' | b
A' → d | ε | b | bc
⇒
A → aA' | b
A' → d | ε | bA''
A'' → ε | c
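
The repeated factoring in this example can be sketched as a small worklist algorithm. The representation (lists of symbols, `[]` for ε) and the prime-based naming of helper non-terminals are assumptions for the sketch.

```python
def common_prefix(seqs):
    """Longest list of symbols that every sequence in `seqs` starts with."""
    prefix = []
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def fresh_name(base, grammar):
    """Append primes to `base` until the name is unused."""
    name = base + "'"
    while name in grammar:
        name += "'"
    return name


def left_factor(grammar):
    """Factor until no two alternatives of a non-terminal share a first symbol."""
    grammar = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    work = list(grammar)
    while work:
        nt = work.pop(0)
        alts, new_alts, done = grammar[nt], [], set()
        for alt in alts:
            head = alt[:1]
            group = [other for other in alts if head and other[:1] == head]
            if len(group) < 2:
                new_alts.append(alt)            # epsilon, or a unique first symbol
            elif head[0] not in done:
                done.add(head[0])
                alpha = common_prefix(group)
                helper = fresh_name(nt, grammar)
                grammar[helper] = [other[len(alpha):] for other in group]
                new_alts.append(alpha + [helper])
                work.append(helper)             # the helper may need factoring too
        grammar[nt] = new_alts
    return grammar


factored = left_factor({"A": [["a", "d"], ["a"], ["a", "b"], ["a", "b", "c"], ["b"]]})
# A -> aA' | b,  A' -> d | ε | bA'',  A'' -> ε | c
```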

Top-Down Parsing
• Parsing is the process of determining whether a string of tokens can be generated by a grammar.
• In top-down parsing, the parse tree is created from top to bottom.
• Top-down parsers:
  • Recursive-Descent Parsing
    • Backtracking is needed (if a choice of a production rule does not work, we backtrack to try other alternatives).
    • It is a general parsing technique, but not widely used.
    • Not efficient.
  • Predictive Parsing
    • No backtracking.
    • Efficient.
    • Needs a special form of grammars (LL(1) grammars).
    • Recursive predictive parsing is a special form of recursive-descent parsing without backtracking.
    • A non-recursive (table-driven) predictive parser is also known as an LL(1) parser.

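Recursive-Descent Parsing (uses Backtracking)
• Backtracking is needed.
• It tries to find the leftmost derivation.
Example:
S → aBc
B → bc | b
input: abc
First attempt: S ⇒ aBc with B → bc. The parser matches a, b and c, but then no input is left for the final c of S. This fails, so the parser backtracks.
Second attempt: S ⇒ aBc with B → b. The parser matches a, b and c, and the parse succeeds.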

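Recursive-Descent Parsing-Example
S → cAd
A → ab | a
Input: w = cad
First attempt: S ⇒ cAd with A → ab. The parser matches c and a, but b does not match the input symbol d. This fails, so the parser backtracks.
Second attempt: S ⇒ cAd with A → a. The parser matches c, a and d, and the parse succeeds.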
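
A backtracking recursive-descent recognizer can be sketched with Python generators: each call yields every input position at which a symbol could end, and moving on to the next yielded position is the backtracking step. The dict-of-alternatives grammar format and the function names are assumptions for this sketch, and, as with any top-down method, it would loop forever on a left-recursive grammar.

```python
def parse(grammar, symbol, text, pos):
    """Yield every position reachable after matching `symbol` at text[pos:]."""
    if symbol not in grammar:                      # terminal symbol
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for alternative in grammar[symbol]:            # try productions in order
        yield from parse_sequence(grammar, alternative, text, pos)


def parse_sequence(grammar, symbols, text, pos):
    """Yield every position reachable after matching all of `symbols` in turn."""
    if not symbols:
        yield pos
        return
    for mid in parse(grammar, symbols[0], text, pos):
        # If the rest fails from `mid`, the loop resumes with the next
        # alternative for symbols[0]: this is the backtracking.
        yield from parse_sequence(grammar, symbols[1:], text, mid)


def accepts(grammar, start, text):
    return any(end == len(text) for end in parse(grammar, start, text, 0))


G1 = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}
G2 = {"S": [["a", "B", "c"]], "B": [["b", "c"], ["b"]]}
ok1 = accepts(G1, "S", "cad")    # True: A -> ab fails on d, then A -> a works
ok2 = accepts(G2, "S", "abc")    # True: B -> bc fails, then B -> b works
```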

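Non-Recursive Predictive Parsing -- LL(1) Parser
• Non-recursive predictive parsing is table-driven.
• It is a top-down parser.
• It is also known as an LL(1) parser.
Structure of the parser: the predictive parsing program reads the input buffer (e.g. a + b $), keeps a stack of grammar symbols (e.g. X Y Z $, with $ at the bottom), consults the parsing table, and writes the output.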

LL(1) Grammars
• A grammar whose parsing table has no multiply-defined entries is said to be an LL(1) grammar.
• Meaning of LL(1):
  • first L: the input is scanned from left to right
  • second L: a leftmost derivation is produced
  • 1: one input symbol is used as a lookahead symbol to determine the parser action
• An entry of the parsing table may contain more than one production rule. In that case, we say that the grammar is not an LL(1) grammar.

LL(1) Parser
Input buffer
• our string to be parsed. We will assume that its end is marked with a special symbol $.
Output
• a production rule representing a step of the derivation sequence (leftmost derivation) of the string in the input buffer.
Stack
• contains the grammar symbols
• at the bottom of the stack, there is a special end marker symbol $.
• initially the stack contains only the symbol $ and the starting symbol S ($S is the initial stack).
• when the stack is emptied (i.e. only $ is left in the stack), the parsing is completed.
Parsing table
• a two-dimensional array M[A,a]
• each row is a non-terminal symbol
• each column is a terminal symbol or the special symbol $
• each entry holds a production rule.

LL(1) Parser – Parser Actions
• The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action.
• There are four possible parser actions:
1. If X and a are both $ ⇒ the parser halts (successful completion).
2. If X and a are the same terminal symbol (different from $) ⇒ the parser pops X from the stack and moves to the next symbol in the input buffer.
3. If X is a non-terminal ⇒ the parser looks at the parsing table entry M[X,a]. If M[X,a] holds a production rule X → Y1Y2...Yk, it pops X from the stack and pushes Yk, Yk-1, ..., Y1 onto the stack. The parser also outputs the production rule X → Y1Y2...Yk to represent a step of the derivation.
4. None of the above ⇒ error.
  • All empty entries in the parsing table are errors.
  • If X is a terminal symbol different from a, this is also an error case.

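LL(1) Parser – Example
Grammar:
S → aBa
B → bB | ε
LL(1) parsing table:
M[S,a] = S → aBa
M[B,b] = B → bB
M[B,a] = B → ε
Derivation (leftmost) of abba: S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
Parse tree: S has the children a, B, a; the first B has the children b, B; the second B has the children b, B; the last B derives ε.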
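
The four parser actions can be sketched as a short table-driven loop. The sketch assumes the grammar S → aBa, B → bB | ε with the table entries M[S,a], M[B,b] and M[B,a]; the dict-based table and the output format are choices made for this example.

```python
# M[X, a] -> right-hand side; a missing key is an error entry.
TABLE = {
    ("S", "a"): ["a", "B", "a"],
    ("B", "b"): ["b", "B"],
    ("B", "a"): [],                      # B -> ε
}
NONTERMINALS = {"S", "B"}


def ll1_parse(tokens, table=TABLE, start="S"):
    """Return the productions of the leftmost derivation, or raise SyntaxError."""
    stack = ["$", start]                 # $ at the bottom, start symbol on top
    tokens = list(tokens) + ["$"]
    pos, output = 0, []
    while True:
        top, a = stack[-1], tokens[pos]
        if top == "$" and a == "$":
            return output                # action 1: successful completion
        if top not in NONTERMINALS:
            if top != a:
                raise SyntaxError(f"expected {top!r}, found {a!r}")
            stack.pop()                  # action 2: match a terminal
            pos += 1
        else:
            rhs = table.get((top, a))
            if rhs is None:
                raise SyntaxError(f"empty entry M[{top},{a}]")
            stack.pop()                  # action 3: expand the non-terminal,
            stack.extend(reversed(rhs))  # pushing Yk ... Y1 so Y1 ends up on top
            output.append(f"{top} -> {' '.join(rhs) or 'ε'}")


steps = ll1_parse("abba")
# ['S -> a B a', 'B -> b B', 'B -> b B', 'B -> ε']
```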

Constructing LL(1) Parsing Tables
• Two functions are used in the construction of LL(1) parsing tables: FIRST and FOLLOW.
• FIRST(α) is the set of terminal symbols that occur as first symbols in strings derived from α, where α is any string of grammar symbols. If α derives ε, then ε is also in FIRST(α).
• FOLLOW(A) is the set of terminals that occur immediately after (follow) the non-terminal A in the strings derived from the starting symbol.
  • a terminal a is in FOLLOW(A) if S ⇒* αAaβ
  • $ is in FOLLOW(A) if S ⇒* αA

FIRST(X) is computed as follows:
• If X is a terminal, then FIRST(X) = {X}.
• If X → ε is a production, then add ε to FIRST(X).
• If X is a non-terminal and X → Y1Y2...Yn is a production, add FIRST(Yi) - {ε} to FIRST(X) if all the preceding Yj (j < i) contain ε in their FIRST sets. If every Yi contains ε in its FIRST set, add ε to FIRST(X) as well.

FOLLOW is computed as follows:
• Place $ in FOLLOW(S), where S is the start symbol and $ is the input end marker.
• For productions A → αBβ, everything in FIRST(β) except ε goes into FOLLOW(B).
• For productions A → αB, or A → αBβ where FIRST(β) contains ε, FOLLOW(B) contains everything that is in FOLLOW(A).

Example for computing FIRST function
Grammar:
1) E → TE'
2) E' → +TE' | ε
3) T → FT'
4) T' → *FT' | ε
5) F → (E) | id
FIRST sets:
1) FIRST(E) = FIRST(TE') = FIRST(T) = FIRST(FT') = FIRST(F), so FIRST(E) = {(, id}
2) FIRST(+TE') = {+} and FIRST(ε) = {ε}, so FIRST(E') = {+, ε}
3) FIRST(T) = FIRST(FT') = FIRST(F) = {(, id}, so FIRST(T) = {(, id}
4) FIRST(*FT') = {*} and FIRST(ε) = {ε}, so FIRST(T') = {*, ε}
5) FIRST((E)) = {(} and FIRST(id) = {id}, so FIRST(F) = {(, id}

Example for computing FOLLOW function
There are two cases for computing FOLLOW, based on the form of the production:
1) A → αBβ
  i) FOLLOW(B) includes FIRST(β) (when FIRST(β) does not contain ε)
  ii) FOLLOW(B) includes (FIRST(β) - {ε}) ∪ FOLLOW(A) (when FIRST(β) contains ε)
2) A → αB
  i) FOLLOW(B) includes FOLLOW(A)

1) FOLLOW(E)
E → TE' (find the occurrences of the non-terminal E on the right side of productions)
i) E → TE', T → FT', F → (E) | id
ii) F → (E) is in the form A → αBβ with α = (, B = E and β = )
iii) FOLLOW(B) = FIRST(β)
iv) FOLLOW(E) = FIRST( ) ) = { ) }
v) E is the start symbol, so $ is also in FOLLOW(E): FOLLOW(E) = {$, )}

2) FOLLOW(E') (and FOLLOW(T))
E' → +TE' | ε
i) E' → +TE' is in the form A → αBβ with B = T and β = E'. FIRST(E') contains ε, hence FOLLOW(B) = (FIRST(β) - {ε}) ∪ FOLLOW(A)
ii) FOLLOW(T) = {+} ∪ FOLLOW(E')
iii) FOLLOW(T) = {+} ∪ ? (FOLLOW(E') is not known yet)
iv) FOLLOW(E'): find the occurrences of E' on the right side of productions
v) E → TE' is in the form A → αB
vi) FOLLOW(E') = FOLLOW(E) = {$, )}
vii) FOLLOW(T) = {+} ∪ {$, )} = {+, $, )}

3) FOLLOW(T')
T → FT' is in the form A → αB
a) hence FOLLOW(B) = FOLLOW(A)
b) FOLLOW(T') = FOLLOW(T) = {+, $, )}

4) FOLLOW(F)
T' → *FT' | ε
a) T' → *FT' is in the form A → αBβ with B = F and β = T'. FIRST(T') contains ε, hence FOLLOW(B) = (FIRST(β) - {ε}) ∪ FOLLOW(A)
b) FOLLOW(F) = {*} ∪ FOLLOW(T')
c) FOLLOW(F) = {*} ∪ ? (FOLLOW(T') is needed)
d) FOLLOW(T'): find the occurrences of T' on the right side of productions
e) T → FT' is in the form A → αB
f) FOLLOW(T') = FOLLOW(T) = {+, $, )}
g) FOLLOW(F) = {*} ∪ {+, $, )} = {*, +, $, )}

Summary of the FIRST and FOLLOW sets
• FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
• FIRST(E') = {+, ε}
• FIRST(T') = {*, ε}
• FOLLOW(E) = FOLLOW(E') = {$, )}
• FOLLOW(T) = FOLLOW(T') = {+, $, )}
• FOLLOW(F) = {*, +, $, )}
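
These sets can be cross-checked mechanically. The following sketch computes FIRST and FOLLOW by repeating the rules until nothing changes (a fixed-point iteration); the grammar encoding, with `[]` standing for ε, and the function names are assumptions for the example.

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given FIRST of the non-terminals."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal: {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)                  # every symbol (or the empty string) can vanish
    return out


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for a, alts in grammar.items():
            for alt in alts:
                for i, b in enumerate(alt):
                    if b not in grammar:
                        continue                      # only non-terminals have FOLLOW
                    rest = first_of_string(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:                   # beta can vanish: add FOLLOW(A)
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
```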

Constructing LL(1) Parsing Table -- Algorithm
• For each production rule A → α of a grammar G:
  • for each terminal a in FIRST(α) ⇒ add A → α to M[A,a]
  • if ε is in FIRST(α) ⇒ for each terminal b in FOLLOW(A), add A → α to M[A,b]
  • if ε is in FIRST(α) and $ is in FOLLOW(A) ⇒ add A → α to M[A,$]
• All other undefined entries of the parsing table are error entries.

Constructing LL(1) Parsing Table -- Example
• E → TE': FIRST(TE') = {(, id} ⇒ E → TE' into M[E,(] and M[E,id]
• E' → +TE': FIRST(+TE') = {+} ⇒ E' → +TE' into M[E',+]
• E' → ε: FIRST(ε) = {ε} ⇒ none, but since ε is in FIRST(ε) and FOLLOW(E') = {$, )} ⇒ E' → ε into M[E',$] and M[E',)]
• T → FT': FIRST(FT') = {(, id} ⇒ T → FT' into M[T,(] and M[T,id]
• T' → *FT': FIRST(*FT') = {*} ⇒ T' → *FT' into M[T',*]
• T' → ε: FIRST(ε) = {ε} ⇒ none, but since ε is in FIRST(ε) and FOLLOW(T') = {$, ), +} ⇒ T' → ε into M[T',$], M[T',)] and M[T',+]
• F → (E): FIRST((E)) = {(} ⇒ F → (E) into M[F,(]
• F → id: FIRST(id) = {id} ⇒ F → id into M[F,id]
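
The table-filling rule can be sketched directly. To keep the example short, the FIRST and FOLLOW sets of the expression grammar are entered by hand rather than computed; the dict layout, where each table cell holds a list so that multiply-defined entries would be visible, is an assumption of the sketch.

```python
EPS = "ε"
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {"$", ")"}, "E'": {"$", ")"}, "T": {"+", "$", ")"},
          "T'": {"+", "$", ")"}, "F": {"*", "+", "$", ")"}}
PRODUCTIONS = [
    ("E", ["T", "E'"]), ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T", ["F", "T'"]), ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]


def first_of_string(symbols):
    """FIRST of a right-hand side; [] (epsilon) gives {ε}."""
    out = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})       # terminal: {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def build_table(productions):
    """Map (non-terminal, lookahead) to the list of applicable right-hand sides."""
    table = {}
    for head, body in productions:
        first = first_of_string(body)
        lookaheads = first - {EPS}
        if EPS in first:
            lookaheads |= FOLLOW[head]          # includes $ when $ is in FOLLOW(head)
        for a in lookaheads:
            table.setdefault((head, a), []).append(body)
    return table


M = build_table(PRODUCTIONS)
is_ll1 = all(len(entries) == 1 for entries in M.values())   # True for this grammar
```

All remaining (non-terminal, terminal) pairs are simply absent from `M`, which corresponds to the error entries.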

A Grammar which is not LL(1)
S → iCtSE | a     FOLLOW(S) = {$, e}
E → eS | ε        FOLLOW(E) = {$, e}
C → b             FOLLOW(C) = {t}
FIRST(iCtSE) = {i}
FIRST(a) = {a}
FIRST(eS) = {e}
FIRST(ε) = {ε}
FIRST(b) = {b}
Parsing table entries:
M[S,i] = S → iCtSE
M[S,a] = S → a
M[E,e] = E → eS and E → ε
M[E,$] = E → ε
M[C,b] = C → b
There are two production rules for M[E,e]. Problem ⇒ ambiguity.

A Grammar which is not LL(1) (cont.)
• What do we have to do if the resulting parsing table contains multiply defined entries?
  • If we did not eliminate left recursion, eliminate the left recursion in the grammar.
  • If the grammar is not left factored, we have to left factor the grammar.
  • If the new grammar's parsing table still contains multiply defined entries, that grammar is ambiguous or it is inherently not an LL(1) grammar.
• A left-recursive grammar cannot be an LL(1) grammar.
  • A → Aα | β
  • Any terminal that appears in FIRST(β) also appears in FIRST(Aα), because Aα ⇒ βα.
  • If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
• If a grammar is not left factored, it cannot be an LL(1) grammar.
  • A → αβ1 | αβ2
  • Any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.

Properties of LL(1) Grammars
• A grammar G is LL(1) if and only if the following conditions hold for every two distinct production rules A → α and A → β:
1. α and β cannot both derive strings starting with the same terminal.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a terminal in FOLLOW(A).
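
The three conditions can be checked pairwise without building the table. In this sketch the FIRST set of each alternative and the FOLLOW sets are supplied by hand for the grammar S → iCtSE | a, E → eS | ε, C → b; the input format and the wording of the reported messages are assumptions.

```python
from itertools import combinations

EPS = "ε"


def ll1_violations(alternatives, follow):
    """alternatives: A -> list of (label, FIRST set of that alternative).
    follow: A -> FOLLOW(A). Returns a list of human-readable conflicts."""
    problems = []
    for a, alts in alternatives.items():
        for (l1, f1), (l2, f2) in combinations(alts, 2):
            common = (f1 & f2) - {EPS}
            if common:                                   # condition 1
                problems.append(f"{a}: {l1} and {l2} both start with {sorted(common)}")
            if EPS in f1 and EPS in f2:                  # condition 2
                problems.append(f"{a}: {l1} and {l2} both derive ε")
            # Condition 3, checked in both directions.
            for (la, fa), (lb, fb) in (((l1, f1), (l2, f2)), ((l2, f2), (l1, f1))):
                clash = (fa - {EPS}) & follow[a] if EPS in fb else set()
                if clash:
                    problems.append(
                        f"{a}: {la} starts with {sorted(clash)} from FOLLOW({a}) "
                        f"while {lb} derives ε"
                    )
    return problems


ALTERNATIVES = {
    "S": [("iCtSE", {"i"}), ("a", {"a"})],
    "E": [("eS", {"e"}), ("ε", {EPS})],
    "C": [("b", {"b"})],
}
FOLLOW = {"S": {"$", "e"}, "E": {"$", "e"}, "C": {"t"}}
conflicts = ll1_violations(ALTERNATIVES, FOLLOW)
# One conflict, for E: eS starts with e, which is in FOLLOW(E), while ε derives ε.
```

The single reported conflict is the same one that shows up as the doubly defined entry M[E,e].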

Error Recovery in Predictive Parsing
• An error may occur in predictive parsing (LL(1) parsing)
  • if the terminal symbol on the top of the stack does not match the current input symbol.
  • if the top of the stack is a non-terminal A, the current input symbol is a, and the parsing table entry M[A,a] is empty.
• What should the parser do in an error case?
  • The parser should be able to give an error message (as meaningful an error message as possible).
  • It should recover from that error case, and it should be able to continue parsing the rest of the input.

Error Recovery Techniques
• Panic-Mode Error Recovery
  • Skip the input symbols until a synchronizing token is found.
• Phrase-Level Error Recovery
  • Each empty entry in the parsing table is filled with a pointer to a specific error routine that takes care of that error case.
• Error-Productions
  • If we have a good idea of the common errors that might be encountered, we can augment the grammar with productions that generate erroneous constructs.
  • When an error production is used by the parser, we can generate appropriate error diagnostics.
  • Since it is almost impossible to know all the errors that can be made by programmers, this method is not practical.
• Global-Correction
  • Ideally, we would like a compiler to make as few changes as possible in processing incorrect inputs.
  • We have to globally analyze the input to find the error.
  • This is an expensive method, and it is not used in practice.

Panic-Mode Error Recovery in LL(1) Parsing
• In panic-mode error recovery, we skip all the input symbols until a synchronizing token is found.
• What is the synchronizing token?
  • All the terminal symbols in the FOLLOW set of a non-terminal can be used as a synchronizing token set for that non-terminal.
• So, a simple panic-mode error recovery for LL(1) parsing:
  • All the empty entries are marked as synch, to indicate that the parser will skip all the input symbols until it finds a symbol in the FOLLOW set of the non-terminal A which is on the top of the stack. Then the parser pops that non-terminal A from the stack. The parsing continues from that state.
  • To handle an unmatched terminal symbol, the parser pops that unmatched terminal symbol from the stack and issues an error message saying that the unmatched terminal was inserted.

Panic mode recovery
• Possible synchronizing tokens for a non-terminal A:
  • the tokens in FOLLOW(A)
    • When one is found, pop A off the stack and try to continue.
  • the tokens in FIRST(A)
    • When one is found, match it and try to continue.
  • tokens such as semicolons that terminate statements

Panic-Mode Error Recovery - Example
S → AbS | e | ε
A → a | cAd
FOLLOW(S) = {$}
FOLLOW(A) = {b, d}
stack     input     output
$S        ceadb$    S → AbS
$SbA      ceadb$    A → cAd
$SbdAc    ceadb$
$SbdA     eadb$     Error: unexpected e (illegal A) (remove all input tokens until b or d, i.e. FOLLOW(A), then pop A)
$Sbd      db$
$Sb       b$
$S        $         S → ε
$         $         accept
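
The trace above can be reproduced with a small simulation of the simple panic-mode scheme: on an empty table entry, skip input until a token in FOLLOW of the non-terminal on top of the stack and then pop it; on a terminal mismatch, pop the terminal and report it as inserted. The table is entered by hand for this grammar, and the error message texts are assumptions.

```python
# LL(1) table for S -> AbS | e | ε and A -> a | cAd; missing keys are errors.
TABLE = {
    ("S", "a"): ["A", "b", "S"], ("S", "c"): ["A", "b", "S"],
    ("S", "e"): ["e"], ("S", "$"): [],
    ("A", "a"): ["a"], ("A", "c"): ["c", "A", "d"],
}
FOLLOW = {"S": {"$"}, "A": {"b", "d"}}      # the keys double as the non-terminals


def parse_with_recovery(text):
    """Parse `text`, recovering from errors; return the list of error messages."""
    stack, tokens, pos = ["$", "S"], list(text) + ["$"], 0
    errors = []
    while stack[-1] != "$" or tokens[pos] != "$":
        top, a = stack[-1], tokens[pos]
        if top in FOLLOW:                               # non-terminal on top
            rhs = TABLE.get((top, a))
            if rhs is not None:
                stack.pop()
                stack.extend(reversed(rhs))
            else:
                errors.append(f"unexpected {a} (illegal {top})")
                while tokens[pos] not in FOLLOW[top] and tokens[pos] != "$":
                    pos += 1                            # skip input tokens
                stack.pop()                             # then pop the non-terminal
        elif top == a:
            stack.pop()                                 # matched a terminal
            pos += 1
        elif top == "$":
            errors.append(f"unexpected {a} after the end of the sentence")
            pos += 1
        else:
            errors.append(f"missing {top} inserted")    # unmatched terminal
            stack.pop()
    return errors


report = parse_with_recovery("ceadb")       # ['unexpected e (illegal A)']
```

Note that this scheme also skips the `a` after `e`, even though `a` could start an A; using FIRST(A) as additional synchronizing tokens would be a refinement.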
