Chapter 3

Chapter 3 Chang Chi-Chung 2007.4.12

The Role of the Lexical Analyzer Token LexicalAnalyzer Parser SourceProgram getNextToken error error Symbol Table

The Reason for Using the Lexical Analyzer • Simplifies the design of the compiler • LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match) • Compiler efficiency is improved • Systematic techniques to implement lexical analyzers by hand or automatically from specifications • Stream buffering methods to scan input • Compiler portability is enhanced • Input-device-specific peculiarities can be restricted to the lexical analyzer.

Tokens, Patterns, and Lexemes • Token (符號單元) • A pair consisting of a token name and optional arrtibute value. • Example: num, id • Pattern (樣本) • A description of the form for the lexemes of a token. • Example: “non-empty sequence of digits”, “letter followed by letters and digits” • Lexeme (詞) • A sequence of characters that matches the pattern for a token. • Example: 123, abc

Example: Tokens, Patterns, and Lexemes

Input Buffering lexemeBegin forward Sentinels

Strings and Languages • Alphabet • An alphabet is a finite set of symbols (characters) • String • A stringis a finite sequence of symbols from  • s denotes the length of string s •  denotes the empty string, thus  = 0 • Language • A language is a countable set of strings over some fixed alphabet  • Abstract Language Φ • {ε}

String Operations • Concatenation (連接) • The concatenation of two strings x and y is denoted by xy • Identity (單位元素) • The empty string is the identity under concatenation. • s = s = s • Exponentiation • Define s0 = si = si-1s for i > 0 • By Define s1 = s s2 = ss

Language Operations • UnionL M = { ssL or sM } • ConcatenationL M = { xyx L and yM} • ExponentiationL0 = {  } Li = Li-1L • Kleene closure(封閉包)L* = ∪i=0,…,Li • Positive closureL+ = ∪i=1,…,Li

Regular Expressions • Regular Expressions • A convenient means of specifying certain simple sets of strings. • We use regular expressions to define structures of tokens. • Tokens are built from symbols of a finite vocabulary. • Regular Sets • The sets of strings defined by regular expressions.

Regular Expressions • Basis symbols: •  is a regular expression denoting language L() = {} • a   is a regular expression denoting L(a) = {a} • If r and s are regular expressions denoting languages L(r) and M(s) respectively, then • rs is a regular expression denoting L(r)  M(s) • rs is a regular expression denoting L(r)M(s) • r* is a regular expression denoting L(r)* • (r) is a regular expression denoting L(r) • A language defined by a regular expression is called a regular set.

Operator Precedence

Algebraic Laws for Regular Expressions

Regular Definitions • If Σ is an alphabet of basic symbols, then a regular definitions is a sequence of definitions of the form: d1 r1 d2 r2 …dn rn • Each di is a new symbol, not in Σ and not the same as any other of d’s. • Each ri is a regular expression over the alphabet  {d1, d2, …, di-1 } • Any dj in rican be textually substituted inrito obtain an equivalent set of definitions

Example: Regular Definitions Regular Definitions letter_ A | B | … | Z | a | b | … | z | _ digit 0 | 1 | … | 9 id letter_ ( letter_ | digit )* Regular definitions are not recursivedigits  digit digits digit wrong

Extensions of Regular Definitions • One or more instance • r+ = rr* = r*r • r* = r+ | ε • Zero or one instance • r?=r |ε • Character classes • [a-z] = abc…z • [A-Za-z] = A|B|…|Z|a|…|z • Example • digit  [0-9] • num  digit+ (. digit+)? ( E (+-)? digit+ )?

Regular Definitions and Grammars Context-FreeGrammars stmt ifexprthenstmtif exprthenstmtelsestmtexpr  term reloptermtermterm  idnum ws ( blank | tab | newline )+ Regular Definitions digit [0-9] letter [A-Za-z] if if then then else elserelop <<=<>>>== id letter ( letter | digit )* num digit+ (. digit+)? ( E (+ | -)? digit+ )?

Transition Diagrams relop <<=<>>>== start < = 0 1 2 return(relop, LE) > 3 return(relop, NE) other * 4 return(relop, LT) = return(relop, EQ) 5 > = 6 7 return(relop, GE) other * 8 return(relop, GT)

Transition Diagrams id letter ( letter | digit )* letter or digit * other start letter 9 10 11 return (getToken(), installID() )

Finite Automata • Finite Automata are recognizers. • FA simply say “Yes” or “No” about each possible input string. • A FA can be used to recognize the tokens specified by a regular expression • Use FA to design of a Lexical Analyzer Generator • Two kind of the Finite Automata • Nondeterministic finite automata (NFA) • Deterministic finite automata (DFA) • Both DFA and NFA are capable of recognizing the same languages.

NFA Definitions • NFA = { S, , , s0, F } • A finite set of statesS • A set of input symbols Σ • input alphabet, ε is not in Σ • A transition function  •  : S  S • A special start state s0 • A set of final states F, F  S (accepting states)

Transition Graph for FA is a state is a transition is a the start state is a final state

3 Example a 0 1 2 a b c c • This machine accepts abccabc, but it rejects abcab. • This machine accepts (abc+)+.

a start b b a 0 1 2 3 b Transition Table • The mapping  of an NFA can be represented in a transition table (0, a) = {0,1}(0, b) = {0}(1, b) = {2}(2, b) = {3}

DFA • DFA is a special case of an NFA • There are no moves on input ε • For each state s and input symbol a, there is exactly one edge out of s labeled a. • Both DFA and NFA are capable of recognizing the same languages.

Input An input string x terminated by an end-of-file character eof. A DFA D with start state s0, accepting states F, and transition function move. Output Answer “yes” if D accepts x; “no” otherwise. s = s0 c = nextChar(); while ( c != eof ) { s = move(s, c); c = nextChar(); } if (s is in F ) return “yes”; else return “no”; Simulating a DFA

a start b b a 0 1 2 3 b b a b b 0 1 2 3 a a a S = {0,1,2,3} = {a, b}s0 = 0F = {3} NFA vs DFA (a | b)*abb

The Regular Language • The regularlanguagedefinedby an NFA is the set of input strings it accepts. • Example: (ab)*abb for the example NFA • An NFA accepts an input string x if and only if • there is some path with edges labeled with symbols from x in sequence from the start state to some accepting state in the transition graph • A state transition from one state to another on the path is called a move.

Theorem • The followings are equivalent • Regular Expression • NFA • DFA • Regular Language • Regular Grammar

Convert Concept Regular Expression Minimization Deterministic Finite Automata Nondeterministic Finite Automata Deterministic Finite Automata

 N(s)  s | t   N(t) a  s t N(s) N(t)   N(s) s*  Construction of an NFA from a Regular Expression Use Thompson’s Construction ε a 

r11 Example r9 r10 • ( a | b )* a b b r7 r8 b r5 r6 b r4 * a r3 ( ) r3 = r4 r1 | r2 a b

 a 2 3     start a b b 0 1 6 7 8 9 10   b 4 5  Example • ( a | b )* a b b

Conversion of an NFA to a DFA • The subset construction algorithm converts an NFA into a DFA using the following operation.

Subset Construction(1) Initially, -closure(s0) is the only state in Dstates and it is unmarked; while (there is an unmarked state T in Dstates){ mark T;for (each input symbol a   ) { U = -closure( move(T, a) );if (U is not in Dstates) add U as an unmarked state to DstatesDtran[T, a] = U } }

Computing ε- closure(T)

 a 2 3     start a b b 0 1 6 7 8 9 10   b 4 5  Example • ( a | b )* a b b b C b a b start a b b A B D E a a a

a 1 2  start a b b 3 0 4 5 6  a b  7 8 b Example • a • abb • a*b+ DstatesA = {0,1,3,7}B = {2,4,7}C = {8}D = {7}E = {5,8}F = {6,8} a 0137 247 a b 7 b b b 8 68 58 b b

Minimizing the DFA • Step 1 • Start with an initial partition II with two group: F and S-F (aceepting and nonaccepting) • Step 2 • Split Procedure • Step 3 • If ( IInew = II ) IIfinal = II and continue step 4 else II = IInewand go to step 2 • Step 4 • Construct the minimum-state DFA by IIfinal group. • Delete the dead state

Split Procedure Initially, let IInew = II for ( each group G of II ) { Partition G into subgroup such that two states s and t are in the same subgroup if and only if for all input symbol a, states s and t have transition on a to states in the same group of II. /* at worst, a state will be in a subgroup by itself */ replace G in IInew by the set of all subgroup formed }

Example • initially, two sets {1, 2, 3, 5, 6}, {4, 7}. • {1, 2, 3, 5, 6} splits {1, 2, 5}, {3, 6} on c. • {1, 2, 5} splits {1}, {2, 5} on b.

Minimizing the DFA • Major operation: partition states into equivalent classes according to • final / non-final states • transition functions ( A B C D E ) ( A B C D ) ( E ) ( A B C ) ( D ) ( E ) ( A C ) ( B ) ( D ) ( E )

Important States of an NFA • The “important states” of an NFA are those without an -transition, that is • if move({s}, a) for some athen s is an important state • The subset construction algorithm uses only the important states when it determines-closure ( move(T, a) ) • Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#

Converting a RE Directly to a DFA • Construct a syntax tree for (r)# • Traverse the tree to construct functions nullable, firstpos, lastpos, and followpos • Construct DFAD by algorithm 3.62

Function Computed From the Syntax Tree • nullable(n) • The subtree at node n generates languages including the empty string • firstpos(n) • The set of positions that can match the first symbol of a string generated by the subtree at node n • lastpos(n) • The set of positions that can match the last symbol of a string generated be the subtree at node n • followpos(i) • The set of positions that can follow position i in the tree

Rules for Computing the Function

Computing followpos for (eachnode n in the tree) { //n is a cat-node with left child c1 and right child c2 if ( n == c1．c2) for (each i in lastpos(c1) ) followpos(i) = followpos(i)  firstpos(c2);else if (n is a star-node) for ( eachi in lastpos(n) )followpos(i) = followpos(i)  firstpos(n); }

Converting a RE Directly to a DFA Initialize Dstates to contain only the unmarked state firstpos(n0), where n0 is the root of syntax tree T for (r)#; while ( there is an unmarked state Sin Dstates) { mark S; for (each input symbol a  ) {let U be the union of followpos(p) for all p in S that correspond to a; if (U is not in Dstates )add U as an unmarked state to DstatesDtran[S,a] = U; } }

○ # ○ 6 b ○ 5 n b ○ 4 a * 3 | a b 1 2 Example ( a | b )* a b b # n = ( a | b )* a nullable(n) = false firstpos(n) = { 1, 2, 3 } lastpos(n) = { 3 } followpos(1) = {1, 2, 3 }

Example {1, 2, 3} {6} ( a | b )* a b b # # {6} {6} {1, 2, 3} {5} 6 b {1, 2, 3} {4} {5} {5} nullable 5 b {1, 2, 3} {3} {4} {4} 4 firstpos lastpos a {3} {3} {1, 2} * {1, 2} 3 | {1, 2} {1, 2} a b {1} {1} {2} {2} 1 2

Chapter 3

Chapter 3

Presentation Transcript

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

chapter 3

CHAPTER 3-3

Chapter 3-3

Chapter 3 Chapter 3

CHAPTER 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

CHAPTER 3

Chapter 3