
CHAPTER 3 LEXICAL ANALYSIS



  1. CHAPTER 3 LEXICAL ANALYSIS. From: Chapter 3, the Dragon Book, and the Qin book. Sequence of characters → sequence of tokens.

  2. 3.0 Approaches to implement a lexical analyzer. Construct a diagram that illustrates the structure of the tokens of the source language, and then hand-translate the diagram into a program for finding tokens.

  3. Pattern matching technique • Specify and design programs that execute actions triggered by patterns in strings • Introduce a pattern-action language called Lex for specifying lexical analyzers • Patterns are specified by regular expressions • A compiler for Lex can generate an efficient finite automaton recognizer for the regular expressions

  4. A Simple Lexical Analyzer. Example input: E = M * C ** 2. Is each lexeme a keyword? An identifier? An operator? Pattern matching classifies the lexemes, e.g. E, M, and C match the identifier pattern.

  5. Grammar for the identifier pattern: Id → Letter ( Letter | Digit )*, Letter → a | b | c | … | z, Digit → 0 | 1 | 2 | … | 9 • As a regular expression: (a|b|c|…|z)(a|b|c|…|z|0|1|2|…|9)*

  6. Finite state automaton for identifiers: from state 1, a letter leads to state 2 (the accepting state); state 2 loops back to itself on letter or digit.

  7. /* C version (corrected): returns 1 if the string s is an identifier,
        i.e. a lowercase letter followed by lowercase letters or digits. */
     int identifier_pattern_matching(const char *s)
     {
         int flag = 1;                                   /* state 1: expect the first letter */
         char ch = *s++;
         if (ch >= 'a' && ch <= 'z')
             flag = 2;                                   /* state 2: inside the identifier   */
         else
             return 0;
         while (flag == 2) {
             ch = *s++;
             if (ch == '\0')
                 return 1;                               /* end of input: identifier found   */
             if ((ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9'))
                 flag = 2;
             else
                 return 0;                               /* any other character: reject      */
         }
         return 1;
     }

  8. 3.1 The role of the lexical analyzer. The work of a lexical analyzer is divided into two processes: Scanning — no tokenization of the input; deletion of comments and compaction of consecutive whitespace characters. Lexical analysis proper — producing the tokens.

  9. Source program → Lexical analyzer → token → Parser; the parser requests the next token from the lexical analyzer ("get next token"), and both consult the symbol table.

  10. 3.1.1 Reasons for separating lexical analysis and parsing: Simplicity of design is the most important consideration. Compiler efficiency is improved. Compiler portability is enhanced.

  11. 3.1.2 Tokens, Patterns, and Lexemes. A token is a pair consisting of a token name and an optional attribute value. Token names: keywords, operators, identifiers, constants, literal strings, punctuation symbols (such as commas and semicolons). A pattern is a description of the form that the lexemes of a token may take. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. E.g. relation: { <, <=, >, >=, ==, <> }

  12. 3.1.3 Attributes for Tokens. Typically a pointer to the symbol-table entry in which the information about the token is kept. E.g. 3.2: E = M * C ** 2 yields <id, pointer to symbol-table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <multi_op> <id, pointer to symbol-table entry for C> <exp_op> <num, integer value 2>
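
A minimal sketch in C of how such <token name, attribute value> pairs might be represented; the names (TokenName, symtab_index, etc.) are illustrative assumptions, not taken from the book:

      /* A token is a pair of a token name and an optional attribute value. */
      enum TokenName { ID, ASSIGN_OP, MULTI_OP, EXP_OP, NUM };

      struct Token {
          enum TokenName name;
          union {
              int symtab_index;   /* for ID: index of the symbol-table entry */
              int int_value;      /* for NUM: the integer value, e.g. 2      */
          } attribute;            /* operators such as ASSIGN_OP carry no attribute */
      };

For instance, <num, integer value 2> would be built as { NUM, { .int_value = 2 } }.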

  13. 3.1.4 Lexical Errors. It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. E.g. in "fi ( a == f(x)) ...", is fi a misspelling of the keyword "if" or an identifier?

  14. Suppose a situation in which none of the patterns for tokens matches a prefix of the remaining input. E.g. $%#if a>0 a+=1;

  15. The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token. This technique may occasionally confuse the parser, but in an interactive computing environment it may be quite adequate.
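
A minimal C sketch of panic-mode recovery. The predicate can_start_token() here is only an illustrative assumption; the real test depends on the token patterns of the language being scanned:

      #include <ctype.h>

      /* Illustrative predicate: assume a token can start with a letter, a digit,
         or one of a few operator characters. */
      static int can_start_token(char c) {
          return isalnum((unsigned char)c) || c == '<' || c == '>' || c == '=';
      }

      /* Delete successive characters from the remaining input until one that
         could start a well-formed token is found; return the resume point. */
      const char *panic_mode_skip(const char *p) {
          while (*p != '\0' && !can_start_token(*p))
              p++;
          return p;
      }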

  16. Other possible error-recovery actions Delete one extraneous character from the remaining input. Insert a missing character into the remaining input. Replace a character by another character. Transpose two adjacent characters.

  17. 3.2 Input Buffering. Examining ways of speeding up the reading of the source program: a two-buffer scheme that handles large lookaheads safely.

  18. 3.2.1 Buffer Pairs. Two buffers of the same size, say 4096 characters each, are alternately reloaded. Two pointers into the input are maintained: pointer lexeme_begin marks the beginning of the current lexeme; pointer forward scans ahead until a pattern match is found.

  19. if forward at end of first half then begin
          reload second half;
          forward := forward + 1
      end
      else if forward at end of second half then begin
          reload first half;
          move forward to beginning of first half
      end
      else forward := forward + 1;

  20. 3.2.2 Sentinels. Buffer contents with sentinels: E = M * | eof | C * * 2 | eof | eof — the first half ends with an eof sentinel, and the second half holds the rest of the input followed by an eof marking the end of the input and the sentinel eof at the end of the half.

  21. forward := forward + 1;
      if forward↑ = eof then begin
          if forward at end of first half then begin
              reload second half;
              forward := forward + 1
          end
          else if forward at end of second half then begin
              reload first half;
              move forward to beginning of first half
          end
          else /* eof within a buffer half marks the end of the input */
              terminate lexical analysis
      end
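
A C sketch of the sentinel scheme above. The buffer layout, the use of '\0' as the eof sentinel, and the helper fill_buffer are illustrative assumptions, not the book's exact code; it also assumes the first half has already been loaded and forward points at the last character delivered:

      #include <stdio.h>

      #define N 4096                                   /* size of each buffer half */
      static char buf[2 * N + 2];                      /* two halves, each followed by a sentinel slot */
      static char *forward = buf;                      /* the scanning pointer */

      /* Read up to N characters into half[0..N-1] and store a '\0' sentinel
         right after the last character actually read. */
      static void fill_buffer(char *half) {
          size_t n = fread(half, 1, N, stdin);
          half[n] = '\0';
      }

      /* Advance forward by one character, reloading a half when a sentinel is
         hit; returns '\0' when the real end of input is reached. */
      char advance(void) {
          forward++;
          if (*forward == '\0') {                      /* hit some eof sentinel        */
              if (forward == buf + N) {                /* end of the first half        */
                  fill_buffer(buf + N + 1);            /* reload the second half       */
                  forward = buf + N + 1;
              } else if (forward == buf + 2 * N + 1) { /* end of the second half       */
                  fill_buffer(buf);                    /* reload the first half        */
                  forward = buf;
              } else {
                  return '\0';                         /* eof inside a half: real end of input */
              }
          }
          return *forward;
      }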

  22. How to deal with a very long lexeme is a problem in the two-buffer scheme. E.g. DECLARE(ARG1, ARG2, …, ARGn): the analyzer may need to scan well past the argument list before it can tell how to classify DECLARE. Similarly, when a function name is overloaded in C++, one function name represents several functions.

  23. 3.3 Specification of Tokens • Regular expressions are an important notation for specifying token patterns. • We study formal notations for regular expressions. • These expressions are used in lexical-analyzer generators. • Sec. 3.7 shows how to build a lexical analyzer by converting regular expressions to automata.

  24. 1、Regular Definition of Tokens. Tokens are defined by regular expressions. E.g. an identifier can be defined by the regular grammar: Id → letter ( letter | digit )*, letter → A|B|…|Z|a|b|…|z, digit → 0|1|…|9. An identifier can also be expressed by the following regular expression: (A|B|…|Z|a|b|…|z)(A|B|…|Z|a|b|…|z|0|1|…|9)*

  25. Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions will serve as names for sets of strings.

  26. 2、Regular Expression & Regular Language. Regular expression: a notation that allows us to define a pattern at a high level. Regular language: each regular expression r denotes a language L(r) (the set of sentences relating to the regular expression r).

  27. Each token in a program can be expressed in a regular expression

  28. 3、The construction rules of regular expressions over an alphabet Σ: 1) ε is a regular expression that denotes {ε}; ε is a regular expression and {ε} is the related regular language. 2) If a is a symbol in Σ, then a is a regular expression that denotes {a}; a is a regular expression and {a} is the related regular language.

  29. 3) Suppose r and s are regular expressions; then r|s, rs, (r), r*, and s* are also regular expressions, with L(r|s) = L(r) ∪ L(s), L(rs) = L(r)L(s), L((r)) = L(r), and L(r*) = {ε} ∪ L(r) ∪ L(r)L(r) ∪ L(r)L(r)L(r) ∪ …

  30. 4、Algebraic laws of regular expressions: 1) r|s = s|r  2) r|(s|t) = (r|s)|t, r(st) = (rs)t  3) r(s|t) = rs|rt, (s|t)r = sr|tr  4) εr = rε = r  5) (r*)* = r*  6) r* = r+|ε, r+ = rr* = r*r  7) (r|s)* = (r*|s*)* = (r*s*)*

  31. 8) If L(),then = |   = *  = |   =  * Notes: We assume that the precedence of * is the highest, the precedence of | is the lowest and they are left associative

  32. Example: unsigned numbers such as 5280, 39.37, 6.336E4, 1.894E-4
      digit → 0 | 1 | … | 9
      digits → digit digit*
      optional_fraction → . digits | ε
      optional_exponent → ( E ( + | - | ε ) digits ) | ε
      num → digits optional_fraction optional_exponent
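
A small C sketch (not from the slides) of a hand-coded recognizer following the regular definition of num above; the function names are illustrative:

      #include <ctype.h>

      /* Match "digits": at least one digit.  Returns a pointer past the digits,
         or NULL if there is none. */
      static const char *match_digits(const char *s) {
          if (!isdigit((unsigned char)*s)) return NULL;
          while (isdigit((unsigned char)*s)) s++;
          return s;
      }

      /* Returns 1 if the whole string s matches
         num -> digits optional_fraction optional_exponent, 0 otherwise. */
      int matches_num(const char *s) {
          s = match_digits(s);
          if (!s) return 0;
          if (*s == '.') {                    /* optional_fraction -> . digits | ε */
              s = match_digits(s + 1);
              if (!s) return 0;
          }
          if (*s == 'E') {                    /* optional_exponent -> ( E (+|-|ε) digits ) | ε */
              s++;
              if (*s == '+' || *s == '-') s++;
              s = match_digits(s);
              if (!s) return 0;
          }
          return *s == '\0';                  /* the whole string must be consumed */
      }

E.g. matches_num("6.336E4") and matches_num("1.894E-4") return 1, while matches_num("1.") returns 0.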

  33. 5、Notational shorthands a) One or more instances: ( r )+, e.g. digit+  b) Zero or one instance: r? is a shorthand for r|ε, e.g. ( E ( + | - )? digits )?  c) Character classes: [a-z] denotes a|b|c|…|z; similarly [A-Za-z] and [A-Za-z0-9]

  34. 3.4 Recognition of Tokens. 1、The task of token recognition in a lexical analyzer: isolate the lexeme for the next token in the input buffer, and produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page.

  35. 2、Method for recognizing tokens: use transition diagrams.

  36. 3、Transition diagram (stylized flowchart): depicts the actions that take place when a lexical analyzer is called by the parser to get the next token.

  37. Fragment of the transition diagram for relational operators: from start state 0, input > leads to state 6; from state 6, input = leads to accepting state 7, return(relop, GE); any other input leads to accepting state 8*, return(relop, GT). Notes: here we use '*' to mark states on which input retraction must take place.
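
A C sketch of just this fragment (states 0, 6, 7, 8); operating on a string pointer and the token codes are illustrative assumptions, not the book's exact interface:

      /* Recognize > or >= starting at *pp; on success, *pp is left just past
         the recognized operator (state 8 is starred, so the extra character
         read there is not consumed). */
      enum relop { NOT_A_RELOP, RELOP_GT, RELOP_GE };

      enum relop get_gt_or_ge(const char **pp) {
          const char *p = *pp;
          if (*p != '>') return NOT_A_RELOP;   /* state 0: need '>'                    */
          p++;                                 /* state 6                              */
          if (*p == '=') {                     /* state 7: accept, return(relop, GE)   */
              *pp = p + 1;
              return RELOP_GE;
          }
          *pp = p;                             /* state 8*: retract, return(relop, GT) */
          return RELOP_GT;
      }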

  38. 4、Implementing a Transition Diagram Each state gets a segment of code If there are edges leaving a state, then its code reads a character and selects an edge to follow, if possible Use nextchar() to read next character from the input buffer

  39. while (1) {
          switch (state) {
          case 0:
              c = nextchar();
              if (c == blank || c == tab || c == newline) {
                  state = 0;
                  lexeme_beginning++;          /* skip whitespace, move the lexeme start forward */
              }
              else if (c == '<') state = 1;
              else if (c == '=') state = 5;
              else if (c == '>') state = 6;
              else state = fail();
              break;
          case 9:
              c = nextchar();
              if (isletter(c)) state = 10;
              else state = fail();
              break;
          /* ... cases for the remaining states ... */
          }
      }

  40. 5、A generalized transition diagram is a finite automaton, which may be deterministic or non-deterministic. Non-deterministic means that more than one transition out of a state may be possible on the same input symbol.

  41. 6、The model of recognition of tokens: the input buffer (e.g. containing "if d2 = …"), with the lexeme_beginning pointer marking the start of the current lexeme, feeds an FA simulator.

  42. E.g. the FA simulator for identifiers, which represents the rule identifier = letter ( letter | digit )*: from state 1, a letter leads to state 2, and state 2 loops back to itself on letter or digit.

  43. 3.5 Finite Automata. 1、Usage of FA: precisely recognize regular sets; a regular set is the set of sentences relating to a regular expression. 2、Sorts of FA: deterministic FA and non-deterministic FA.

  44. 3、Deterministic FA (DFA). A DFA is a quintuple M = (S, Σ, move, s0, F): S is a set of states; Σ is the input symbol alphabet; move is a transition function mapping from S × Σ to S, move(s, a) = s'; s0 is the start state, s0 ∈ S; F is the set of states distinguished as accepting states, F ⊆ S.

  45. Note: 1) In a DFA, no state has an ε-transition; 2) in a DFA, for each state s and input symbol a, there is at most one edge labeled a leaving s; 3) to describe an FA, we use a transition graph or a transition table; 4) a DFA accepts an input string x if and only if there is a path labeled x in the transition graph from the start state to some accepting state.

  46. E.g. DFA M = ({0,1,2,3}, {a,b}, move, 0, {3}) with move(0,a)=1, move(0,b)=2, move(1,a)=3, move(1,b)=2, move(2,a)=1, move(2,b)=3, move(3,a)=3, move(3,b)=3.
      Transition table:
          state |  a  |  b
            0   |  1  |  2
            1   |  3  |  2
            2   |  1  |  3
            3   |  3  |  3
      The transition graph presents the same move function as a labeled directed graph.
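
A C sketch of a table-driven simulator for the DFA M above; encoding column 0 as 'a' and column 1 as 'b' is an illustrative choice:

      /* Table-driven simulation of M = ({0,1,2,3}, {a,b}, move, 0, {3}). */
      static const int move_tab[4][2] = {   /* columns: 0 = 'a', 1 = 'b' */
          {1, 2},   /* state 0 */
          {3, 2},   /* state 1 */
          {1, 3},   /* state 2 */
          {3, 3},   /* state 3 */
      };

      /* Returns 1 if the a/b string s is accepted by M, 0 otherwise. */
      int dfa_accepts(const char *s) {
          int state = 0;                              /* start state s0 = 0   */
          for (; *s != '\0'; s++) {
              if (*s != 'a' && *s != 'b') return 0;   /* not in the alphabet  */
              state = move_tab[state][*s == 'b'];
          }
          return state == 3;                          /* F = {3}              */
      }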

  47. E.g. Construct a DFA M that accepts the strings over {a, b, c} which begin with a or b, or begin with c and contain at most one a. Please write a C++ function to implement the DFA. (The slide's transition graph, with states 0, 1, 2, 3 and edges labeled a, b, c, is not fully reproducible here.)
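
Not the required answer, only a hedged sketch built from one reading of the stated language (strings over {a, b, c} that begin with a or b, or begin with c and contain at most one a); the state numbering is an assumption and need not match the slide's diagram. The exercise asks for C++; this compiles as both C and C++:

      /* States (illustrative numbering):
         0: start
         1: began with a or b        -> accepting, loops on a, b, c
         2: began with c, no a yet   -> accepting, a leads to state 3
         3: began with c, one a seen -> accepting, another a rejects */
      int exercise_dfa_accepts(const char *s) {
          int state = 0;
          for (; *s != '\0'; s++) {
              char c = *s;
              if (c != 'a' && c != 'b' && c != 'c') return 0;   /* not in the alphabet */
              switch (state) {
              case 0: state = (c == 'c') ? 2 : 1; break;
              case 1: /* stay in state 1 */       break;
              case 2: if (c == 'a') state = 3;    break;
              case 3: if (c == 'a') return 0;     break;
              }
          }
          return state != 0;    /* every non-start state reached here is accepting */
      }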
