Lecture 4 Lexical Analysis Xiaoyin Wang CS5363 Programming Languages and Compilers
Last Class • Finite State Machines • From Grammar to FSM • DFA vs. NFA
Today’s Class • Scanner Concepts • Tokens • Basic Strategy • Implementation • Lexical Errors
Scanner Example • Input text • // this statement does very little • if (x >= y) y = 42; • Token Stream • IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON • Note: tokens are atomic items, not character strings
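The slide's example can be reproduced with a few lines of Python. This is only a sketch: it uses Python's `re` module rather than the finite state machines built later in the lecture, the token names are copied from the slide, and the patterns for identifiers and integers are assumptions.

```python
import re

# Token names follow the slide; the regular expressions are assumptions.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WS",      r"\s+"),
    ("IF",      r"if\b"),
    ("INT",     r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("GEQ",     r">="),
    ("BECOMES", r"="),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SCOLON",  r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (token, attribute) pairs; comments and blanks are dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ID":
            tokens.append((kind, lexeme))        # attribute: the name
        elif kind == "INT":
            tokens.append((kind, int(lexeme)))   # attribute: the value
        elif kind not in ("COMMENT", "WS"):
            tokens.append((kind, None))          # identity of the token is enough
        pos = m.end()
    return tokens

for token in scan("// this statement does very little\nif (x >= y) y = 42;"):
    print(token)
```

Each token is a pair rather than a string, which is the point of the slide's note: the parser compares token kinds, not characters.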
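Lexical Analyzer in Perspective • Diagram: source program → lexical analyzer → parser; the parser asks the lexical analyzer to “get next token” and receives a token; both consult the symbol table • Important issue: what are the responsibilities of each box? • Focus on the lexical analyzer and the parser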
The Input • Read string input • Might be sequence of characters (Unix) • Might be sequence of lines (VMS) • Character set • ASCII • ISO Latin-1 • ISO 10646 (Unicode) • Others (EBCDIC, JIS, etc.)
The Output • A series of tokens • Punctuation ( ) ; , [ ] • Operators + - ** := • Keywords begin end if • Identifiers Square_Root • String literals “hello this is a string” • Character literals ‘x’ • Numeric literals 123 4_5.23e+2 16#ac#
Lexical Analysis: Terminology • token: a name for a set of input strings with related structure. Example: “identifier,” “integer constant” • pattern: a rule describing the set of strings associated with a token. Example: “a letter followed by zero or more letters, digits, or underscores.” • lexeme: the actual input string that matches a pattern. Example: count
Examples • Input: count = 123 • Tokens: • identifier. Rule: “letter followed by …” Lexeme: count • assg_op. Rule: = Lexeme: = • integer_const. Rule: “digit followed by …” Lexeme: 123
Attributes for Tokens • If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched. • This information is given using an attribute associated with the token. • Example: the program statement count = 123 yields the following token-attribute pairs: • identifier, pointer to the string “count”, … • assg_op, … • integer_const, the integer value 123, …
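A token-attribute pair can be pictured as a small record. The sketch below is illustrative only: the `Token` class and `classify` helper are hypothetical names, and the identifier's attribute is the string itself, standing in for the slide's "pointer to the string".

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Token:
    kind: str                                   # token name, e.g. "identifier"
    attribute: Optional[Union[int, str]] = None # extra information, if any

def classify(lexeme: str) -> Token:
    """Map one already-separated lexeme to a token-attribute pair."""
    if lexeme.isdigit():
        return Token("integer_const", int(lexeme))   # attribute: the value
    if lexeme == "=":
        return Token("assg_op")                      # only one lexeme: no attribute
    if lexeme[:1].isalpha() and all(c.isalnum() or c == "_" for c in lexeme):
        return Token("identifier", lexeme)           # attribute: which identifier
    raise ValueError(f"no pattern matches {lexeme!r}")

print([classify(lexeme) for lexeme in "count = 123".split()])
```

`assg_op` carries no attribute because exactly one lexeme matches its pattern; the other two tokens would be ambiguous without one.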
Today’s Class • Scanner Concepts • Tokens • Basic Strategy • Implementation • Lexical Errors
Introducing Basic Terminology • Table columns: Token | Sample Lexemes | Informal Description of Pattern • const | const | const • if | if | if • relation | <, <=, =, <>, >, >= | < or <= or = or <> or >= or > • id | pi, count, D2 | letter followed by letters and digits • num | 3.1416, 0, 6.02E23 | any numeric constant • literal | “core dumped” | any characters between “ and “ except “ • The token classifies the pattern • Actual values are critical. Info is: 1. Stored in symbol table 2. Returned to parser
Punctuation • Typically individual special characters • Such as ‘(‘, ‘)’ • Sometimes double characters • E.g. (* treated as a kind of bracket • Returned just as identity of token • And perhaps location: For error message and debugging purposes
Operators • Like punctuation • No real difference for lexical analyzer • Typically single or double special chars • Operators + - • Operations := • Returned just as identity of token • And perhaps location
Keywords • Reserved identifiers • E.g. BEGIN END in Pascal, if in C • Returned just as token identity • With possible location information • Unreserved keywords (e.g. PL/1) • Handled as identifiers (parser distinguishes)
Identifiers • Rules differ • Length, allowed characters, separators • Need to build table • So that a1 is recognized as a1 • Typical structure: hash table • Lexical analyzer returns token type • And key to table entry • Table entry includes location information
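A minimal sketch of such an identifier table, assuming a Python `dict` as the hash table. The class and method names are hypothetical, and the entry layout (name plus first location) is just one possible choice.

```python
class SymbolTable:
    """Maps each distinct identifier to a small integer key."""

    def __init__(self):
        self._index = {}     # lexeme -> key (dict is a hash table)
        self._entries = []   # key -> table entry

    def intern(self, lexeme, line, column):
        """Return the key for lexeme, creating an entry on first sight."""
        key = self._index.get(lexeme)
        if key is None:
            key = len(self._entries)
            self._index[lexeme] = key
            self._entries.append({"name": lexeme, "first_seen": (line, column)})
        return key

    def entry(self, key):
        return self._entries[key]

table = SymbolTable()
first = table.intern("a1", line=1, column=1)
other = table.intern("b", line=1, column=6)
again = table.intern("a1", line=2, column=3)
print(first == again, first == other)
```

The scanner would return the token type together with the key, so later phases compare integers instead of strings.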
Numeric Literals • Also need a table • Typically record value • E.g. 123 = 0123 = 01_23 (Ada) • But usually do not use int for values • Because may have different characteristics • Float stuff much more complex • Denormal numbers, correct rounding • Very delicate stuff
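The "record the value, not the spelling" point can be sketched for integer literals as follows. The function name is hypothetical, only a simplified subset of Ada's syntax is handled (underscores anywhere, based literals of the form base#digits#, no exponents), and floating-point literals are deliberately left out. Python integers are arbitrary precision, so the value does not depend on the host machine's int.

```python
def integer_literal_value(lexeme: str) -> int:
    """Value of an Ada-style integer literal (simplified; see assumptions above)."""
    text = lexeme.replace("_", "")          # underscores are only visual separators
    if "#" in text:
        base_text, digits, _tail = text.split("#")   # e.g. "16#ac#" -> "16", "ac", ""
        return int(digits, int(base_text))
    return int(text)                        # leading zeros do not change the value

print(integer_literal_value("123"),
      integer_literal_value("0123"),
      integer_literal_value("01_23"),
      integer_literal_value("16#ac#"))
```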
String Literals • Text must be stored • Actual characters are important • Not like identifiers • Character set issues • Table needed • Lexical analyzer returns key to table • May or may not be worth hashing
Character Literals • Similar issues to string literals • Lexical Analyzer returns • Token type • Identity of character • Note, cannot assume character set of host machine, may be different
Handling Comments • Comments have no effect on program • Can therefore be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner does not return comments
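A sketch of comment elimination with detection of an unclosed comment. The `//` line comment matches the earlier scanner example; the `/* ... */` block form is an assumption for illustration, and string literals containing comment markers are not handled here.

```python
class LexicalError(Exception):
    pass

def strip_comments(text: str) -> str:
    """Return text with comments removed; raise on an unclosed block comment."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("//", i):
            end = text.find("\n", i)
            i = len(text) if end == -1 else end      # keep the newline itself
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            if end == -1:
                raise LexicalError(f"unclosed comment starting at offset {i}")
            i = end + 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(repr(strip_comments("a /* x */ b // c\nd")))
```

A tool that needs the comments (a pretty-printer, say) would collect them in a side list instead of discarding them.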
Case Sensitiveness • Some languages have case equivalence • Pascal, Ada • Some do not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing
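One way to satisfy both needs, sketched under the assumption that lower-casing is an adequate notion of case equivalence: fold the case for table lookup, but keep the spelling as written for error messages. The function name is hypothetical.

```python
def identifier_key(lexeme: str, case_sensitive: bool):
    """Return (lookup key, original spelling) for an identifier lexeme."""
    key = lexeme if case_sensitive else lexeme.lower()
    return key, lexeme

# Case-equivalent language (e.g. Pascal, Ada): both spellings share one key.
print(identifier_key("This_Routine", False)[0] == identifier_key("THIS_RouTine", False)[0])
# Case-sensitive language (e.g. C, Java): the keys differ.
print(identifier_key("This_Routine", True)[0] == identifier_key("THIS_RouTine", True)[0])
```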
Today’s Class • Scanner Concepts • Tokens • Basic Strategy • Implementation • Lexical Errors
Basic Strategy • Parse input to generate the token stream • Model the whole language as a regular language • Parse the input!
Basic Strategy • Each token can be modeled with a finite state machine • Language = Token*
Example of Token Models • Number literal (state diagram)
Combine All Models Token* ...
Basic Strategy • However, this will not work … • The reason • The grammar is ambiguous • One token can be prefix of the other • ‘abc’ can be • ‘a’ and ‘bc’ | ‘ab’ and ‘c’ | ‘abc’ • ‘<=’ can be • ‘<’ and ‘=’ | ‘<=’
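The ambiguity can be made concrete by enumerating every way to split an input into tokens. In this sketch the token set is reduced to three fixed lexemes for the ‘<=’ case; the function name is hypothetical.

```python
def all_tokenizations(text, lexemes):
    """Return every way to write text as a sequence of the given lexemes."""
    if not text:
        return [[]]                      # one way to tokenize nothing: no tokens
    results = []
    for lexeme in lexemes:
        if text.startswith(lexeme):
            for rest in all_tokenizations(text[len(lexeme):], lexemes):
                results.append([lexeme] + rest)
    return results

print(all_tokenizations("<=", ["<", "=", "<="]))
```

Two results come back for a two-character input, so Token* alone does not determine the token stream.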
Basic Strategy • The rule to remove ambiguity: two candidates • Always identify the shortest token: the basic strategy works unchanged, but programs are hard to write • Always identify the longest token: the usual choice, but we need to revise the basic strategy • Add a ‘backspace’ operation after finding a token (the scanner only knows a token has ended after reading one character past it)
Combine All Models with Backspace Token* ...
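The longest-token rule can be sketched by trying every token pattern at the current position and keeping the one that consumes the most input. This is an illustration with Python's `re` module, not the combined state machine of the slide; the token names and patterns are assumptions.

```python
import re

# Hypothetical token set; note that "<" is a prefix of "<=".
PATTERNS = [
    ("ID",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("NUM",    re.compile(r"[0-9]+")),
    ("LE",     re.compile(r"<=")),
    ("LT",     re.compile(r"<")),
    ("ASSIGN", re.compile(r"=")),
]

def longest_match_scan(text):
    """Tokenize text, always taking the longest token that matches."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:      # strictly longer wins
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"no token starts at offset {pos}")
        tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens

print(longest_match_scan("abc<=12"))
print(longest_match_scan("a < = b"))
```

With this rule ‘abc’ is always one identifier and ‘<=’ is always one operator; the programmer separates ‘<’ from ‘=’ with a blank when two tokens are intended.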
An Example • State diagram (repeated on every frame of the animation): start state q0, states q1 to q7, and qx, which leads back to q0 • Edge labels, written input/output: • 1-9, then a loop on 0-9, then Non-digit/number • 0, then Non-digit/0 • >, then =, then any/>= • >, then other/> • A-Z, then a loop on A-Z0-9, then other/id • (, then any/( • ), then any/) • Tokens emitted as the animation advances: IF • ( • id: A2 • >= • Num: 35 • )
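The walk-through can be approximated in code. The sketch below follows the edge labels of the diagram (1-9, 0-9, Non-digit/number, A-Z, other/id, and so on) but uses descriptive state names instead of q0 to q7, since the numbering of individual states is not recoverable here. Several points are assumptions: the input string "IF (A2 >= 35)", the keyword lookup that turns the identifier IF into a keyword token, and emitting ( ) and >= immediately rather than reading one more character and backing up.

```python
DIGITS = "0123456789"
KEYWORDS = {"IF"}   # assumption: keywords are recognized by lookup after the id path

def run_fsm(text):
    """Scan text with an explicit state variable and a 'backspace' on exit edges."""
    tokens, state, lexeme = [], "start", ""
    text += "\0"    # sentinel: guarantees the last token sees an "other" character
    pos = 0
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if state == "start":
            lexeme = ch
            if ch in "123456789":
                state = "number"
            elif ch == "0":
                state = "zero"
            elif "A" <= ch <= "Z":
                state = "ident"
            elif ch == ">":
                state = "gt"
            elif ch in "()":
                tokens.append((ch, None))
            elif not (ch.isspace() or ch == "\0"):
                raise ValueError(f"unexpected character {ch!r}")
        elif state in ("number", "ident"):
            allowed = DIGITS if state == "number" else DIGITS + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            if ch in allowed:
                lexeme += ch
            else:
                if state == "number":
                    tokens.append(("num", lexeme))
                elif lexeme in KEYWORDS:
                    tokens.append((lexeme, None))
                else:
                    tokens.append(("id", lexeme))
                pos -= 1                      # backspace: re-read ch from the start state
                state = "start"
        elif state == "zero":
            if ch in DIGITS:
                raise ValueError("a number may not start with 0")
            tokens.append(("num", "0"))
            pos -= 1
            state = "start"
        elif state == "gt":
            if ch == "=":
                tokens.append((">=", None))
            else:
                tokens.append((">", None))
                pos -= 1
            state = "start"
    return tokens

print(run_fsm("IF (A2 >= 35)"))
```

The `pos -= 1` lines are the 'backspace' operation: the character that ended the token is not part of it, so it is put back and scanned again from the start state.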
Transition diagrams Transition diagram for relop
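Since the figure itself is not reproduced here, the following is a sketch of what a relop recognizer of this kind typically does, using the six relation lexemes listed earlier (<, <=, =, <>, >, >=). The function name and the attribute names LT, LE, NE, EQ, GT, GE are assumptions.

```python
def scan_relop(text, pos):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    ch = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if ch == "<":
        if nxt == "=":
            return ("relop", "LE"), pos + 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        return ("relop", "LT"), pos + 1     # "other": the lookahead is not consumed
    if ch == "=":
        return ("relop", "EQ"), pos + 1
    if ch == ">":
        if nxt == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1     # "other": the lookahead is not consumed
    return None

for source in ["<", "<=", "<>", "=", ">", ">=", "x"]:
    print(source, scan_relop(source, 0))
```

All six lexemes map to the single token relop; the attribute records which one was seen.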
Transition diagrams (cont.) Transition diagram for reserved words and identifiers
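A common way to share one diagram between reserved words and identifiers is to pre-load the reserved words into the table and look every scanned word up there. The sketch below assumes that approach; the function names are hypothetical and the reserved words const and if are taken from the earlier token table.

```python
def make_table(reserved=("const", "if")):
    """Pre-load reserved words: lexeme -> token name."""
    return {word: word for word in reserved}

def scan_word(text, pos, table):
    """Scan a letter followed by letters and digits; classify it via the table."""
    if pos >= len(text) or not text[pos].isalpha():
        return None
    end = pos + 1
    while end < len(text) and text[end].isalnum():
        end += 1
    lexeme = text[pos:end]
    kind = table.setdefault(lexeme, "id")   # unseen words are installed as identifiers
    return (kind, lexeme), end

table = make_table()
print(scan_word("if x", 0, table))
print(scan_word("count1 = 2", 0, table))
print(scan_word("iffy", 0, table))
```

Because the whole word is scanned before the lookup, iffy is an identifier even though it begins with a reserved word.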