
Lexical Analysis


Presentation Transcript


  1. Lexical Analysis Compiler Baojian Hua bjhua@ustc.edu.cn

  2. Compiler — diagram: source program → compiler → target program

  3. Front and Back Ends — diagram: source program → front end → IR → back end → target program

  4. Front End — diagram: source code → lexical analyzer → tokens → parser → abstract syntax tree → semantic analyzer → IR

  5. Lexical Analyzer • The lexical analyzer translates the source program into a stream of lexical tokens • Source program: • stream of characters • vary from language to language (ASCII or Unicode, or …) • Lexical token: • compiler internal data structure that represents the occurrence of a terminal symbol • vary from compiler to compiler

  6. Conceptually — diagram: character sequence → lexical analyzer → token sequence

  7. Example • Recall the min-ML language in “code3”:
     prog -> decs
     decs -> dec; decs
           |
     dec  -> val id = exp
           | val _ = printInt exp
     exp  -> id | num | exp + exp | true | false
           | if (exp) then exp else exp
           | (exp)

  8. Example
     Source program:
       val x = 3;
       val y = 4;
       val z = if (2) then (x) else y;
       val _ = printInt z;
     After lexical analysis:
       VAL IDENT(x) ASSIGN INT(3) SEMICOLON
       VAL IDENT(y) ASSIGN INT(4) SEMICOLON
       VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
       VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON
       EOF
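
To make the token stream above concrete, here is a minimal SML sketch (assumed for illustration, not taken from the accompanying code) of how such tokens could be represented:

    datatype token
      = VAL | IDENT of string | ASSIGN | INT of int | SEMICOLON
      | IF | THEN | ELSE | LPAREN | RPAREN
      | UNDERSCORE | PRINTINT | EOF

    (* the program above then lexes to the list
       [VAL, IDENT "x", ASSIGN, INT 3, SEMICOLON, VAL, IDENT "y", ...] *)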

  9. Lexer Implementation • Options: • Write a lexer by hand from scratch • boring, error-prone, and too much work • see the Dragon Book, Section 3.4 • Automatic lexer generator • quick and easy

  10. Lexer Implementation — diagram: declarative specification → lexical analyzer

  11. Regular Expressions • How to specify a lexer? • Develop another language • Regular expressions • What’s a lexer-generator? • Another compiler…

  12. Basic Definitions • Alphabet: the character set (say ASCII or Unicode) • String: a finite sequence of characters from the alphabet • Language: a set of strings • finite or infinite • say the C language

  13. Regular Expression (RE) • Construction by induction • each c \in alphabet denotes {c} • e.g., a denotes {a} • the empty string \eps denotes {\eps} • for M and N, M|N (alternation) • (a|b) = {a, b} • for M and N, MN (concatenation) • (a|b)(c|d) = {ac, ad, bc, bd} • for M, M* (Kleene closure) • (a|b)* = {\eps, a, aa, b, ab, abb, baa, …}
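
A minimal SML datatype mirroring this inductive definition (constructor names are assumed for illustration):

    datatype re
      = Empty              (* {} : the empty language     *)
      | Eps                (* the empty string \eps       *)
      | Char of char       (* a single character c        *)
      | Alt of re * re     (* M | N                       *)
      | Cat of re * re     (* M N                         *)
      | Star of re         (* M* (Kleene closure)         *)

    (* (a|b)(c|d) is written
       Cat (Alt (Char #"a", Char #"b"), Alt (Char #"c", Char #"d")) *)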

  14. Regular Expression • Or more formally: e -> {} | \eps | c | e e | e “|” e | e*

  15. Example • C’s identifier: • starts with a letter (“_” counts as a letter) • followed by zero or more letters or digits • (…) (…) → (_|a|b|…|z|A|B|…|Z) (…) → (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9) → (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)* • It’s really error-prone and tedious…

  16. Syntax Sugar • More syntax sugar: • [a-z] == a|b|…|z • e+ == one or more occurrences of e • e? == zero or one occurrence of e • “a*” == the literal string a* itself (quoting turns off the operators) • e{i, j} == at least i and at most j occurrences of e • . == any char except \n • All of these can be translated into the core RE
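
As a sketch of how the sugar reduces to the core RE (reusing the re datatype assumed after slide 13):

    fun plus e = Cat (e, Star e)     (* e+ == e e*       *)
    fun opt  e = Alt (e, Eps)        (* e? == e | \eps   *)

    (* [a-z] == a|b|...|z ; assumes lo <= hi *)
    fun range (lo, hi) =
      if lo = hi then Char lo
      else Alt (Char lo, range (Char.chr (Char.ord lo + 1), hi))

    (* e{i, j} == at least i and at most j copies of e *)
    fun repeat (e, 0, 0) = Eps
      | repeat (e, 0, j) = Cat (opt e, repeat (e, 0, j - 1))
      | repeat (e, i, j) = Cat (e, repeat (e, i - 1, j - 1))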

  17. Example Revisited • C’s identifier: • starts with a letter (“_” counts as a letter) • followed by zero or more letters or digits • (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)* == [_a-zA-Z][_a-zA-Z0-9]* • What about the keyword “if”?

  18. Ambiguous Rule • A single RE is not ambiguous • But a lexical specification contains many REs • [_a-zA-Z][_a-zA-Z0-9]* • “if” • So, for a given string, which RE should match?

  19. Ambiguous Rule • Two conventions: • Longest match: The regular expression that matches the longest string takes precedence. • Rule Priority: The regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
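
A small SML sketch of this disambiguation (the representation of rules is hypothetical): each rule reports how long a prefix it can match, and the lexer keeps the longest one, breaking ties in favour of the rule written first.

    (* rules : (string -> int option) list, in the order they were written;
       each returns the length of the longest prefix it matches, if any *)
    fun pick (rules : (string -> int option) list) (input : string) =
      let
        fun best (_, [], acc) = acc
          | best (i, r :: rs, acc) =
              (case (r input, acc) of
                 (NONE, _) => best (i + 1, rs, acc)
               | (SOME len, NONE) => best (i + 1, rs, SOME (i, len))
               | (SOME len, SOME (_, bestLen)) =>
                   if len > bestLen                     (* longest match *)
                   then best (i + 1, rs, SOME (i, len))
                   else best (i + 1, rs, acc))          (* rule priority *)
      in best (0, rules, NONE)   (* SOME (winning rule index, match length) *)
      end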

  20. Lexer Generator History • Lexical analysis was once a performance bottleneck • certainly not true today! • As a result, early research investigated methods for efficient lexical analysis • While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use

  21. History: A long-standing goal • In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler) — diagram: declarative compiler specification → compiler

  22. History: Unix and C • In the late 1960s and early 1970s at Bell Labs, Thompson, Ritchie, and others were developing Unix • A key part of this project was the development of C and a compiler for it • Johnson and others, in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968] • read the accompanying paper on the course page • Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers

  23. The Lex tool • The original Lex generated lexers written in C (C in C) • Today every major language has its own lex tool(s): • sml-lex, ocamllex, JLex, C#lex, … • Our topic next: • sml-lex • concepts and techniques apply to other tools

  24. SML-Lex Specification • Lexical specification consists of 3 parts (yet another programming language):
     User Declarations (plain SML types, values, functions)
     %%
     SML-Lex Definitions (RE abbreviations, special stuff)
     %%
     Rules (association of REs with tokens; each token will be represented in plain SML)

  25. User Declarations • User Declarations: • User can define various values that are available to the action fragments. • Two values must be defined in this section: • type lexresult • type of the value returned by each rule action. • fun eof () • called by lexer when end of input stream is reached. (EOF)

  26. SML-Lex Definitions • ML-Lex Definitions: • User can define regular expression abbreviations: digits = [0-9]+; letter = [a-zA-Z]; • Define multiple lexers to work together. Each is given a unique name: %s lex1 lex2 lex3;

  27. Rules • Rules: • A rule consists of a pattern and an action: • Pattern is a regular expression. • Action is a fragment of ordinary SML code. • Longest match & rule priority are used for disambiguation • Rules may be prefixed with the list of lexers that are allowed to use the rule: <lexerList> regularExp => (action) ;

  28. Rules • Rule actions can use any value defined in the User Declarations section, including • type lexresult • type of value returned by each rule action • val eof : unit -> lexresult • called by lexer when end of input stream reached • special variables: • yytext: input substring matched by regular expression • yypos: file position of the beginning of matched string • continue (): doesn’t return token; recursively calls lexer

  29. Example #1 (* A language called Toy *)
     prog   -> word prog
     prog   ->
     word   -> symbol
     word   -> number
     symbol -> [_a-zA-Z][_0-9a-zA-Z]*
     number -> [0-9]+

  30. Example #1 (* Lexer Toy, see the accompanying code for details *)

     datatype token
       = Symbol of string * int
       | Number of string * int
     exception End
     type lexresult = unit
     fun eof () = raise End
     fun output x = …;
     %%
     letter = [_a-zA-Z];
     digit  = [0-9];
     ld     = {letter}|{digit};
     symbol = {letter} {ld}*;
     number = {digit}+;
     %%
     <INITIAL>{symbol} => (output (Symbol (yytext, yypos)));
     <INITIAL>{number} => (output (Number (yytext, yypos)));

  31. Example #2 (* Expression language with C-style comments, i.e. /* … */ *)
     prog -> stms
     stms -> stm; stms
     stms ->
     stm  -> id = e
     stm  -> print e
     e    -> id
     e    -> num
     e    -> e bop e
     e    -> (e)
     bop  -> + | - | * | /

  32. Sample Program
     x = 4;
     y = 5;
     z = x+y*3;
     print z;

  33. Example #2 (* All terminals *)
     prog -> stms
     stms -> stm; stms
     stms ->
     stm  -> id = e
     stm  -> print e
     e    -> id
     e    -> num
     e    -> e bop e
     e    -> (e)
     bop  -> + | - | * | /

  34. Example #2 in Lex (* Expression language, see the accompanying code
      * for details.
      * Part 1: user code *)

     datatype token
       = Id of string * int
       | Number of string * int
       | Print of string * int
       | Plus of string * int
       | … (* all the other tokens *)
     exception End
     type lexresult = unit
     fun eof () = raise End
     fun output x = …;

  35. Example #2 in Lex, cont’ (* Expression language, see the accompanying code
      * for details.
      * Part 2: lex definitions *)

     %%
     letter = [_a-zA-Z];
     digit  = [0-9];
     ld     = {letter}|{digit};
     sym    = {letter} {ld}*;
     num    = {digit}+;
     ws     = [\ \t];
     nl     = [\n];

  36. Example #2 in Lex, cont’ (* Expression language, see the accompanying code
      * for details.
      * Part 3: rules *)

     %%
     <INITIAL>{ws} => (continue ());
     <INITIAL>{nl} => (continue ());
     <INITIAL>"+"  => (output (Plus   (yytext, yypos)));
     <INITIAL>"-"  => (output (Minus  (yytext, yypos)));
     <INITIAL>"*"  => (output (Times  (yytext, yypos)));
     <INITIAL>"/"  => (output (Divide (yytext, yypos)));
     <INITIAL>"("  => (output (Lparen (yytext, yypos)));
     <INITIAL>")"  => (output (Rparen (yytext, yypos)));
     <INITIAL>"="  => (output (Assign (yytext, yypos)));
     <INITIAL>";"  => (output (Semi   (yytext, yypos)));

  37. Example #2 in Lex, cont’ (* Expression language, see the accompanying code
      * for details.
      * Part 3: rules, cont’ *)

     <INITIAL>"print" => (output (Print  (yytext, yypos)));
     <INITIAL>{sym}   => (output (Id     (yytext, yypos)));
     <INITIAL>{num}   => (output (Number (yytext, yypos)));
     <INITIAL>"/*"    => (YYBEGIN COMMENT; continue ());
     <COMMENT>"*/"    => (YYBEGIN INITIAL; continue ());
     <COMMENT>{nl}    => (continue ());
     <COMMENT>.       => (continue ());
     <INITIAL>.       => (error (…));

  38. Lex Implementation • Lex accepts regular expressions (along with other directives) • So SML-Lex is a compiler from REs to a lexer • Internally: RE → NFA → DFA → table-driven algorithm
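
A sketch of the table-driven core such a generated lexer runs (table layout and names are assumptions of this sketch, using the Basis Array2 structure): the DFA transition table is indexed by state and character code, and the driver keeps stepping while remembering the last accepting position, which yields the longest match.

    (* trans  : int Array2.array, indexed by (state, char code), ~1 = stuck;
       accept : bool Array.array, true for accepting states *)
    fun scan (trans, accept) (input, start) =
      let
        fun step (state, pos, lastAccept) =
          let
            val lastAccept =
              if Array.sub (accept, state) then SOME pos else lastAccept
          in
            if pos >= String.size input then lastAccept
            else
              let val next = Array2.sub (trans, state,
                                         Char.ord (String.sub (input, pos)))
              in if next < 0 then lastAccept
                 else step (next, pos + 1, lastAccept)
              end
          end
      in
        step (0, start, NONE)   (* end position of the longest match, if any *)
      end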

  39. Finite-state Automata (FA) — diagram: input string → M → {Yes, No}
     M = (Σ, S, q0, F, δ)
     Σ:  input alphabet
     S:  state set
     q0: initial state
     F:  final states
     δ:  transition function

  40. Transition functions • DFA: δ : S × Σ → S • NFA: δ : S × Σ → ℘(S) (sets of states)
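
Expressed as SML types (a sketch, with states represented as ints and sets as lists):

    type state = int
    type dfaTrans = state * char -> state         (* δ : S × Σ → S     *)
    type nfaTrans = state * char -> state list    (* δ : S × Σ → ℘(S),
                                                     a set as a list   *)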

  41. DFA example — diagram: states 0, 1, 2 with edges 0 –a→ 1, 0 –b→ 0, 1 –a→ 2, 1 –b→ 1, 2 –a,b→ 2
     • Which strings of a's and b's are accepted?
     • Transition function:
       { (q0,a)→q1, (q0,b)→q0, (q1,a)→q2, (q1,b)→q1, (q2,a)→q2, (q2,b)→q2 }
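
A direct SML transcription of this DFA (a sketch; the transcript does not show which state is final, so state 2 is assumed to be the accepting one, i.e. the machine accepts strings containing at least two a's):

    fun delta (0, #"a") = 1
      | delta (0, #"b") = 0
      | delta (1, #"a") = 2
      | delta (1, #"b") = 1
      | delta (2, _)    = 2
      | delta _         = raise Fail "input not in {a, b}"

    (* assuming state 2 is the only final state *)
    fun accepts s =
      List.foldl (fn (c, q) => delta (q, c)) 0 (String.explode s) = 2

    (* accepts "ba" = false;  accepts "abab" = true *)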

  42. NFA example — diagram: states 0 and 1 with edges matching the transition function below
     • Transition function:
       { (q0,a)→{q0,q1}, (q0,b)→{q1}, (q1,a)→∅, (q1,b)→{q0,q1} }

  43. RE -> NFA: Thompson algorithm • Break the RE down to atoms • construct small NFAs directly for the atoms • inductively construct larger NFAs from the small NFAs • Easy to implement • a small recursive algorithm

  44. RE -> NFA: Thompson algorithm
     e -> ε | c | e1 e2 | e1 | e2 | e1*
     [NFA fragments for the first three cases: a single ε-edge for ε, a single c-edge for c, and for e1 e2 the machine for e1 linked to the machine for e2 by an ε-edge]

  45. RE -> NFA: Thompson algorithm
     e -> ε | c | e1 e2 | e1 | e2 | e1*
     [NFA fragments for the remaining cases: for e1 | e2, a new start state with ε-edges into the machines for e1 and e2, and ε-edges from both into a new final state; for e1*, a new start and final state with ε-edges that allow the machine for e1 to be skipped or repeated]
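
A compact SML sketch of the construction (representation assumed: an NFA is a start state, a single final state, and a list of edges labelled with a character or ε; it reuses the re datatype assumed after slide 13):

    type state = int
    datatype label = E | C of char          (* E is an ε edge *)
    type nfa = { start : state, final : state,
                 edges : (state * label * state) list }

    local val counter = ref 0
    in fun fresh () = (counter := !counter + 1; !counter) end

    fun thompson Eps : nfa =
          let val (s, f) = (fresh (), fresh ())
          in { start = s, final = f, edges = [(s, E, f)] } end
      | thompson (Char c) =
          let val (s, f) = (fresh (), fresh ())
          in { start = s, final = f, edges = [(s, C c, f)] } end
      | thompson Empty =
          let val (s, f) = (fresh (), fresh ())
          in { start = s, final = f, edges = [] } end
      | thompson (Cat (e1, e2)) =
          let val (m1, m2) = (thompson e1, thompson e2)
          in { start = #start m1, final = #final m2,
               edges = (#final m1, E, #start m2) :: #edges m1 @ #edges m2 } end
      | thompson (Alt (e1, e2)) =
          let val (m1, m2) = (thompson e1, thompson e2)
              val (s, f) = (fresh (), fresh ())
          in { start = s, final = f,
               edges = [(s, E, #start m1), (s, E, #start m2),
                        (#final m1, E, f), (#final m2, E, f)]
                       @ #edges m1 @ #edges m2 } end
      | thompson (Star e1) =
          let val m = thompson e1
              val (s, f) = (fresh (), fresh ())
          in { start = s, final = f,
               edges = [(s, E, #start m), (s, E, f),
                        (#final m, E, #start m), (#final m, E, f)]
                       @ #edges m } end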

  46. Example
     %%
     letter = [_a-zA-Z];
     digit  = [0-9];
     id     = {letter} ({letter}|{digit})* ;
     %%
     <INITIAL>"if" => (IF (yytext, yypos));
     <INITIAL>{id} => (Id (yytext, yypos));
     (* Equivalent to: "if" | {id} *)

  47. Example
     <INITIAL>"if" => (IF (yytext, yypos));
     <INITIAL>{id} => (Id (yytext, yypos));
     [NFA built from the two rules: ε-edges from a common start state into the machine for "if" (an i-edge followed by an f-edge) and into the machine for {id}]

  48. NFA -> DFA: Subset construction algorithm
     (* subset construction: workList algorithm *)
     q0 <- ε-closure (n0)
     Q  <- {q0}
     workList <- {q0}
     while (workList != ∅)
       remove q from workList
       foreach (character c)
         t <- ε-closure (move (q, c))
         D[q, c] <- t
         if (t ∉ Q)
           add t to Q and workList
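
A sketch of the same workList algorithm in SML, building on the Thompson sketch after slide 45 (edges, the C label) and the epsClosure function sketched after slide 50; it assumes epsClosure returns a sorted, duplicate-free list so that DFA states, which are sets of NFA states, can be compared with plain list equality:

    fun subsetConstruct (alphabet : char list, edges, n0) =
      let
        (* move (q, c): NFA states reachable from the set q on character c *)
        fun move (q, c) =
          List.mapPartial
            (fn (s, C c', t) =>
                  if c' = c andalso List.exists (fn x => x = s) q
                  then SOME t else NONE
              | _ => NONE)
            edges
        fun loop ([], states, trans) = (states, trans)
          | loop (q :: work, states, trans) =
              let
                fun step (c, (work, states, trans)) =
                  let val t = epsClosure edges (move (q, c))
                  in
                    if null t then (work, states, trans)
                    else if List.exists (fn s => s = t) states
                    then (work, states, ((q, c), t) :: trans)        (* D[q,c] <- t *)
                    else (t :: work, t :: states, ((q, c), t) :: trans)
                  end
              in loop (List.foldl step (work, states, trans) alphabet) end
        val q0 = epsClosure edges [n0]
      in
        loop ([q0], [q0], [])    (* (DFA states, DFA transitions) *)
      end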

  49. NFA -> DFA: ε-closure
     (* ε-closure: fixpoint algorithm *)
     (* Dragon Fig 3.33 gives a DFS-like algorithm.
      * Here we give a recursive version. (Simpler) *)
     X <- ∅
     fun eps (t) =
       X <- X ∪ {t}
       foreach (s ∈ one-eps(t))
         if (s ∉ X) then eps (s)

  50. NFA -> DFA: ε-closure
     (* ε-closure: fixpoint algorithm *)
     (* Dragon Fig 3.33 gives a DFS-like algorithm.
      * Here we give a recursive version. (Simpler) *)
     fun ε-closure (T) =
       X <- T
       foreach (t ∈ T)
         X <- X ∪ eps(t)
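
The two functions above as an SML sketch (edges and the ε label E come from the Thompson sketch after slide 45; the result is normalised to a sorted, duplicate-free list so the subset-construction sketch after slide 48 can compare DFA states with list equality):

    (* one-eps(t): states reachable from t by a single ε edge *)
    fun oneEps edges t =
      List.mapPartial (fn (s, E, u) => if s = t then SOME u else NONE
                        | _ => NONE)
                      edges

    (* eps(t): all states reachable from t through ε edges only *)
    fun eps edges t =
      let
        fun visit (t, seen) =
          if List.exists (fn x => x = t) seen then seen
          else List.foldl visit (t :: seen) (oneEps edges t)
      in visit (t, []) end

    (* ε-closure of a set T of states, sorted with duplicates removed *)
    fun epsClosure edges T =
      let
        fun insert (x, []) = [x]
          | insert (x, y :: ys) =
              if x < y then x :: y :: ys
              else if x = y then y :: ys
              else y :: insert (x, ys)
      in
        List.foldl insert [] (List.concat (List.map (eps edges) T))
      end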
