Lexical Analysis with lex(1) and flex(1)

Reading
• Read Sections 3-5 of Lexical Analysis with Flex
• Check out the class lecture notes
• Ask questions about either source
• Preferred venues: in class or in the CS Forums

Traits of Scanners
• Function: convert from chars to tokens
• Identify and categorize kinds of tokens
• Detect boundaries between tokens
• Discard comments and whitespace
• Remember line and column numbers for error reporting
• Report lexical errors
• Run as fast as possible
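
To make these traits concrete, here is a minimal hand-written scanner sketch in C. It is not what lex generates. The token categories and the function name next_token are assumptions made up for this illustration. It shows whitespace being discarded, line numbers being tracked, lexemes being categorized, and a lexical error being reported for an unexpected character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token categories, chosen only for this sketch. */
enum { TOK_EOF = 0, TOK_IDENT = 1, TOK_NUMBER = 2, TOK_ERROR = 3 };

/*
 * Scan one token starting at *src.
 * Skips whitespace (counting newlines in *line), copies the lexeme into
 * buf (truncating if it does not fit), advances *src past the token, and
 * returns the token category.
 */
int next_token(const char **src, int *line, char *buf, size_t bufsize)
{
    const char *p = *src;
    size_t n = 0;
    int category;

    assert(bufsize > 1);

    /* Discard whitespace, remembering line numbers for error reports. */
    while (isspace((unsigned char)*p)) {
        if (*p == '\n')
            ++*line;
        ++p;
    }

    if (*p == '\0') {
        category = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {
        /* Identifier: a letter followed by letters or digits. */
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < bufsize)
                buf[n++] = *p;
            ++p;
        }
        category = TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        /* Number: one or more digits. */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < bufsize)
                buf[n++] = *p;
            ++p;
        }
        category = TOK_NUMBER;
    } else {
        /* Lexical error: hand back the offending character as the lexeme. */
        buf[n++] = *p++;
        category = TOK_ERROR;
    }

    buf[n] = '\0';
    *src = p;
    return category;
}
```

A caller would loop on next_token() until it returns TOK_EOF, much as a parser calls yylex() once per token.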

Regular Expressions
• ε is a r.e.
• Any char in the alphabet is a r.e.
• If r and s are r.e.’s then r | s is a r.e.
• If r and s are r.e.’s then r s is a r.e.
• If r is a r.e. then r* is a r.e.
• If r is a r.e. then (r) is a r.e.

Common extensions to regular expression notation
• r+ is equivalent to rr*
• r? is equivalent to r|ε
• [abc] is equivalent to a|b|c
• [a-z] is equivalent to a|b|…|z
• [^abc] matches any single char except a, b, or c
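
These equivalences can be spot-checked with the POSIX regex library. The sketch below assumes a POSIX system with <regex.h>. POSIX extended regular expressions are not identical to lex patterns, but they share the operators listed above. The helper names full_match and same_on_samples are made up for this example. Agreement on a handful of samples is only a sanity check, not a proof of equivalence.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the POSIX extended regular expression `pattern` matches the
 * whole of `text`, 0 otherwise. The pattern is wrapped in ^(...)$ so that
 * a partial match does not count.
 */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, matched;

    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    assert(n > 0 && (size_t)n < sizeof anchored);

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;               /* treat an invalid pattern as "no match" */
    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}

/* Return 1 if both patterns accept or both reject every sample string. */
int same_on_samples(const char *p, const char *q,
                    const char *const samples[], int count)
{
    int i;

    for (i = 0; i < count; i++)
        if (full_match(p, samples[i]) != full_match(q, samples[i]))
            return 0;
    return 1;
}
```

For example, same_on_samples("a+", "aa*", samples, count) compares the two spellings on whatever sample strings the caller supplies.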

Lex’s extended regular expressions
• \c      escapes for most operators
• "s"     match C string as-is (superescape)
• r{m,n}  match r between m and n times
• r/s     match r when s follows
• ^r      match r when at beginning of line
• r$      match r when at end of line

Lexical Attributes
• A lexical attribute is a piece of information about a token
• Compiler writer can define as needed
• Typically:
• Category: integer code, used in parsing
• Lexeme: actual string as it appears in source
• Line, column: location in source code
• Value: for literals, the binary value they represent
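
As an illustration of the "value" attribute, the sketch below converts the lexeme of an integer literal into the binary value it represents. The function name int_literal_value is an assumption for this example. It relies on the standard strtol() with base 0, which accepts C-style decimal, octal, and hexadecimal spellings.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Compute the "value" attribute of an integer literal from its lexeme.
 * Base 0 lets strtol() accept decimal, octal (leading 0), and hexadecimal
 * (leading 0x) spellings. Returns 1 on success and stores the value in
 * *out; returns 0 if the lexeme is not entirely a valid literal or the
 * value does not fit in a long.
 */
int int_literal_value(const char *lexeme, long *out)
{
    char *end;

    assert(lexeme != NULL && out != NULL);

    errno = 0;
    *out = strtol(lexeme, &end, 0);
    if (end == lexeme || *end != '\0' || errno == ERANGE)
        return 0;
    return 1;
}
```

In a scanner the regular expression has normally validated the lexeme already, so the checks here mostly guard against overflow.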

Meanings of the word “token”
• A single word from the source code
• An integer code that categorizes a word
• A set of lexical attributes that are computed from a single word of input
• An instance of a class (given by category)

Lex public interface
• FILE *yyin;      /* set before calling yylex() */
• int yylex();     /* call once per token */
• char yytext[];   /* chars matched by yylex() */
• int yywrap();    /* end-of-file handler */

.l file format
    header
    %%
    body
    %%
    helper functions

Lex header
• C code inside %{ … %}
• prototypes for helper functions
• #include’s that #define integer token categories
• Macro definitions, e.g.
    letter  [a-zA-Z]
    digit   [0-9]
    ident   {letter}({letter}|{digit})*
• Warning: macros are fraught with peril

Lex body
• Regular expressions with semantic actions
    " "      { /* discard */ }
    {ident}  { return IDENT; }
    "*"      { return ASTERISK; }
    "."      { return PERIOD; }
• Match the longest r.e. possible
• Break ties with whichever appears first
• If no rule matches: copy the unmatched character to stdout
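
The two matching rules, longest match first and then earliest rule on a tie, can be sketched in plain C. This is an illustration of the policy, not how lex implements it, since a generated scanner uses a single automaton. The rule table, category names, and matcher functions are all assumptions invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical categories for this sketch; 0 means "no rule matched". */
enum { NO_MATCH = 0, KW_IF = 1, IDENT = 2, ASSIGN = 3, EQUALS = 4 };

/* Each matcher returns how many leading chars of s it accepts (0 = none). */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)
{
    size_t n = 0;

    if (isalpha((unsigned char)s[0]))
        while (isalnum((unsigned char)s[n]))
            n++;
    return n;
}

static size_t match_assign(const char *s)
{
    return s[0] == '=' ? 1 : 0;
}

static size_t match_equals(const char *s)
{
    return strncmp(s, "==", 2) == 0 ? 2 : 0;
}

/* Rules in priority order, like the rule list in a lex body. */
static const struct rule {
    size_t (*match)(const char *);
    int category;
} rules[] = {
    { match_if,     KW_IF  },
    { match_ident,  IDENT  },
    { match_assign, ASSIGN },
    { match_equals, EQUALS },
};

/*
 * Pick the rule with the longest match at s and store its length in *len.
 * The strict '>' comparison means a later rule wins only if it matches
 * more characters, so ties go to the earlier rule.
 */
int longest_match(const char *s, size_t *len)
{
    size_t i, best = 0;
    int category = NO_MATCH;

    assert(s != NULL && len != NULL);

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > best) {
            best = n;
            category = rules[i].category;
        }
    }
    *len = best;
    return category;
}
```

With this table the input "if" ties between the keyword and identifier rules and is reported as KW_IF, while "iffy" is an IDENT because the identifier rule matches more characters.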

Lex helper functions
• Follow the rules of ordinary C code
• Compute lexical attributes
• Do stuff the regular expressions can’t do
• Write a yywrap() to switch files on EOF

struct token: typical compiler
    struct token {
        int category;
        char *text;
        int linenumber;
        int column;
        char *filename;
        union literal value;
    };
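
A helper function typically fills in such a struct each time the scanner matches a token. The sketch below is one possible way to do it. The names make_token, free_token, and copy_string, and the members of union literal, are assumptions for illustration, since the slide does not define them. The lexeme is copied because the scanner's text buffer is overwritten by the next match.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical literal payload; the slide leaves union literal undefined. */
union literal {
    long ival;
    double dval;
    char *sval;
};

struct token {
    int category;
    char *text;
    int linenumber;
    int column;
    char *filename;
    union literal value;
};

/* Copy a string into freshly allocated memory; returns NULL on failure. */
static char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);

    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Release a token created by make_token(). */
void free_token(struct token *t)
{
    if (t != NULL) {
        free(t->text);
        free(t->filename);
        free(t);
    }
}

/*
 * Build a token from the scanner's current state. Returns NULL if memory
 * runs out. The value member is left zeroed for the caller to fill in.
 */
struct token *make_token(int category, const char *lexeme,
                         int line, int column, const char *filename)
{
    struct token *t;

    assert(lexeme != NULL && filename != NULL);

    t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->category = category;
    t->linenumber = line;
    t->column = column;
    t->text = copy_string(lexeme);
    t->filename = copy_string(filename);
    if (t->text == NULL || t->filename == NULL) {
        free_token(t);
        return NULL;
    }
    return t;
}
```

In a lex action this might be called as make_token(IDENT, yytext, line, column, filename), where line, column, and filename are variables the compiler writer maintains.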

“string removal tool”
    %%
    "zap me"

whitespace trimmer
    %%
    [ \t]+     putchar(' ');
    [ \t]+$    /* drop entirely */

string replacement
    %%
    username    printf("%s", getlogin());

Line/char counter
        int lines = 0, chars = 0;
    %%
    \n      ++lines; ++chars;
    .       ++chars;
    %%
    int main(void)
    {
        yylex();
        printf("lines: %d chars: %d\n", lines, chars);
        return 0;
    }

Example: C reals
• Is it: [0-9]*.[0-9]*
• Is it: ([0-9]+.[0-9]* | [0-9]*.[0-9]+)
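
One way to compare the candidates is to write out by hand what the second one accepts, assuming the dot is meant literally (written \. in lex). Under that assumption the first candidate also accepts a lone ".", because both digit runs may be empty. With the dot left unescaped it matches any single character, so even "x" would be accepted. The function name is_simple_real is made up for this sketch, which ignores exponents and suffixes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Return 1 if s is digits, one '.', then digits, with at least one digit
 * on one side of the dot; 0 otherwise. This accepts the same strings as
 *     [0-9]+\.[0-9]* | [0-9]*\.[0-9]+
 * when the dot is taken literally.
 */
int is_simple_real(const char *s)
{
    int before = 0, after = 0;

    assert(s != NULL);

    while (isdigit((unsigned char)*s)) {    /* digits before the dot */
        before++;
        s++;
    }
    if (*s != '.')                          /* the dot is mandatory */
        return 0;
    s++;
    while (isdigit((unsigned char)*s)) {    /* digits after the dot */
        after++;
        s++;
    }
    /* Nothing may follow, and a lone "." is rejected. */
    return *s == '\0' && (before > 0 || after > 0);
}
```

Dropping the (before > 0 || after > 0) condition would give the behavior of the first candidate with a literal dot.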
