Today’s Agenda Compilation > Syntax Analysis > Lexical Analysis
Typical Compiler - Phases [Diagram: Compiler = Front End + Optimizers + Back End. Input: P in L; output: P’ in M. Front End: Scanner, Parser, Type Checker / Semantic Analyzer, IR Generator (produces IR). Back End: Instruction Selector, Instruction Scheduler, Register Allocator, Code Emitter.]
Syntax Analysis • Typically includes • Parsing and Scanning • Syntax Specification for a natural language • Grammar for sentences and paragraphs. • Words are defined by convention • Syntax Specification for a PL • Grammar for constructs - expressions, statements, declarations/definitions (of procedures/variables/types/modules etc.) • Typically a Context Free Grammar covers most of it. • Words and Symbols are precisely defined • using regular expressions
Lexical Analysis • Requirement Specification: • Input – Program as a text stream • Output – Token stream • Chaff (what’s filtered out) – Comments, white space (blanks, tabs, newlines) • Error – Invalid token, invalid comment • Example Use Case Scenario: • Parser requests Scanner: getNextToken() • Scanner returns: a token, if available • The token is a terminal symbol in the grammar (defining the language recognized by the parser).
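A minimal sketch of this pull model in Java: the parser repeatedly asks the scanner for the next terminal symbol. The Token type and the null-at-end-of-input convention are assumptions for illustration, not part of the slides.

// Hypothetical token representation, e.g. kind "TK_IDENT", lexeme "main".
record Token(String kind, String lexeme) {}

// The scanner side of the interaction; returns null when the input is exhausted.
interface TokenSource { Token getNextToken(); }

class Parser {
    void parse(TokenSource scanner) {
        // The parser drives scanning: one token (grammar terminal) at a time.
        for (Token t = scanner.getNextToken(); t != null; t = scanner.getNextToken()) {
            System.out.println(t.kind() + "(" + t.lexeme() + ")");
        }
    }
}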
Lexical Analysis • Example input program:
module main
import math.*;
/* Function main */
var main : integer --> integer;
main := fun (in x : integer) ret integer { return f(x); }
endmodule
• Output token stream: TK_KEY_MODULE, TK_IDENT (main), TK_KEY_IMPORT, TK_IDENT (math), TK_PERIOD, TK_STAR, TK_SEMI, TK_KEY_VAR, TK_IDENT (main), TK_COLON, TK_KEY_INTEGER, TK_LONGARROW, TK_KEY_INTEGER, TK_SEMI, TK_IDENT (main), TK_ASSIGN, TK_KEY_FUN, TK_LPAREN, TK_KEY_IN, TK_IDENT (x), TK_COLON, TK_KEY_INTEGER, TK_RPAREN, TK_KEY_RET, TK_KEY_INTEGER, TK_LBRAC, TK_KEY_RETURN, TK_IDENT (f), TK_LPAREN, TK_IDENT (x), TK_RPAREN, TK_SEMI, TK_RBRAC, TK_KEY_ENDMOD
Lexical Analysis • Special Cases (The C Programming Language): • Uses a pre-processor (or macro-processor) before lexical analysis • Processes directives of the form • #define max 1000 • #include <stdio.h> • #define f(x) (x*x) • #ifdef (x)
Lexical Analysis • Token definitions • Most tokens are fixed text strings - specified by singleton regular languages: • Examples: “(” , “->”, “>>” • Special mentions : identifiers, literals (numbers, strings, characters), comments • Identifiers: ALP (ALP | DIG )* • ALP is an alphabetic character; DIG is a digit character • Some languages may allow some special chars. • C allows _ • Scheme allows almost any printable char.
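A small sketch of the identifier definition ALP (ALP | DIG)* using java.util.regex; the C-style variant with the underscore is included for illustration, as mentioned above.

import java.util.regex.Pattern;

public class IdentRegex {
    static final Pattern IDENT   = Pattern.compile("[A-Za-z][A-Za-z0-9]*");      // ALP (ALP | DIG)*
    static final Pattern C_IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");    // C also allows '_'

    public static void main(String[] args) {
        System.out.println(IDENT.matcher("main2").matches());   // true
        System.out.println(IDENT.matcher("_tmp").matches());    // false
        System.out.println(C_IDENT.matcher("_tmp").matches());  // true
    }
}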
Lexical Analysis - Implementation • Token Definitions: • (Reserved) Keywords: • Specific form of identifiers - reserved in some languages (e.g. C, Java) but not in others (Scheme, FORTRAN IV) • Numbers: DIG+ (“.” DIG+)? • Exercise: Regular expression for C-style comments!
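A sketch of the number pattern above, plus one common way to treat reserved keywords as identifiers looked up in a keyword table; the keyword list here is a made-up fragment, not the language's full set.

import java.util.Set;
import java.util.regex.Pattern;

public class NumKeyword {
    static final Pattern NUMBER = Pattern.compile("[0-9]+(\\.[0-9]+)?");   // DIG+ ("." DIG+)?
    static final Set<String> KEYWORDS = Set.of("module", "import", "var", "fun", "return");

    // A reserved keyword is just an identifier that appears in the keyword table.
    static String classify(String lexeme) {
        return KEYWORDS.contains(lexeme) ? "TK_KEY_" + lexeme.toUpperCase() : "TK_IDENT";
    }

    public static void main(String[] args) {
        System.out.println(NUMBER.matcher("3.14").matches());  // true
        System.out.println(NUMBER.matcher("3.").matches());    // false
        System.out.println(classify("module"));                // TK_KEY_MODULE
        System.out.println(classify("main"));                  // TK_IDENT
    }
}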
Lexical Analysis • Implementation (from scratch) • Construct a finite automaton (first an NFA, then convert it into a DFA). • Use a loop and a switch (on the current character) to model the DFA’s transitions. • Note that a DFA state summarizes the input accumulated so far, and a final (accepting) state also identifies the token type. • Each token class is a regular expression and hence has an equivalent DFA; the scanner is the combined DFA that recognizes the union of all the token languages.
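A minimal loop-and-switch sketch in Java, assuming a toy token set of identifiers and integer literals only (not the full token set from the example slide). The state names and token tags are made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class TinyScanner {
    enum State { START, IN_IDENT, IN_NUM }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.START;
        StringBuilder lexeme = new StringBuilder();
        int i = 0;
        while (i <= input.length()) {
            // Sentinel character past the end so the final token is emitted.
            char c = (i < input.length()) ? input.charAt(i) : '\0';
            switch (state) {
                case START:
                    if (Character.isLetter(c))      { state = State.IN_IDENT; lexeme.append(c); }
                    else if (Character.isDigit(c))  { state = State.IN_NUM;   lexeme.append(c); }
                    // otherwise: whitespace (or the sentinel) is simply skipped
                    i++;
                    break;
                case IN_IDENT:
                    if (Character.isLetterOrDigit(c)) { lexeme.append(c); i++; }
                    else { tokens.add("TK_IDENT(" + lexeme + ")"); lexeme.setLength(0); state = State.START; }
                    break;
                case IN_NUM:
                    if (Character.isDigit(c)) { lexeme.append(c); i++; }
                    else { tokens.add("TK_NUM(" + lexeme + ")"); lexeme.setLength(0); state = State.START; }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x1 42 foo"));   // [TK_IDENT(x1), TK_NUM(42), TK_IDENT(foo)]
    }
}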
Lexical Analysis - Implementation • Implementation of FA: • Avoid “goto” statements • “Goto” statements are harmful: • Reduced readability and maintainability • Refer to Dijkstra’s article (“Go To Statement Considered Harmful”) and Knuth’s article (“Structured Programming with go to Statements”) • Efficiency issues on modern platforms • Pipeline interruptions • Instruction pre-fetch / cache interruption • Page faults • Common principle: violation of locality of reference (of instructions) • Food for thought: Analogous violation of locality of reference (of data)?
Lexical Analysis - Implementation • Look-ahead • One-character look-ahead is often enough • E.g. > and >= in C • Multi-character look-ahead is required in some cases: • E.g. Distinguishing an (unreserved) keyword from an identifier • Question: How many look-ahead characters are needed to scan Java expressions? Consider (>, >=, >>, >>=, >>>, >>>=) • Look-ahead strategy: • Most common cases need only one character of look-ahead • Just use a single-character look-ahead in the implementation • Special-case multi-character look-ahead
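A sketch of the single-character look-ahead strategy for the > family of Java operators, using java.io.PushbackReader; the class and method names are made up for illustration. Each decision point peeks exactly one character and pushes it back if it does not extend the operator.

import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class GreaterScanner {
    // Assumes the leading '>' has already been consumed by the caller.
    static String scanGreater(PushbackReader in) throws IOException {
        int c = in.read();
        if (c == '=') return ">=";
        if (c != '>') { if (c != -1) in.unread(c); return ">"; }
        c = in.read();                           // have seen ">>" so far
        if (c == '=') return ">>=";
        if (c != '>') { if (c != -1) in.unread(c); return ">>"; }
        c = in.read();                           // have seen ">>>" so far
        if (c == '=') return ">>>=";
        if (c != -1) in.unread(c);
        return ">>>";
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader(">>= 2"));
        in.read();                               // consume the leading '>'
        System.out.println(scanGreater(in));     // prints ">>="
    }
}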
Lexical Analysis - Implementation • Use buffered I/O • Reduced I/O time due to amortized latency • The scanner scans the buffer • How to handle the end of the buffer? • Partial token at the end of the buffer • Look-ahead at the end of the buffer • The end of the buffer may hold an incomplete token, and refilling a single buffer would overwrite the partial token: • Use twin buffers as a circular queue • See the reference book (Aho, Sethi, Ullman) for details on buffering schemes
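A minimal sketch of the twin-buffer idea, assuming a Reader as the input source; the buffer size and class name are made up, and error handling and look-back bookkeeping are omitted.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TwinBuffer {
    private static final int BUF_SIZE = 4096;
    private final char[][] buf = new char[2][BUF_SIZE];
    private final int[] len = new int[2];    // valid characters in each half
    private final Reader in;
    private int cur = 0;                     // which half is being read
    private int pos = 0;                     // position within the current half

    TwinBuffer(Reader in) throws IOException {
        this.in = in;
        len[0] = fill(0);
        len[1] = 0;
    }

    private int fill(int half) throws IOException {
        int n = in.read(buf[half], 0, BUF_SIZE);
        return Math.max(n, 0);               // 0 means end of input
    }

    // Returns the next character, or -1 at end of input. A token straddling the
    // boundary stays readable because the previous half is left intact.
    int next() throws IOException {
        if (pos == len[cur]) {
            int other = 1 - cur;
            len[other] = fill(other);        // refill the other half, then switch
            cur = other;
            pos = 0;
            if (len[cur] == 0) return -1;
        }
        return buf[cur][pos++];
    }

    public static void main(String[] args) throws IOException {
        TwinBuffer tb = new TwinBuffer(new StringReader("abc"));
        for (int c; (c = tb.next()) != -1; ) System.out.print((char) c);   // prints abc
    }
}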
Lexical Analysis • Implementation (using tools/libraries) • E.g. 1: Lex - a lexical analyzer generator • Given a set of tokens (as regular expressions), it generates a scanner recognizing those tokens (as a C program, say) • See the reference book (Aho, Sethi, Ullman) for buffering details • E.g. 2: Use a tokenizer • Primitive scanning - serves the purpose in limited implementations • See the Java API (java.io.StreamTokenizer; java.util.StringTokenizer)
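A small sketch of the two Java tokenizer classes named above; this is primitive scanning with no token categories of our own, which is why it only suits limited implementations.

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) throws IOException {
        // java.util.StringTokenizer: splits on delimiter characters only.
        StringTokenizer st = new StringTokenizer("var main : integer", " :");
        while (st.hasMoreTokens()) System.out.println(st.nextToken());   // var, main, integer

        // java.io.StreamTokenizer: distinguishes words and numbers, skips whitespace.
        StreamTokenizer tok = new StreamTokenizer(new StringReader("x = 42"));
        while (tok.nextToken() != StreamTokenizer.TT_EOF) {
            if (tok.ttype == StreamTokenizer.TT_WORD)        System.out.println("word: " + tok.sval);
            else if (tok.ttype == StreamTokenizer.TT_NUMBER) System.out.println("number: " + tok.nval);
            else                                             System.out.println("char: " + (char) tok.ttype);
        }
    }
}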
Lexical Analysis • Design Specification: Module (class in Java) with the interfaces (public methods in Java) • Token nextToken(); // returns the next token • boolean hasMoreTokens();
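A sketch of this design specification in Java; the slides name only the two methods, so the Token fields shown here are assumptions for illustration.

public interface Scanner {
    Token nextToken();        // returns the next token from the input
    boolean hasMoreTokens();  // true if more tokens are available

    // Minimal token representation, e.g. kind "TK_IDENT" with lexeme "main".
    final class Token {
        public final String kind;
        public final String lexeme;
        public Token(String kind, String lexeme) { this.kind = kind; this.lexeme = lexeme; }
        @Override public String toString() { return kind + "(" + lexeme + ")"; }
    }
}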