
Lexical Analysis



  1. Lexical Analysis • Why split it from parsing? • Simplifies design • Parsers with whitespace and comments are more awkward • Efficiency • Only use the most powerful technique that works • And nothing more • No parsing sledgehammers for lexical nuts • Portability • More modular code • More code re-use

  2. Source Code Characteristics • Code • Identifiers • Count, max, get_num • Language keywords: reserved or predefined • switch, if .. then.. else, printf, return, void • Mathematical operators • +, *, >> …. • <=, =, != … • Literals • “Hello World” • Comments • Whitespace

  3. Reserved words versus predefined identifiers • Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier). • Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t be). • Examples of predefined identifiers in Java: anything in the java.lang package, such as String, Object, System, Integer.

  4. Language of Lexical Analysis • Tokens: category • Patterns: regular expression • Lexemes: actual string matched

  5. Tokens are not enough… • Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information (value, name) • Such data items are attributes of the tokens • Stored in the symbol table

  6. Token delimiters • When does a token/lexeme end? e.g. xtemp=ytemp

  7. Ambiguity in identifying tokens • A programming language definition will state how to resolve uncertain token assignment • <> Is it 1 token or 2? • Reserved keywords (e.g. if) take precedence over identifiers (the matching rules are the same for both) • Disambiguating rules state what to do • ‘Principle of longest substring’: greedy match
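
The longest-substring rule and keyword precedence can be sketched with Python's re module (the token names and keyword set below are made up for illustration):

```python
import re

# A sketch of maximal-munch tokenizing with keyword precedence. The token
# names and the keyword set are illustrative, not from any real language.
KEYWORDS = {"if", "then", "else"}
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)"
                      r"|(?P<ID>[A-Za-z_]\w*)"
                      r"|(?P<OP><>|<=|>=|<|>|=))")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            break                        # unrecognized character: stop
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "ID" and text in KEYWORDS:
            kind = "KEYWORD"             # reserved words beat identifiers
        tokens.append((kind, text))
        pos = m.end()
    return tokens
```

Note that `<>` and `<=` are listed before `<` so the two-character operators win, and the greedy `\d+` / `\w*` repetitions give the longest substring for numbers and identifiers.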

  8. Regular Expressions • To represent patterns of strings of characters • REs • Alphabet – set of legal symbols • Meta-characters – characters with special meanings • ε is the empty string • 3 basic operations • Choice – choice1|choice2 • a|b matches either a or b • Concatenation – firstthing secondthing • (a|b)c matches the strings { ac, bc } • Repetition (Kleene closure) – repeatme* • a* matches { ε, a, aa, aaa, aaaa, … } • Precedence: * is highest, | is lowest • Thus a|bc* is a|(b(c*))

  9. Regular Expressions… • We can add in regular definitions • digit = 0|1|2|…|9 • And then use them: • digit digit* • A sequence of 1 or more digits • One or more repetitions: • (a|b)(a|b)* ≡ (a|b)+ • Any character in the alphabet: . • .*b.* – strings containing at least one b • Ranges [a-z], [a-zA-Z], [0-9] (assume character set ordering) • Not: ~a or [^a]
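
As a quick check of these operations, here is how the same patterns behave in Python's re module (a sketch; Python writes ‘not a’ as [^a] and has no ~ operator):

```python
import re

# The three basic operations and the shorthands from the slide, checked
# with Python's re module.
assert re.fullmatch(r"a|b", "a")                 # choice
assert re.fullmatch(r"(a|b)c", "bc")             # concatenation
assert re.fullmatch(r"a*", "")                   # Kleene star: zero reps ok
assert re.fullmatch(r"(a|b)+", "abba")           # one or more repetitions
assert re.fullmatch(r".*b.*", "aXbY")            # at least one b
assert re.fullmatch(r"[0-9]+", "2024")           # range, i.e. digit digit*
assert re.fullmatch(r"[^a]", "z")                # 'not a'
assert not re.fullmatch(r"a|bc*", "ac")          # * binds tightest: a|(b(c*))
```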

  10. In JavaCC, you specify your tokens using regular expressions:
  SKIP : { " " | "\r" | "\t" }
  TOKEN : { < EOL: "\n" > }
  TOKEN : /* OPERATORS */ { < PLUS: "+" > | < MINUS: "-" > | < MULTIPLY: "*" > | < DIVIDE: "/" > }
  TOKEN : { < FLOAT: ( (<DIGIT>)+ "." ( <DIGIT> )* ) | ( "." ( <DIGIT> )+ ) > | < INTEGER: (<DIGIT>)+ > | < #DIGIT: ["0" - "9"] > }
  TOKEN : { < TYPE: ("float"|"int") > | < IDENTIFIER: ( (["a"-"z"]) | (["A"-"Z"]) ) ( (["a"-"z"]) | (["A"-"Z"]) | (["0"-"9"]) | "_" )* > }

  11. Some exercises • Describe the languages denoted by the following regular expressions • 0 ( 0 | 1 )* 0 • ( ( 11 | 0 )* )* • 0* 1 0* 1 0* 1 0* • Write regular definitions for the following languages • All strings that contain the five vowels in order (but not necessarily adjacent): aabcaadggge is okay • All strings of letters in which the letters are in ascending lexicographic order • All strings of 0’s and 1’s that do not contain the substring 011

  12. Limitations of REs • REs can describe many language constructs but not all • For example, with Alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b’s: S = {a, bab, bbabb, bbbabbb, …} • For example, nested tags in HTML

  13. Lookahead • <=, <>, < • When we read a token delimiter to establish a token, we need to make sure that it is still available as part of the next token • It is the start of the next token! • This is lookahead • Decide what to do based on the character we ‘haven’t read’ • Sometimes implemented by reading from a buffer and then pushing the input back into the buffer • And then starting with recognizing the next token
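
A minimal sketch of this pushback mechanism (class and function names are illustrative):

```python
# A reader with one-character pushback: the lookahead character that ends
# one token is returned to the buffer so it can start the next token.
class PushbackReader:
    def __init__(self, text):
        self.text, self.pos, self.buffer = text, 0, []

    def read(self):
        if self.buffer:
            return self.buffer.pop()    # serve pushed-back characters first
        if self.pos < len(self.text):
            ch = self.text[self.pos]
            self.pos += 1
            return ch
        return ""                       # end of input

    def unread(self, ch):
        self.buffer.append(ch)          # push the lookahead character back

def scan_number(reader):
    """Consume digits; the delimiter that ends the token is pushed back."""
    digits = ""
    ch = reader.read()
    while ch.isdigit():
        digits += ch
        ch = reader.read()
    if ch:
        reader.unread(ch)               # it belongs to the *next* token
    return digits
```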

  14. Classic Fortran example • DO 99 I=1,10 becomes DO99I=1,10 versus DO99I=1.10 • The first is a do loop, the second an assignment • We need lots of lookahead to distinguish • When can the lexical analyzer assign a token? • Push back into input buffer, or ‘backtracking’

  15. Finite Automata • A recognizer determines if an input string is a sentence in a language • Uses a regular expression • Turn the regular expression into a finite automaton • Could be deterministic or non-deterministic

  16. Transition diagram for identifiers • RE: Identifier -> letter (letter | digit)* • [Diagram: start state 0, on letter go to state 1; state 1 loops on letter | digit; any other character leads to accepting state 2]
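
This transition diagram can be hand-coded directly; the following sketch mirrors the three states (function name is illustrative):

```python
# A hand-coded DFA for the RE  letter (letter | digit)*  — a sketch, not a
# production scanner. State 0 = start, state 1 = inside the identifier,
# state 2 = accept (entered on the first character that cannot extend it).
def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) if an identifier starts at pos, else None."""
    if pos >= len(text) or not text[pos].isalpha():
        return None                      # state 0: must begin with a letter
    end = pos + 1
    while end < len(text) and (text[end].isalpha() or text[end].isdigit()):
        end += 1                         # state 1: loop on letter | digit
    return text[pos:end], end            # 'other' character: accept (state 2)
```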

  17. An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε. In the case of multiple transitions from a state over the same character, when we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The ε transition doesn't consume any input characters, so you may jump to another state for free. • Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power.

  18. From a Regular Expression to an NFA: Thompson’s Construction • Example: (a | b)* abb • [Diagram: states 0–10 linked by ε-transitions, with an a-edge (2→3) and a b-edge (4→5) forming the (a|b) choice, an ε-loop back for the *, and edges a, b, b (7→8→9→10) leading to the accept state]

  19. Non-deterministic finite state automaton (NFA): state 0 loops on a and b, then edges a, b, b lead through states 1 and 2 to accepting state 3. Equivalent deterministic finite state automaton (DFA): states 0, 01, 02, 03, with 03 accepting. [Diagrams omitted]

  20. NFA -> DFA (subset construction) • We can convert from an NFA to a DFA using subset construction. • To perform this operation, let us define two functions: • The ε-closure function takes a state and returns the set of states reachable from it based on (one or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input. • The function move takes a state and a character, and returns the set of states reachable by one transition on this character. • We can generalize both these functions to apply to sets of states by taking the union of the application to individual states. • E.g. if A, B and C are states, move({A,B,C},‘a’) = move(A,‘a’) ∪ move(B,‘a’) ∪ move(C,‘a’).

  21. NFA -> DFA (cont) • The Subset Construction Algorithm • Create the start state of the DFA by taking the ε-closure of the start state of the NFA. • Perform the following for the new DFA state: For each possible input symbol: • Apply move to the newly-created state and the input symbol; this will return a set of states. • Apply the ε-closure to this set of states, possibly resulting in a new set. • This set of NFA states will be a single state in the DFA. • Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states. • The finish states of the DFA are those which contain any of the finish states of the NFA.
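
A compact sketch of the algorithm, with the NFA given as plain dictionaries (all names here are illustrative):

```python
# Subset construction sketch. The NFA is described by:
#   eps[state]        -> set of states reachable by one epsilon-transition
#   delta[(state, c)] -> set of states reachable on character c
def eps_closure(states, eps):
    """All states reachable from `states` via zero or more epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, c, delta):
    """All states reachable by one transition on character c."""
    return set().union(*(delta.get((s, c), set()) for s in states))

def subset_construction(start, delta, eps, alphabet):
    start_d = eps_closure({start}, eps)     # step 1: DFA start state
    dfa, todo = {}, [start_d]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue                        # already processed this DFA state
        dfa[S] = {}
        for c in alphabet:                  # step 2: move, then eps-closure
            T = eps_closure(move(S, c, delta), eps)
            if T:
                dfa[S][c] = T
                todo.append(T)              # new DFA state: process it too
    return start_d, dfa
```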

  22. Transition Table (DFA) • [Table of DFA states versus input symbols omitted]

  23. Writing a lexical analyzer • The DFA helps us to write the scanner. • Figure 4.1 in your text gives a good example of what a scanner might look like.

  24. LEX (FLEX) • Tool for generating programs which recognize lexical patterns in text • Takes regular expressions and turns them into a program

  25. Lexical Errors • Only a small percentage of errors can be recognized during lexical analysis • Consider: if (good == “bad)

  26. Examples from the Perl language • Line ends inside literal string • Illegal character in input file • Missing semi-colon • Missing operator • Missing paren • Unquoted string • Unopened file handle

  27. In general • What does a lexical error mean? • Strategies for dealing with errors: • “Panic mode” • Delete chars from input until something matches • Inserting characters • Re-ordering characters • Replacing characters • For an error like “illegal character” we should report it sensibly

  28. Syntax Analysis • also known as Parsing • Grouping together tokens into larger structures • Analogous to lexical analysis • Input: • Tokens (output of Lexical Analyzer) • Output: • Structured representation of original program

  29. A Context Free Grammar • A grammar is a four-tuple (Σ, N, P, S) where • Σ is the terminal alphabet • N is the non-terminal alphabet • P is the set of productions • S is a designated start symbol in N

  30. Parsing • Need to express a series of added operands • Expression → number plus Expression | number • Similar to regular definitions: • Concatenation • Choice • No Kleene closure – repetition by recursion • Expression → number Operator Expression • Operator → + | - | * | /

  31. BNF Grammar • Expression → number Operator number • Operator → + | - | * | / • Structure on the left is defined to consist of the choices on the right hand side • Meta-symbols: → and | • Different conventions for writing BNF Grammars: • <expression> ::= number <operator> number • Expression → number Operator number

  32. Derivations • Derivation: • Sequence of replacements of structure names by choices on the RHS of grammar rules • Begin: start symbol • End: string of token symbols • Each step one replacement is made • Exp → Exp Op Exp | number • Op → + | - | * | /

  33. Example Derivation • Note the different arrows: • ⇒ Derivation applies grammar rules • → Used to define grammar rules • Non-terminals: Exp, Op • Terminals: number, * • Terminals: because they terminate the derivation

  34. E → ( E ) | a • What sentences does this grammar generate? An example derivation: • E ⇒ ( E ) ⇒ ((E)) ⇒ ((a)) • Note that this is what we couldn’t achieve with regular definitions
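
A short recursive recognizer for E → ( E ) | a shows how recursion handles the balanced nesting that regular definitions cannot (a sketch):

```python
# Recursive recognizer for the grammar  E -> ( E ) | a.
# Each call consumes one E; recursion depth tracks the nesting of parens.
def match_E(s, i=0):
    """Return the index just past a match of E starting at i, or None."""
    if i < len(s) and s[i] == "a":
        return i + 1                     # E -> a
    if i < len(s) and s[i] == "(":
        j = match_E(s, i + 1)            # E -> ( E )
        if j is not None and j < len(s) and s[j] == ")":
            return j + 1
    return None

def recognizes(s):
    return match_E(s) == len(s)          # the whole string must be one E
```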

  35. Recursive Grammars • At seats, try writing a grammar to generate a^n b^n • E → E α | α • derives α, αα, ααα, αααα, … • All strings consisting of α followed by zero or more repetitions of α • α α*

  36. Given the grammar rules shown below, derive the sentence This is the house that Jack built. Draw the parse tree, labeling your subtrees with the numbers of the grammar rules used to derive them. • Grammar rules: • S → NP VP • NP → NP REL S | PRO | N | ART N | NAME • VP → V | V NP • N → house • PRO →this |that • REL→ that • ART→ the • V→ built | is • NAME → Jack Which is easier – bottom up or top down?

  37. Parse Trees & Derivations • Leaves = terminals • Interior nodes = non-terminals • If we replace the non-terminals right to left • The parse tree sequence is right to left • A rightmost derivation -> reverse post-order traversal • If we derive left to right: • A leftmost derivation • pre-order traversal • Parse trees encode information about the derivation process

  38. Abstract Syntax Trees • Parse trees contain surplus information • Token sequence: number + number (3 + 4) • Parse tree: exp over (exp op exp) • Abstract syntax tree: + with children 3 and 4 • The AST is all the information we actually need

  39. An exercise • Consider the grammar • S → (L) | a • L → L,S | S • What are the terminals, nonterminals and start symbol? • Find leftmost and rightmost derivations and parse trees for the following sentences • (a,a) • (a, (a,a)) • (a, ((a,a), (a,a)))

  40. Parsing token sequence: id + id * id • E → E + E | E * E | ( E ) | - E | id • How many ways can you find a tree which matches the expression?

  41. Example of Ambiguity • Grammar: expr → expr + expr | expr * expr | ( expr ) | NUMBER • Expression: 2 + 3 * 4 • Parse trees: [two distinct trees omitted]

  42. Ambiguity • If a sentence has two distinct parse trees, the grammar is ambiguous • Or alternatively: a grammar is ambiguous if there are two different right-most derivations for the same string. • In English, the phrase “small dogs and cats” is ambiguous as we aren't sure if the cats are small or not. • “I see flying planes” is also ambiguous • A language is said to be ambiguous if no unambiguous grammar exists for it. • “Dance is at the old main gym.” How is it parsed?

  43. Ambiguous Grammars • Problem – no clear structure is expressed • A grammar that generates a string with 2 distinct parse trees is called an ambiguous grammar • 2+3*4 = 2 + (3*4) = 14 • 2+3*4 = (2+3) * 4 = 20 • How does the grammar relate to meaning? • Our experience of math says interpretation 1 is correct but the grammar does not express this: • E → E + E | E * E | ( E ) | - E | id
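
The two interpretations can be written as explicit trees and evaluated, confirming the 14-versus-20 discrepancy (a sketch using nested tuples):

```python
# Each tree is either an int leaf or a tuple (op, left, right).
def evaluate(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b

tree1 = ("+", 2, ("*", 3, 4))   # * grouped tighter: 2 + (3 * 4)
tree2 = ("*", ("+", 2, 3), 4)   # + grouped tighter: (2 + 3) * 4

print(evaluate(tree1))  # 14
print(evaluate(tree2))  # 20
```

Both trees are legitimate derivations of 2 + 3 * 4 under the ambiguous grammar; only the usual precedence convention tells us the first is the intended meaning.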

  44. Removing Ambiguity • Two methods • 1. Disambiguating Rules • The basic notion is to write grammar rules of the form • expr : expr OP expr and • expr : UNARY expr • for all binary and unary operators desired. • This creates a very ambiguous grammar with many parsing conflicts. • You specify as disambiguating rules the precedence of all the operators and the associativity of the binary operators. • Positives: leaves grammar unchanged • Negatives: grammar is not sole source of syntactic knowledge

  45. Removing Ambiguity • Two methods • 2. Rewrite the Grammar • Use knowledge of the intended meaning (needed later for the translation into object code) to guide grammar alteration

  46. Sometimes we can remove ambiguity from a grammar by restructuring the productions, but sometimes the language is inherently ambiguous. • For example, L = { a^i b^j c^k | i=j or j=k, for i,j,k >= 1 } • An ambiguous grammar to generate this language is shown below: [grammar omitted]

  47. Precedence • E → E Addop Term | Term • Addop → + | - • Term → Term * Factor | Term / Factor | Factor • Factor → ( Exp ) | number | id • Operators of equal precedence are grouped together at the same ‘level’ of the grammar: a ‘precedence cascade’ • The lowest level operators have highest precedence • (The first shall be last and the last shall be first.)

  48. Associativity • 45-10-5: 30 or 40? Subtraction is left associative, left to right (= 30) • E → E addop E | Term • Does not tell us how to split up 45-10-5 • E → E addop Term | Term • Forces left associativity via left recursion • Precedence & associativity remove the ambiguity of arithmetic expressions • Which is what our math teachers spent years telling us!
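
One way to see precedence and left associativity at work is a recursive-descent sketch of the precedence cascade from slide 47, using loops for the repetition (the token handling here is simplified and illustrative):

```python
# Recursive-descent evaluator for the cascade Exp -> Term -> Factor.
# Term sits below Exp, so * and / bind tighter than + and -; the while
# loops consume operators left to right, giving left associativity.
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def exp():                          # Exp -> Term { (+|-) Term }
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():                         # Term -> Factor { (*|/) Factor }
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():                       # Factor -> ( Exp ) | number
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            value = exp()
            pos += 1                    # consume ")"
            return value
        return tok                      # a number

    return exp()
```

For example, parse([45, "-", 10, "-", 5]) groups as (45-10)-5 = 30, and parse([2, "+", 3, "*", 4]) gives 14, matching the intended precedence.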

  49. Extended BNF Notation • Notation for repetition and optional features. • {…} expresses repetition: • expr → expr + term | term becomes expr → term { + term } • […] expresses optional features: • if-stmt → if( expr ) stmt | if( expr ) stmt else stmt • becomes if-stmt → if( expr ) stmt [ else stmt ]

  50. Notes on use of EBNF • Use {…} only for left recursive rules: • expr → term + expr | term should become expr → term [ + expr ] • Do not start a rule with {…}: • write expr → term { + term }, not expr → { term + } term • Exception to previous rule: simple token repetition, e.g. expr → { - } term … • Square brackets can be used anywhere, however: • expr → expr + term | term | unaryop term should be written as expr → [ unaryop ] term { + term }
