Programming Language Concepts (CIS 635)

Programming Language Concepts (CIS 635) Elsa L Gunter 4303 GITC NJIT, www.cs.njit.edu/~elsa/635

Parsing Token Streams
• We will create three mutually recursive parsing functions:
    expr : (token option * (unit -> token option)) -> (bool * (token option * (unit -> token option)))
    term : (token option * (unit -> token option)) -> (bool * (token option * (unit -> token option)))
    factor : (token option * (unit -> token option)) -> (bool * (token option * (unit -> token option)))

Parsing an Expression
<expr> ::= <term> [( + | - ) <expr> ]
    fun expr tokens =
      (case term tokens of
          ( true , tokens_after_term) =>
            (case tokens_after_term of
                ( SOME Plus_token, tokens_after_plus) =>

Parsing a Plus Expression
<expr> ::= <term> + <expr>
    fun expr tokens =
      (case term tokens of
          ( true , tokens_after_term) =>
            (case tokens_after_term of
                ( SOME Plus_token , tokens_after_plus) =>

Parsing a Plus Expression
<expr> ::= <term> + <expr>
                  (case expr (tokens_after_plus(), tokens_after_plus) of
                      ( true , tokens_after_expr) => ( true , tokens_after_expr)

What If No Expression After Plus
<expr> ::= <term> + <expr>
                    | ( false ,rem_tokens) => ( false , rem_tokens))
• Code for Minus_token is almost identical

What If No Plus or Minus
<expr> ::= <term>
              | _ => ( true , tokens_after_term))

What If No Term
<expr> ::= <term> [( + | - ) <expr> ]
        | ( false , rem_tokens) => ( false , rem_tokens))
• Code for term is the same as for expr except for replacing addition with multiplication and subtraction with division

Parsing Factor as Id
<factor> ::= <id>
    and factor (SOME (Id_token id_name) , tokens) = ( true , (tokens(), tokens))

Parsing Factor as Parenthesized Expression
<factor> ::= ( <expr> )
      | factor (SOME Left_parenthesis , tokens) =
          (case expr (tokens(), tokens) of
              ( true , tokens_after_expr) =>

Parsing Factor as Parenthesized Expression
<factor> ::= ( <expr> )
                (case tokens_after_expr of
                    ( SOME Right_parenthesis , tokens_after_rparen ) =>
                      ( true , (tokens_after_rparen(), tokens_after_rparen))

What if No Right Parenthesis
<factor> ::= ( <expr> )
                  | _ => ( false , tokens_after_expr))

What If No Expression After Left Parenthesis
<factor> ::= ( <expr> )
            | ( false , rem_tokens) => ( false , rem_tokens))

What If No Id or Left Parenthesis
<factor> ::= <id> | ( <expr> )
      | factor tokens = ( false , tokens)

Parsing - in C
• Assume global variable nextToken that holds the latest token removed from the token stream
• Assume subroutine lex( ) to analyze the character stream, find the next token at the head of that stream and update nextToken with that token
• Assume subroutine error( ) to raise an exception

Parsing expr – in C
<expr> ::= <term> [( + | - ) <expr> ]
    void expr ( ) {
      term ( );
      if (nextToken == PLUS_CODE) {
        lex ( );
        expr ( );
      }
      else if (nextToken == MINUS_CODE) {
        lex ( );
        expr ( );
      }
    }

SML Code
    fun expr tokens =
      (case term tokens of
          ( true , tokens_after_term) =>
            (case tokens_after_term of
                (SOME Plus_token,tokens_after_plus) =>
                  (case expr (tokens_after_plus(), tokens_after_plus) of
                      ( true , tokens_after_expr) => ( true , tokens_after_expr)

Parsing expr – in C (optimized)
<expr> ::= <term> [( + | - ) <expr> ]
    void expr ( ) {
      term( );
      while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex ( );
        term ( );
      }
    }

Parsing factor – in C
<factor> ::= <id>
    void factor ( ) {
      if (nextToken == ID_CODE)
        lex ( );

Parsing Factor as Id
<factor> ::= <id>
    and factor (SOME (Id_token id_name) , tokens) = ( true , (tokens(), tokens))

Parsing factor – in C
<factor> ::= ( <expr> )
      else if (nextToken == LEFT_PAREN_CODE) {
        lex ( );
        expr ( );
        if (nextToken == RIGHT_PAREN_CODE)
          lex ( );

Comparable SML Code
      | factor (SOME Left_parenthesis , tokens) =
          (case expr (tokens(), tokens) of
              ( true , tokens_after_expr) =>
                (case tokens_after_expr of
                    ( SOME Right_parenthesis , tokens_after_rparen ) =>
                      ( true , (tokens_after_rparen(), tokens_after_rparen))

Parsing factor – in C
        else error ( ); /* Right parenthesis missing */
      }
      else error ( ); /* Neither <id> nor ( was found at start */
    }
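The C fragments above can be assembled into one small recognizer. The following is only a sketch under stated assumptions: the character-level lex( ), the codes TIMES_CODE, DIV_CODE, EOF_CODE and BAD_CODE, the input pointer, and the wrapper accepts( ) are all hypothetical additions, and error( ) is modeled with setjmp/longjmp because C has no exceptions. term( ) is written by analogy with the optimized expr( ).

```c
#include <assert.h>
#include <ctype.h>
#include <setjmp.h>
#include <stddef.h>

/* Token codes; only ID, PLUS, MINUS and the parentheses appear on the slides. */
enum { ID_CODE, PLUS_CODE, MINUS_CODE, TIMES_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, EOF_CODE, BAD_CODE };

static const char *input;   /* remaining characters (assumed representation) */
static int nextToken;       /* latest token removed from the stream */
static jmp_buf on_error;    /* where error() jumps back to */

static void error(void) { longjmp(on_error, 1); }

/* Find the token at the head of the character stream and store it in nextToken. */
static void lex(void) {
    while (*input == ' ') input++;
    char c = *input;
    if (c == '\0') { nextToken = EOF_CODE; return; }
    if (isalpha((unsigned char)c)) {            /* identifier: letter, then letters or digits */
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID_CODE;
        return;
    }
    input++;
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = TIMES_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void expr(void);

/* <factor> ::= <id> | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE)
        lex();
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
            lex();
        else
            error();    /* right parenthesis missing */
    } else
        error();        /* neither <id> nor ( at start */
}

/* <term> ::= <factor> [( * | / ) <term>], written as a loop */
static void term(void) {
    factor();
    while (nextToken == TIMES_CODE || nextToken == DIV_CODE) { lex(); factor(); }
}

/* <expr> ::= <term> [( + | - ) <expr>], written as a loop */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) { lex(); term(); }
}

/* Returns 1 if text is exactly one <expr>, 0 otherwise. */
int accepts(const char *text) {
    assert(text != NULL);
    input = text;
    if (setjmp(on_error)) return 0;   /* reached via error() */
    lex();
    expr();
    return nextToken == EOF_CODE;
}
```

With these assumptions, accepts("a + b * (c - d)") returns 1, while accepts("(a + b") and accepts("a +") return 0.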

Error cases in SML
    (* No right parenthesis *)
                  | _ => ( false , tokens_after_expr))
    (* No expression found *)
            | ( false , rem_tokens) => ( false , rem_tokens))
    (* Neither <id> nor left parenthesis found *)
      | factor tokens = ( false , tokens)

Lexers – Simple Parsers
• Lexers are parsers driven by regular grammars
• Use character codes and arithmetic comparisons rather than case analysis to determine syntactic category for each character
• Often some semantic action must be taken
  • Compute a number or build a string and record it in a symbol table

Example
• <pos> ::= <digit> <pos> | <digit>
• <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    fun digit c =
      (case Char.ord c of
          n => if n >= Char.ord #"0" andalso n <= Char.ord #"9"
               then SOME (n - Char.ord #"0")
               else NONE)

Example
    fun pos [] = (NONE,[])
      | pos (chars as ch::rem_chars) =
          (case digit ch of
              NONE => (NONE, chars)
            | SOME n =>
                (case pos rem_chars of
                    (NONE, more_chars) => (SOME (10,n), more_chars)
                  | (SOME (p,m), more_chars) => (SOME (10*p,(p*n)+m), more_chars)))
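For comparison with the C parsing code, the same lexer can be sketched in C. This is an illustrative version, not a translation of the SML: it accumulates the value left to right instead of carrying a power of ten back from the recursion, and the return conventions (-1 for "not a digit", a count of consumed characters) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* <digit>: value of c if it is a decimal digit, -1 otherwise.
 * Uses character-code comparison and arithmetic, not a case per digit. */
int digit(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* <pos>: reads the longest run of digits at the head of s.
 * On success stores the number in *value and returns how many characters
 * were consumed; returns 0 and leaves *value alone if s has no leading digit.
 * No overflow check is made in this sketch. */
size_t pos(const char *s, long *value) {
    assert(s != NULL && value != NULL);
    size_t n = 0;
    long acc = 0;
    int d;
    while ((d = digit(s[n])) >= 0) {
        acc = acc * 10 + d;   /* semantic action: compute the number */
        n++;
    }
    if (n > 0)
        *value = acc;
    return n;
}
```

For the input "123abc" this consumes three characters and stores 123, leaving "abc" as the remaining stream.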

Problems for Recursive-Descent Parsing
• Left Recursion: A ::= Aw translates to a subroutine that loops forever
• Indirect Left Recursion: A ::= Bw, B ::= Av causes the same problem
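To make the left-recursion problem concrete, take the rule A ::= A b | b as an assumed instance of A ::= Aw. A direct translation calls itself before consuming any input, so it is shown only in a comment. The usual remedy is to rewrite the rule as the equivalent iteration A ::= b { b }, sketched below over a plain character string; match_A is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Direct translation of A ::= A b | b (do not run):
 *
 *     void A(void) { A(); ... }
 *
 * The first action of A is to call A again with the input unchanged,
 * so the recursion never reaches a token.
 */

/* Recognizes A ::= b { b }, which generates the same strings.
 * Returns the number of characters matched at the head of s, 0 if none. */
size_t match_A(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    if (s[n] != 'b')
        return 0;          /* A must start with b */
    n++;
    while (s[n] == 'b')    /* each further b replaces one left-recursive step */
        n++;
    return n;
}
```

Every call consumes at least one character before looping, so the function always terminates.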

Problems for Recursive-Descent Parsing
• Parser must always be able to choose the next action based only on the very next token
• Pairwise Disjointedness Test: Can we always determine which rule (in the non-extended BNF) to choose based on just the first token?

Pairwise Disjointedness Test
• For each rule A ::= y, calculate FIRST(y) = {a | y =>* aw} ∪ {ε | if y =>* ε}
• For each pair of rules A ::= y and A ::= z, require FIRST(y) ∩ FIRST(z) = { }
• Test too strong: Can’t handle <expr> ::= <term> [ ( + | - ) <expr> ]

Example
Grammar:
    <S> ::= <A> a <B> b
    <A> ::= <A> b | b
    <B> ::= a <B> | a
FIRST (<A> b) = {b}
FIRST (b) = {b}
Rules for <A> not pairwise disjoint
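The check on this grammar can be mechanized. The sketch below assumes a toy encoding that is not part of the slides: uppercase letters are nonterminals, lowercase letters are terminals, and no rule derives the empty string, so FIRST of a right-hand side is just FIRST of its first symbol. FIRST sets are bit masks over the letters a to z, computed by iterating until nothing changes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { char lhs; const char *rhs; } Rule;

/* The example grammar in the assumed encoding. */
static const Rule rules[] = {
    {'S', "AaBb"}, {'A', "Ab"}, {'A', "b"}, {'B', "aB"}, {'B', "a"},
};
enum { NRULES = sizeof rules / sizeof rules[0] };

static unsigned first_nt[26];   /* FIRST of each nonterminal; bit i means 'a' + i */

/* FIRST of a right-hand side, valid only because no rule derives ε. */
static unsigned first_of_rhs(const char *rhs) {
    char x = rhs[0];
    if (x >= 'a' && x <= 'z')
        return 1u << (x - 'a');
    return first_nt[x - 'A'];
}

/* Fixed-point iteration: keep merging until no FIRST set grows. */
void compute_first(void) {
    int changed = 1;
    for (int i = 0; i < 26; i++)
        first_nt[i] = 0;
    while (changed) {
        changed = 0;
        for (int r = 0; r < NRULES; r++) {
            unsigned *set = &first_nt[rules[r].lhs - 'A'];
            unsigned merged = *set | first_of_rhs(rules[r].rhs);
            if (merged != *set) { *set = merged; changed = 1; }
        }
    }
}

/* Returns 1 if every pair of rules for nt has disjoint FIRST sets, else 0. */
int pairwise_disjoint(char nt) {
    assert(nt >= 'A' && nt <= 'Z');
    compute_first();
    for (int i = 0; i < NRULES; i++)
        for (int j = i + 1; j < NRULES; j++)
            if (rules[i].lhs == nt && rules[j].lhs == nt &&
                (first_of_rhs(rules[i].rhs) & first_of_rhs(rules[j].rhs)))
                return 0;
    return 1;
}
```

Under this encoding pairwise_disjoint('A') returns 0, matching the conclusion above. It also returns 0 for 'B', since both <B> rules begin with a, and 1 for 'S', which has a single rule.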
