This document presents an in-depth exploration of parsing expressions based on grammar definitions. It outlines the structure and implementation of parsing functions for arithmetic expressions, including recursion for terms and factors. The sample grammar showcases how to identify and parse tokens such as identifiers, parentheses, and arithmetic operators (+, -, *, /). The principles discussed are applicable in both Standard ML (SML) and C programming languages, with specific examples demonstrating recursive descent parsing techniques. This comprehensive guide serves as an educational resource for understanding the implementation of parsers.
Programming Language Concepts (CIS 635) Elsa L Gunter 4303 GITC NJIT, www.cs.njit.edu/~elsa/635
Sample Grammar
<expr> ::= <term> | <term> + <expr> | <term> - <expr>
<term> ::= <factor> | <factor> * <term> | <factor> / <term>
<factor> ::= <id> | ( <expr> )
Tokens as SML Datatypes • + - * / ( ) <id> • Becomes an SML datatype datatype token = Id_token of string | Left_parenthesis | Right_parenthesis | Times_token | Divide_token | Plus_token | Minus_token
Parsing Token Streams • We will create three mutually recursive parsing functions, all of the same type:
expr : token option * (unit -> token option) -> bool * (token option * (unit -> token option))
term : token option * (unit -> token option) -> bool * (token option * (unit -> token option))
factor : token option * (unit -> token option) -> bool * (token option * (unit -> token option))
Parsing an Expression <expr> ::= <term> [( + | - ) <expr> ] fun expr tokens = (case term tokens of ( true , tokens_after_term) => (case tokens_after_term of ( SOME Plus_token, tokens_after_plus) =>
Parsing a Plus Expression <expr> ::= <term> + <expr> fun expr tokens = (case term tokens of ( true , tokens_after_term) => (case tokens_after_term of ( SOME Plus_token , tokens_after_plus) =>
Parsing a Plus Expression <expr> ::= <term> + <expr> (case expr (tokens_after_plus(), tokens_after_plus) of ( true , tokens_after_expr) => ( true , tokens_after_expr)
What If No Expression After Plus <expr> ::= <term> + <expr> | ( false ,rem_tokens) => ( false , rem_tokens)) • Code for Minus_token is almost identical
What If No Plus or Minus <expr> ::= <term> | _ => ( true , tokens_after_term))
What If No Term <expr> ::= <term> [( + | - ) <expr> ] | ( false , rem_tokens) => ( false , rem_tokens)) • Code for term is the same as for expr, except that it calls factor in place of term and term in place of expr, and it replaces addition with multiplication and subtraction with division
Parsing Factor as Id <factor> ::= <id> and factor (SOME (Id_token id_name) , tokens) = ( true , (tokens(), tokens))
Parsing Factor as Parenthesized Expression <factor> ::= ( <expr> ) | factor (SOME Left_parenthesis , tokens) = (case expr (tokens(), tokens) of ( true , tokens_after_expr) =>
Parsing Factor as Parenthesized Expression <factor> ::= ( <expr> ) (case tokens_after_expr of ( SOME Right_parenthesis , tokens_after_rparen ) => ( true , (tokens_after_rparen(), tokens_after_rparen))
What if No Right Parenthesis <factor> ::= ( <expr> ) | _ => ( false , tokens_after_expr))
What If No Expression After Left Parenthesis <factor> ::= ( <expr> ) | ( false , rem_tokens) => ( false , rem_tokens))
What If No Id or Left Parenthesis <factor> ::= <id> | ( <expr> ) | factor tokens = ( false , tokens)
Parsing - in C • Assume global variable nextToken that holds the latest token removed from token stream • Assume subroutine lex( ) to analyze the character stream, find the next token at the head of that stream and update nextToken with that token • Assume subroutine error( ) to raise an exception
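The assumptions above can be made concrete with a small sketch of lex( ). This is one possible reading, not the course's own lexer: the character stream is taken to be a C string, identifiers are taken to be a letter followed by letters or digits, and the names TIMES_CODE, DIVIDE_CODE, END_CODE, UNKNOWN_CODE and start_lex are hypothetical additions that do not appear on the slides. error( ) is left out here.

```c
#include <ctype.h>

/* Token codes. PLUS_CODE, MINUS_CODE, ID_CODE, LEFT_PAREN_CODE and
   RIGHT_PAREN_CODE appear on the slides; the other names are assumptions. */
enum { ID_CODE, PLUS_CODE, MINUS_CODE, TIMES_CODE, DIVIDE_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, UNKNOWN_CODE };

static const char *input = "";  /* unread part of the character stream (assumed to be a string) */
int nextToken = END_CODE;       /* latest token removed from the stream */

/* Find the next token at the head of the stream and store it in nextToken. */
void lex(void) {
    while (isspace((unsigned char)*input)) input++;   /* skip blanks */
    unsigned char c = (unsigned char)*input;
    if (c == '\0') { nextToken = END_CODE; return; }
    if (isalpha(c)) {                 /* assumed <id> shape: letter (letter | digit)* */
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID_CODE;
        return;
    }
    input++;                          /* every other token is one character long */
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = TIMES_CODE; break;
    case '/': nextToken = DIVIDE_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = UNKNOWN_CODE; break;
    }
}

/* Hypothetical helper: point the lexer at a string and load the first token. */
void start_lex(const char *text) { input = text; lex(); }
```

After start_lex("x1 + (y)"), nextToken holds ID_CODE, and each further call to lex( ) advances it through PLUS_CODE, LEFT_PAREN_CODE, ID_CODE, RIGHT_PAREN_CODE and finally END_CODE.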
Parsing expr – in C <expr> ::= <term> [( + | - ) <expr> ]
void expr ( ) {
  term ( );
  if (nextToken == PLUS_CODE) { lex ( ); expr ( ); }
  else if (nextToken == MINUS_CODE) { lex ( ); expr ( ); }
}
SML Code fun expr tokens = (case term tokens of ( true , tokens_after_term) => (case tokens_after_term of (SOME Plus_token,tokens_after_plus) => (case expr (tokens_after_plus(), tokens_after_plus) of ( true , tokens_after_expr) => ( true , tokens_after_expr)
Parsing expr – in C (optimized) <expr> ::= <term> [( + | - ) <expr> ] void expr ( ) { term( ); while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) { lex ( ); term ( ); } }
Parsing factor – in C <factor> ::= <id> void factor ( ) { if (nextToken == ID_CODE) lex ( );
Parsing Factor as Id <factor> ::= <id> and factor (SOME (Id_token id_name) , tokens) = ( true , (tokens(), tokens))
Parsing factor – in C <factor> ::= ( <expr> ) else if (nextToken == LEFT_PAREN_CODE) { lex ( ); expr ( ); if (nextToken == RIGHT_PAREN_CODE) lex ( );
Comparable SML Code | factor (SOME Left_parenthesis , tokens) = (case expr (tokens(), tokens) of ( true , tokens_after_expr) => (case tokens_after_expr of ( SOME Right_parenthesis , tokens_after_rparen ) => ( true , (tokens_after_rparen(), tokens_after_rparen))
Parsing factor – in C else error ( ); /* Right parenthesis missing */ } else error ( ); /* Neither <id> nor ( was found at start */ }
Error cases in SML (* No right parenthesis *) | _ => ( false , tokens_after_expr)) (* No expression found *) | ( false , rem_tokens) => ( false , rem_tokens)) (* Neither <id> nor left parenthesis found *) | factor tokens = ( false , tokens)
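Putting the C fragments together gives a complete recognizer for the sample grammar. The following is a sketch under stated assumptions rather than the course's code: the token stream is a pre-lexed array of token codes ending in END_CODE, error( ) is modeled with setjmp/longjmp because C has no exceptions, and the names TIMES_CODE, DIVIDE_CODE, END_CODE, stream, on_error and parses are hypothetical.

```c
#include <setjmp.h>

/* TIMES_CODE, DIVIDE_CODE and END_CODE are assumed names. */
enum { ID_CODE, PLUS_CODE, MINUS_CODE, TIMES_CODE, DIVIDE_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE };

static const int *stream;   /* remaining tokens, terminated by END_CODE (assumption) */
static int nextToken;       /* latest token removed from the stream */
static jmp_buf on_error;    /* where error() jumps back to */

/* Remove the next token from the stream; never move past END_CODE. */
static void lex(void) {
    nextToken = *stream;
    if (nextToken != END_CODE) stream++;
}

/* Stand-in for raising an exception: unwind to the setjmp in parses(). */
static void error(void) { longjmp(on_error, 1); }

static void expr(void);

/* <factor> ::= <id> | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();               /* right parenthesis missing */
    } else {
        error();                    /* neither <id> nor ( at start */
    }
}

/* <term> ::= <factor> [( * | / ) <term> ] */
static void term(void) {
    factor();
    if (nextToken == TIMES_CODE || nextToken == DIVIDE_CODE) { lex(); term(); }
}

/* <expr> ::= <term> [( + | - ) <expr> ] */
static void expr(void) {
    term();
    if (nextToken == PLUS_CODE || nextToken == MINUS_CODE) { lex(); expr(); }
}

/* Returns 1 if the whole token array is an <expr>, 0 otherwise. */
int parses(const int *tokens) {
    stream = tokens;
    if (setjmp(on_error)) return 0; /* reached only through error() */
    lex();
    expr();
    return nextToken == END_CODE;   /* leftover tokens count as failure */
}
```

Unlike the SML version, which threads a success flag and the remaining tokens through every return value, this version keeps the stream in global state and leaves the failure path to error( ).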
Lexers – Simple Parsers • Lexers are parsers driven by regular grammars • Use character codes and arithmetic comparisons rather than case analysis to determine syntactic category for each character • Often some semantic action must be taken • Compute a number or build a string and record it in a symbol table
Example • <pos> ::= <digit> <pos> | <digit> • <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 fun digit c = (case Char.ord c of n => if n >= Char.ord #"0" andalso n <= Char.ord #"9" then SOME (n - Char.ord #"0") else NONE)
Example
(* pos returns (SOME (p, m), rest): m is the value of the digits read, p is 10 raised to the number of digits read *)
fun pos [] = (NONE, [])
  | pos (chars as ch::rem_chars) =
    (case digit ch of
       NONE => (NONE, chars)
     | SOME n =>
       (case pos rem_chars of
          (NONE, more_chars) => (SOME (10, n), more_chars)
        | (SOME (p, m), more_chars) => (SOME (10*p, (p*n)+m), more_chars)))
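For comparison with the C parsing code, the same lexer can be sketched in C. This is an illustrative translation, not taken from the slides: it reads the digits left to right and accumulates the value directly, so it does not need the power of ten that the SML version returns. The signature of pos (a string in, a count of characters consumed out) is an assumption, and overflow of very long digit strings is not handled.

```c
#include <stddef.h>

/* <digit>: returns 0..9 for a decimal digit character, -1 for anything else. */
int digit(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* <pos>: reads the longest run of digits at the start of s.
   Returns the number of characters consumed (0 if s does not start with a
   digit). On success the semantic action stores the number in *value;
   on failure *value is left unchanged. */
size_t pos(const char *s, unsigned long *value) {
    size_t n = 0;
    unsigned long v = 0;
    int d;
    while ((d = digit(s[n])) >= 0) {
        v = 10 * v + (unsigned long)d;   /* shift earlier digits left, add the new one */
        n++;
    }
    if (n > 0) *value = v;
    return n;
}
```

Called on "123+4", pos consumes three characters and stores 123, leaving "+4" for the parser.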
Problems for Recursive-Descent Parsing • Left Recursion: A ::= Aw translates to a subroutine that loops forever • Indirect Left Recursion: A ::= Bw B ::= Av causes the same problem
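To see why the direct translation loops, and one common way around it, consider A ::= A w | b with single-character tokens. The sketch below is an illustration only: the function name parse_A and the use of a character string as input are assumptions. The commented-out version calls itself before consuming any input, so it never terminates; the working version recognizes the same language through the equivalent iterative form A ::= b { w }.

```c
/* Direct translation of A ::= A w | b (shown as a comment, do not call):
   the first action is a recursive call on the same input, so no token is
   ever consumed and the recursion never ends.

   int parse_A_naive(const char *s) {
       int n = parse_A_naive(s);
       ...
   }
*/

/* Same language via A ::= b { w }: one b followed by zero or more w.
   Returns the number of characters consumed, or -1 if s does not start
   with an A. */
int parse_A(const char *s) {
    int i = 0;
    if (s[i] != 'b') return -1;   /* every A must begin with b */
    i++;
    while (s[i] == 'w') i++;      /* each loop pass plays the role of one "A w" step */
    return i;
}
```

This is the same transformation that turns the recursive expr( ) in C into the version with a while loop.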
Problems for Recursive-Descent Parsing • Parser must always be able to choose the next action based only on the very next token • Pairwise Disjointedness Test: Can we always determine which rule (in the non-extended BNF) to choose based on just the first token?
Pairwise Disjointedness Test • For each rule A ::= y, calculate FIRST(y) = {a | y =>* aw} ∪ {ε | if y =>* ε} • For each pair of rules A ::= y and A ::= z, require FIRST(y) ∩ FIRST(z) = { } • Test too strong: Can't handle <expr> ::= <term> [ ( + | - ) <expr> ] (in non-extended BNF, both alternatives for <expr> begin with <term>)
Example Grammar:
<S> ::= <A> a <B> b
<A> ::= <A> b | b
<B> ::= a <B> | a
FIRST (<A> b) = {b}
FIRST (b) = {b}
Rules for <A> not pairwise disjoint
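The test can be run mechanically on this grammar. The sketch below is a simplified illustration, not a general tool: nonterminals are written as single uppercase letters and terminals as single lowercase letters, and because this grammar has no ε rules, FIRST of a right-hand side is taken to be FIRST of its first symbol. The names rules, first_nt and pairwise_disjoint are hypothetical.

```c
#include <string.h>

/* The example grammar; uppercase = nonterminal, lowercase = terminal. */
struct rule { char lhs; const char *rhs; };
static const struct rule rules[] = {
    {'S', "AaBb"}, {'A', "Ab"}, {'A', "b"}, {'B', "aB"}, {'B', "a"},
};
enum { NRULES = sizeof rules / sizeof rules[0] };

/* FIRST sets as bitmasks over 'a'..'z', indexed by nonterminal letter. */
static unsigned first_nt[26];

/* FIRST of a right-hand side; valid only because no rule derives ε. */
static unsigned first_of_rhs(const char *rhs) {
    char c = rhs[0];
    return (c >= 'a' && c <= 'z') ? 1u << (c - 'a') : first_nt[c - 'A'];
}

/* Grow the FIRST sets of the nonterminals until nothing changes. */
static void compute_first(void) {
    int changed = 1;
    memset(first_nt, 0, sizeof first_nt);
    while (changed) {
        changed = 0;
        for (int i = 0; i < NRULES; i++) {
            unsigned *set = &first_nt[rules[i].lhs - 'A'];
            unsigned merged = *set | first_of_rhs(rules[i].rhs);
            if (merged != *set) { *set = merged; changed = 1; }
        }
    }
}

/* Returns 1 if every pair of rules for lhs has disjoint FIRST sets. */
int pairwise_disjoint(char lhs) {
    compute_first();
    for (int i = 0; i < NRULES; i++)
        for (int j = i + 1; j < NRULES; j++)
            if (rules[i].lhs == lhs && rules[j].lhs == lhs &&
                (first_of_rhs(rules[i].rhs) & first_of_rhs(rules[j].rhs)))
                return 0;
    return 1;
}
```

For this grammar the check fails for A, as stated above, and it also fails for B, since both a <B> and a begin with a. It passes for S, which has only one rule.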