JavaCC CMSC 431 Spring 04
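What is a parser generator [Figure: a lexical + grammar specification goes into the parser generator (JavaCC), which produces a Scanner and a Parser; the example parse tree is an assignment, Total = Expr, where Expr is id + id (price, tax).]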
JavaCC • JavaCC (Java Compiler Compiler) is a scanner generator and a parser generator in one tool, which is unusual; • It produces a scanner and/or a parser written in Java, and is itself written in Java; • There are many parser generators: • yacc (Yet Another Compiler-Compiler) for the C programming language (see Dragon book, chapter 4.9); • Bison from gnu.org • There are also many parser generators written in Java: • JavaCUP (we'll look at this one later); • ANTLR; • SableCC
More on classification of Java parser generators • Bottom-up parser generator tools • JavaCUP; • jay, YACC for Java www.inf.uos.de/bernd/jay • SableCC, The Sable Compiler Compiler www.sablecc.org • Top-down parser generator tools • ANTLR, Another Tool for Language Recognition www.antlr.org • JavaCC, Java Compiler Compiler www.webgain.com/java_cc
Features of JavaCC • Top-down LL(k) parser generator • Lexical and grammar specifications in one file • Tree-building preprocessor • with JJTree • Extremely customizable • many different options selectable • Document generation • by using JJDoc • Internationalized • can handle full Unicode • Syntactic and semantic lookahead
Features of JavaCC (cont'd) • Permits extended BNF specifications • can use | * ? + () on the right-hand side. • Lexical states and lexical actions • Case-insensitive lexical analysis • Extensive debugging capability • Special tokens • Very good error reporting
JavaCC Installation • Download the file javacc-3.X.zip from https://javacc.dev.java.net/ • Follow the link that says Download, or go directly to https://javacc.dev.java.net/servlets/ProjectDocumentList • Unzip javacc-3.X.zip to a directory %JCC_HOME% • Add the %JCC_HOME%\bin directory to your %path%. • javacc, jjtree and jjdoc may now be invoked directly from the command line.
Steps to use JavaCC • Write a JavaCC specification (.jj file) • It defines the grammar and actions in one file (say, calc.jj) • Run JavaCC to generate a scanner and a parser • javacc calc.jj • This generates the parser, scanner, token, … Java sources • Write your program that uses the parser • For example, UseParser.java • Compile and run your program • javac -classpath . *.java • java -cp . mainpackage.MainClass
Example 1 • Grammar : re.jj • Example input • % all strings ending in "ab" • (a|b)*ab; • aba; • ababb; • Our tasks: • Parse the specification of a regular expression (line 2 of the input). • For each input string (lines 3 and 4), determine whether it matches that regular expression.
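As a reference for what the finished tool should answer on this input, the same check can be sketched with the JDK regex engine. This is only an oracle for the expected results, not the FA-based pipeline the slides build; the class name EndsInAbCheck is made up for illustration.

```java
import java.util.regex.Pattern;

/** Reference check for the re.jj task using java.util.regex (not the FA pipeline from the slides). */
final class EndsInAbCheck {
    // Same language as the spec line "(a|b)*ab": strings over {a, b} that end in "ab".
    private static final Pattern SPEC = Pattern.compile("(a|b)*ab");

    /** True when the whole input string is in the language of the spec. */
    static boolean accepts(String input) {
        return SPEC.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aba", "ababb", "ab", "bbab"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Both sample inputs from the slide, aba and ababb, are rejected because neither ends in "ab"; ab and bbab are accepted.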
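The overall picture [Figure: javacc reads re.jj and generates REParserTokenManager and REParser; the input (% comment, (a|b)*ab; a; ab;) is turned into tokens by REParserTokenManager, the tokens go to REParser, and MainClass obtains the result.]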
Format of a JavaCC input Grammar • javacc_options • PARSER_BEGIN ( <IDENTIFIER>1 ) java_compilation_unit PARSER_END ( <IDENTIFIER>2 ) • ( production )*
the input spec file (re.jj)
  options {
    USER_TOKEN_MANAGER = false;
    BUILD_TOKEN_MANAGER = true;
    OUTPUT_DIRECTORY = "./reparser";
    STATIC = false;
  }
re.jj
  PARSER_BEGIN(REParser)
  package reparser;
  import java.lang.*;
  …
  import dfa.*;

  public class REParser {
    public FA tg = new FA();

    // output an error message
    public static void msg(String s) {
      System.out.println("ERROR" + s);
    }

    public static void main(String args[]) throws Exception {
      REParser reparser = new REParser(System.in);
      reparser.S();
    }
  }
  PARSER_END(REParser)
re.jj (Token definition)
  TOKEN : {
      <SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
    | <EPSILON: "epsilon" >
    | <LPAREN: "(" >
    | <RPAREN: ")" >
    | <OR: "|" >
    | <STAR: "*" >
    | <SEMI: ";" >
  }
  SKIP : {
      < ( [" ","\t","\n","\r","\f"] )+ >
    | < "%" ( ~["\n"] )* "\n" > { System.out.println(image); }
  }
re.jj (productions)
  void S() : { FA d1; }
  {
    d1 = R() <SEMI>
    {
      tg = d1;
      System.out.println("------NFA");
      tg.print();
      System.out.println("------DFA");
      tg = tg.NFAtoDFA();
      tg.print();
      System.out.println("------Minimize");
      tg = tg.minimize();
      tg.print();
      System.out.println("------Renumber");
      tg = tg.renumber();
      tg.print();
      System.out.println("------Execute");
    }
    testCases()
  }
re.jj
  void testCases() : {}
  {
    ( testCase() )+
  }

  void testCase() : { String testInput; }
  {
    testInput = symbols() <SEMI>
    { tg.execute(testInput); }
  }

  String symbols() :
  {
    Token token = null;
    StringBuffer result = new StringBuffer();
  }
  {
    ( token = <SYMBOL> { result.append(token.image); } )*
    { return result.toString(); }
  }
re.jj (regular expression)
  // R --> RUnit | RConcat | RChoice
  FA R() : { FA result; }
  {
    result = RChoice()
    { return result; }
  }

  FA RUnit() : { FA result; Token d1; }
  {
    (   <LPAREN> result = RChoice() <RPAREN>
      | <EPSILON> { result = tg.epsilon(); }
      | d1 = <SYMBOL> { result = tg.symbol(d1.image); }
    )
    { return result; }
  }
re.jj
  FA RChoice() : { FA result, temp; }
  {
    result = RConcat()
    ( <OR> temp = RConcat() { result = result.choice(temp); } )*
    { return result; }
  }

  FA RConcat() : { FA result, temp; }
  {
    result = RStar()
    ( temp = RStar() { result = result.concat(temp); } )*
    { return result; }
  }

  FA RStar() : { FA result; }
  {
    result = RUnit()
    ( <STAR> { result = result.closure(); } )*
    { return result; }
  }
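The layering RChoice, RConcat, RStar, RUnit is what gives | the lowest precedence and * the highest. A hand-written recursive-descent sketch of the same four productions makes that visible. It is an illustration only: it returns a prefix string instead of calling the FA class from the slides (whose source is not shown), it leaves out the epsilon alternative, and the name RegexShape is hypothetical.

```java
/** Hand-written mirror of RChoice/RConcat/RStar/RUnit; builds a prefix string instead of an FA. */
final class RegexShape {
    private final String src;
    private int pos = 0;

    private RegexShape(String src) { this.src = src; }

    /** Parses one regular expression and returns its structure as nested or/cat/star calls. */
    static String parse(String text) {
        RegexShape p = new RegexShape(text);
        String result = p.choice();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // Same character set as the SYMBOL token: ["0"-"9","a"-"z","A"-"Z"]
    private static boolean isSymbol(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // RChoice: RConcat ( "|" RConcat )*
    private String choice() {
        String result = concat();
        while (peek() == '|') {
            pos++;
            result = "or(" + result + "," + concat() + ")";
        }
        return result;
    }

    // RConcat: RStar ( RStar )*   -- loop while the next character can start an RUnit
    private String concat() {
        String result = star();
        while (peek() == '(' || isSymbol(peek())) {
            result = "cat(" + result + "," + star() + ")";
        }
        return result;
    }

    // RStar: RUnit ( "*" )*
    private String star() {
        String result = unit();
        while (peek() == '*') {
            pos++;
            result = "star(" + result + ")";
        }
        return result;
    }

    // RUnit: "(" RChoice ")" | SYMBOL   (EPSILON omitted to keep the sketch short)
    private String unit() {
        char c = peek();
        if (c == '(') {
            pos++;
            String inner = choice();
            if (peek() != ')') throw new IllegalArgumentException("missing ')' at " + pos);
            pos++;
            return inner;
        }
        if (isSymbol(c)) {
            pos++;
            return String.valueOf(c);
        }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }

    public static void main(String[] args) {
        System.out.println(parse("(a|b)*ab")); // prints cat(cat(star(or(a,b)),a),b)
    }
}
```

Each loop corresponds to one ( ... )* in the grammar, and each decision looks at a single upcoming character, just as the generated parser looks at a single upcoming token.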
Format of a JavaCC input Grammar javacc_input ::=javacc_options PARSER_BEGIN (<IDENTIFIER>1) java_compilation_unit PARSER_END ( <IDENTIFIER>2 ) ( production )* <EOF> color usage: • blue --- nonterminal • <orange> – a token type • purple --- token lexeme ( reserved word; • I.e., consisting of the literal itself.) • black -- meta symbols
Notes • <IDENTIFIER> means any Java identifier, such as var, class2, … • IDENTIFIER means the literal word IDENTIFIER only. • <IDENTIFIER>1 must be equal to <IDENTIFIER>2. • java_compilation_unit is any Java code that, as a whole, can legally appear in a file. • It must contain a main class declaration with the same name as <IDENTIFIER>1. • Ex: PARSER_BEGIN ( MyParser ) package mypackage; import myotherpackage….; public class MyParser { … } class MyOtherUsefulClass { … } … PARSER_END (MyParser)
The input and output of javacc [Figure: javacc takes MyLangSpec.jj as input, with contents PARSER_BEGIN ( MyParser ) package mypackage; import myotherpackage….; public class MyParser { … } class MyOtherUsefulClass { … } … PARSER_END (MyParser), and generates Token.java, ParseError.java, MyParser.java, MyParserTokenManager.java and MyParserConstants.java.]
Notes: • Token.java and ParseError.java are the same for every input and can be reused. • The package declaration in the *.jj file is copied to all 3 grammar-specific outputs. • The import declarations in the *.jj file are copied to the parser and token manager files. • The parser file is assigned the file name <IDENTIFIER>1 .java • The parser file has contents: …class MyParser { … //generated parser is inserted here. … } • The generated token manager provides one public method: Token getNextToken() throws ParseError;
javacc options javacc_options ::= [ options{ ( option_binding )* } ] • option_binding are of the form : • <IDENTIFIER>3=<java_literal>; • where <IDENTIFIER>3 is not case-sensitive. • Ex: options { USER_TOKEN_MANAGER=true; BUILD_TOKEN_MANAGER=false; OUTPUT_DIRECTORY="./sax2jcc/personnel"; STATIC=false; }
More Options • LOOKAHEAD • java_integer_literal (1) • CHOICE_AMBIGUITY_CHECK • java_integer_literal (2) for A | B … | C • OTHER_AMBIGUITY_CHECK • java_integer_literal (1) for (A)*, (A)+ and (A)? • STATIC (true) • DEBUG_PARSER (false) • DEBUG_LOOKAHEAD (false) • DEBUG_TOKEN_MANAGER (false) • OPTIMIZE_TOKEN_MANAGER • java_boolean_literal (false) • OUTPUT_DIRECTORY (current directory) • ERROR_REPORTING (true)
More Options • JAVA_UNICODE_ESCAPE (false) • replace an escape such as \u2245 with the actual Unicode character (6 chars → 1 char) • UNICODE_INPUT (false) • input stream is in Unicode form • IGNORE_CASE (false) • USER_TOKEN_MANAGER (false) • generate a TokenManager interface for the user's own scanner • USER_CHAR_STREAM (false) • generate a CharStream.java interface for the user's own input stream • BUILD_PARSER (true) • java_boolean_literal • BUILD_TOKEN_MANAGER (true) • SANITY_CHECK (true) • FORCE_LA_CHECK (false) • COMMON_TOKEN_ACTION (false) • invoke void CommonTokenAction(Token t) after every getNextToken() • CACHE_TOKENS (false)
Example: Figure 2.2 • if → IF • [a-z][a-z0-9]* → ID • [0-9]+ → NUM • ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+) → REAL • ("--"[a-z]*"\n") | (" " | "\n" | "\t")+ → no token (comment, white space) • . → error • JavaCC notations • "if" or "i" "f" or ["i"]["f"] • ["a"-"z"] ( ["a"-"z","0"-"9"] )* • ( ["0"-"9"] )+ • ( ["0"-"9"] )+ "." ( ["0"-"9"] )* | ( ["0"-"9"] )* "." ( ["0"-"9"] )+
JavaCC Spec for Some Tokens
  PARSER_BEGIN(MyParser)
  class MyParser {}
  PARSER_END(MyParser)

  /* For the regular expression on the right, the token on the left will be returned */
  TOKEN : {
      < IF: "if" >
    | < #DIGIT: ["0"-"9"] >
    | < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
    | < NUM: (<DIGIT>)+ >
    | < REAL: ( (<DIGIT>)+ "." (<DIGIT>)* ) | ( (<DIGIT>)* "." (<DIGIT>)+ ) >
  }
Continued
  /* The regular expressions here will be skipped during lexical analysis */
  SKIP : { < " " > | < "\t" > | < "\n" > }

  /* like SKIP, but the skipped text is accessible from parser actions */
  SPECIAL_TOKEN : {
    < "--" ( ["a"-"z"] )* ( "\n" | "\r" | "\r\n" ) >
  }

  /* For any substring not matching the lexical spec, javacc will throw an error */

  /* main rule */
  void start() : {}
  { ( <IF> | <ID> | <NUM> | <REAL> )* }
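The token manager generated from such a spec picks the longest match at each position and, among matches of equal length, the rule listed first. A minimal sketch of that selection rule, using java.util.regex in place of the generated automaton, is shown below. The class name TinyScanner and the KIND:image output format are made up for illustration, and the "--" comment rule is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Longest-match scanner sketch for the IF / ID / NUM / REAL rules; earlier rules win ties. */
final class TinyScanner {
    private static final String[] KINDS = {"IF", "ID", "NUM", "REAL", "SKIP"};
    private static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"),
        Pattern.compile("[ \\t\\n]"),
    };

    /** Returns the tokens of the input as "KIND:image" strings; skipped text is dropped. */
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strictly longer only: on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            if (!KINDS[bestRule].equals("SKIP")) {
                out.add(KINDS[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("if if8 12 12.5")); // prints [IF:if, ID:if8, NUM:12, REAL:12.5]
    }
}
```

The input if matches both IF and ID with length 2, so the earlier rule IF wins; if8 is longer as an ID, so ID wins there. This is why keyword tokens are listed before the identifier token.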
The Form of a Production
  java_return_type java_identifier ( java_parameter_list ) : java_block { expansion_choices }
• EX :
  void XMLDocument(Logger logger) :
  { int msg = 0; }
  {
      <StartDoc> { print(token); } Element(logger) <EndDoc> { print(token); }
    | else()
  }
Example ( Grammar 3.30 ) • P → L • S → id := id • S → while id do S • S → begin L end • S → if id then S • S → if id then S else S • L → S • L → L ; S • Rules 1, 7 and 8 together give: P → S ( ; S )*
JavaCC Version of Grammar 3.30
  PARSER_BEGIN(MyParser)
  public class MyParser {}
  PARSER_END(MyParser)

  SKIP : { " " | "\t" | "\n" }

  TOKEN : {
      <WHILE: "while"> | <BEGIN: "begin"> | <END: "end">
    | <DO: "do"> | <IF: "if"> | <THEN: "then">
    | <ELSE: "else"> | <SEMI: ";"> | <ASSIGN: "=">
    | <#LETTER: ["a"-"z"]>
    | <ID: <LETTER> ( <LETTER> | ["0"-"9"] )* >
  }
JavaCC Version of Grammar 3.30 (cont'd)
  void Prog() : { }
  { StmList() <EOF> }

  void StmList() : { }
  { Stm() ( ";" Stm() )* }

  void Stm() : { }
  {
      <ID> "=" <ID>
    | "while" <ID> "do" Stm()
    | <BEGIN> StmList() <END>
    | "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
  }
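For each of these productions JavaCC emits one Java method that inspects the next token and then calls the methods of the nonterminals on the right-hand side. A hand-written sketch of that shape follows. It is not the code JavaCC generates: the class name Grammar330 is hypothetical, and tokens are assumed to be separated by white space so that no scanner is needed.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-written LL(1) recognizer with the same shape as Prog / StmList / Stm. */
final class Grammar330 {
    private static final List<String> KEYWORDS =
            Arrays.asList("while", "begin", "end", "do", "if", "then", "else");

    private final List<String> tokens;
    private int pos = 0;

    private Grammar330(List<String> tokens) { this.tokens = tokens; }

    /** True when the white-space-separated token string is a valid program. */
    static boolean accepts(String program) {
        Grammar330 p = new Grammar330(Arrays.asList(program.trim().split("\\s+")));
        try {
            p.stmList();
            return p.pos == p.tokens.size();   // Prog: StmList() <EOF>
        } catch (IllegalStateException e) {
            return false;
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }

    private static boolean isId(String t) {
        return t.matches("[a-z][a-z0-9]*") && !KEYWORDS.contains(t);
    }

    /** Consumes one token of the given kind ("ID" or a literal) or fails. */
    private void expect(String kind) {
        String t = peek();
        boolean ok = kind.equals("ID") ? isId(t) : kind.equals(t);
        if (!ok) throw new IllegalStateException("expected " + kind + " but saw " + t);
        pos++;
    }

    // StmList: Stm ( ";" Stm )*
    private void stmList() {
        stm();
        while (peek().equals(";")) {
            pos++;
            stm();
        }
    }

    // Stm: the next token alone selects the alternative
    private void stm() {
        String t = peek();
        if (t.equals("while")) {
            pos++; expect("ID"); expect("do"); stm();
        } else if (t.equals("begin")) {
            pos++; stmList(); expect("end");
        } else if (t.equals("if")) {
            pos++; expect("ID"); expect("then"); stm();
            if (peek().equals("else")) {       // [ LOOKAHEAD(1) "else" Stm() ]
                pos++; stm();                  // the else attaches to the nearest if
            }
        } else {
            expect("ID"); expect("="); expect("ID");
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("if a then if b then c = d else e = f"));
    }
}
```

The optional else branch is taken whenever the next token is else, which resolves the dangling-else ambiguity in favour of the innermost if; the explicit LOOKAHEAD(1) in the grammar states the same choice.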
Types of productions • production ::= javacode_production | regular_expr_production | bnf_production | token_manager_decls Note: 1 and 3 are used to define the grammar. 2 is used to define tokens. 4 is used to embed code into the token manager.
JAVACODE production • javacode_production ::= "JAVACODE" java_return_type java_id "(" java_param_list ")" java_block • Note: • Used to define nonterminals for recognizing something that is hard to parse using normal productions.
Example JAVACODE
  JAVACODE
  void skip_to_matching_brace() {
    Token tok;
    int nesting = 1;
    while (true) {
      tok = getToken(1);
      if (tok.kind == LBRACE) nesting++;
      if (tok.kind == RBRACE) {
        nesting--;
        if (nesting == 0) break;
      }
      tok = getNextToken();
    }
  }
Note: • Do not use a nonterminal defined by JAVACODE at a choice point without giving a LOOKAHEAD; the generator cannot look inside the Java code to decide which alternative to take. • Wrong: void NT() : {} { skip_to_matching_brace() | some_other_production() } • OK: void NT() : {} { "{" skip_to_matching_brace() | "(" parameter_list() ")" }
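The counting logic of skip_to_matching_brace can be tried outside a parser. The sketch below runs it over a plain list of token images; getToken(1) becomes a peek at the list and getNextToken() becomes an index increment. The class name BraceSkipper is hypothetical, and the end-of-input check is an addition that the JAVACODE example does not have.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone version of the nesting counter used by skip_to_matching_brace. */
final class BraceSkipper {
    /**
     * Returns the index of the "}" that matches an already consumed "{".
     * start is the index of the first token after that "{".
     */
    static int skipToMatchingBrace(List<String> tokens, int start) {
        int nesting = 1;
        int pos = start;
        while (pos < tokens.size()) {          // added guard: stop at end of input
            String tok = tokens.get(pos);      // like getToken(1): look without consuming
            if (tok.equals("{")) nesting++;
            if (tok.equals("}")) {
                nesting--;
                if (nesting == 0) return pos;  // stop in front of the matching brace
            }
            pos++;                             // like getNextToken(): consume
        }
        throw new IllegalStateException("unbalanced braces");
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("a", "{", "b", "}", "c", "}", "d");
        System.out.println(skipToMatchingBrace(body, 0)); // prints 5
    }
}
```

As in the JAVACODE example, the matching brace itself is left unconsumed, so the production that called the method can still match it.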
TOKEN_MGR_DECLS token_manager_decls ::= TOKEN_MGR_DECLS : java_block • The token manager declarations start with the reserved word "TOKEN_MGR_DECLS", followed by a ":" and then a set of Java declarations and statements (the Java block). • These declarations and statements are written into the generated token manager (MyParserTokenManager.java) and are accessible from within lexical actions. • There can be only one token manager declaration in a JavaCC grammar file.
regular_expression_production regular_expr_production ::= [ lexical_state_list ] regexpr_kind [ [IGNORE_CASE] ] : { regexpr_spec ( | regexpr_spec )* } • regexpr_kind ::= TOKEN | SPECIAL_TOKEN | SKIP | MORE • TOKEN is used to define normal tokens. • SKIP is used to define skipped tokens (not passed on to the parser). • MORE is used to define semi-tokens (i.e. only part of a token). • SPECIAL_TOKEN lies between TOKEN and SKIP: it is passed on to the parser and is accessible to parser actions, but it is ignored by the production rules (not counted as a token). Useful for representing comments.
lexical_state_list lexical_state_list ::= < * > | < java_identifier ( , java_identifier )* > • The lexical state list describes the set of lexical states for which the corresponding regular expression production applies. • If this is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angle brackets. • If omitted, the DEFAULT lexical state is assumed.
regexpr_spec regexpr_spec::= regular_expression1 [ java_block ] [ :java_identifier ] • Meaning: • When a regular_expression1 is matched then • if java_block exists then execute it • if java_identifier appears, then transition to that lexical state.
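Lexical states amount to a small state machine inside the token manager: every rule belongs to one or more states, and a rule may name the state to switch to after it matches. A hand-written sketch with two states is given below. The state IN_COMMENT, the WORD token and the class name LexStateDemo are made up for illustration; the comments show the JavaCC rule each branch stands for.

```java
import java.util.ArrayList;
import java.util.List;

/** Two lexical states, DEFAULT and IN_COMMENT, written out as an explicit state machine. */
final class LexStateDemo {
    enum State { DEFAULT, IN_COMMENT }

    /** Returns the words that occur outside of comments. */
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (state == State.DEFAULT) {
                if (input.startsWith("/*", pos)) {
                    // <DEFAULT> SKIP : { "/*" : IN_COMMENT }
                    state = State.IN_COMMENT;
                    pos += 2;
                } else if (Character.isLetter(input.charAt(pos))) {
                    // <DEFAULT> TOKEN : { < WORD: ( letter )+ > }
                    int end = pos;
                    while (end < input.length() && Character.isLetter(input.charAt(end))) end++;
                    out.add(input.substring(pos, end));
                    pos = end;
                } else {
                    // Simplification: skip any other character instead of reporting an error.
                    pos++;
                }
            } else {
                if (input.startsWith("*/", pos)) {
                    // <IN_COMMENT> SKIP : { "*/" : DEFAULT }
                    state = State.DEFAULT;
                    pos += 2;
                } else {
                    // <IN_COMMENT> SKIP : { < ~[] > }
                    pos++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("a /* b c */ d")); // prints [a, d]
    }
}
```

The words b and c are never reported because the WORD rule does not apply in the IN_COMMENT state.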
regular_expression regular_expression ::= java_string_literal | < [ [#] java_identifier : ] complex_regular_expression_choices > | <java_identifier> | <EOF> • (1) java_string_literal is matched only by the string denoted by itself; used for unnamed tokens. • (2) is used to define a regular expression with an optional label; if # occurs, the label is not visible outside the token definitions. • (3) <java_identifier> is a reference to another labeled regular_expression; used in bnf_production. • (4) <EOF> is matched by end-of-file only.
Example
  <DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
    < FLOATING_POINT_LITERAL:
          (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
        | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
        | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
        | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"]
    >
    {
      // do Something
    } : LEX_ST1
  | < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
  }
• Note: if # is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.
Structure of complex_regular_expression • complex_regular_expression_choices ::= complex_regular_expression ( | complex_regular_expression )* • complex_regular_expression ::= ( complex_regular_expression_unit )* • complex_regular_expression_unit ::= java_string_literal | "<" java_identifier ">" | character_list | ( complex_regular_expression_choices ) [+|*|?] • Note: • units are combined by concatenation (juxtaposition) into a complex_regular_expression • complex_regular_expressions are combined by choice ( | ) into complex_regular_expression_choices • ( complex_regular_expression_choices ) [+|*|?] is again a unit
character_list character_list ::= [~] [ [ character_descriptor ( , character_descriptor )* ] ] character_descriptor ::= java_string_literal [ - java_string_literal ] java_string_literal ::= // reference to java grammar "singleCharString*" note: java_string_literal here is restricted to length 1. ex: • ~["a","b"] --- all chars but a and b. • ["a"-"f", "0"-"9", "A","B","C","D","E","F"] --- hexadecimal digit. • ["a","b"]+ is not a complex_regular_expression_unit. Why? The operators + * ? may follow only a parenthesized group. • It should be written ( ["a","b"] )+ instead.
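Readers who know java.util.regex can map each character_list to a regex character class. The sketch below shows the three examples from this slide side by side with a regex counterpart; the class and field names are made up for illustration, and the mapping is an analogy rather than something JavaCC does internally.

```java
import java.util.regex.Pattern;

/** java.util.regex counterparts of the character_list examples. */
final class CharLists {
    // ~["a","b"]                                    ->  [^ab]
    static final Pattern NOT_A_OR_B = Pattern.compile("[^ab]");
    // ["a"-"f","0"-"9","A","B","C","D","E","F"]    ->  [a-f0-9A-F]
    static final Pattern HEX_DIGIT = Pattern.compile("[a-f0-9A-F]");
    // ( ["a","b"] )+                                ->  [ab]+
    static final Pattern AB_RUN = Pattern.compile("[ab]+");

    /** True when the whole string matches the pattern. */
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(HEX_DIGIT, "F") + " " + matches(HEX_DIGIT, "g"));
    }
}
```

One difference to keep in mind: a regex accepts a postfix operator directly after a character class, whereas JavaCC requires the parentheses.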
bnf_production • bnf_production ::= java_return_type java_identifier "(" java_parameter_list ")" ":" java_block "{" expansion_choices "}" • expansion_choices ::= expansion ( "|" expansion )* • expansion ::= ( expansion_unit )*
expansion_unit • expansion_unit ::= local_lookahead | java_block | "(" expansion_choices ")" [ "+" | "*" | "?" ] | "[" expansion_choices "]" | [ java_assignment_lhs "=" ] regular_expression | [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")" Notes: 1 is for lookahead; 2 is for a semantic action; 4 is the same as ( … )?; 5 is for matching a token; 6 is for matching another nonterminal.
lookahead • local_lookahead ::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")" • Notes: • 3 components: maximum number of lookahead tokens + syntax + semantics • examples: • LOOKAHEAD(3) • LOOKAHEAD(5, Expr() <INT> | <REAL> , { true} ) • More on LOOKAHEAD • see the mini-tutorial
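A numeric lookahead is needed when two alternatives start with the same token. As a sketch, assume a hypothetical rule Stmt : LOOKAHEAD(2) Call() | Assign(), where Call starts with ID "(" and Assign starts with ID "=". The decision the parser makes at that choice point can be written by hand as follows; the grammar and the class name LookaheadDemo are not taken from the slides.

```java
import java.util.Arrays;
import java.util.List;

/** The decision behind LOOKAHEAD(2) Call() | Assign(), written by hand. */
final class LookaheadDemo {
    /** The k-th upcoming token (k >= 1), or "<EOF>" past the end; nothing is consumed. */
    private static String peek(List<String> tokens, int k) {
        return k - 1 < tokens.size() ? tokens.get(k - 1) : "<EOF>";
    }

    private static boolean isId(String t) {
        return t.matches("[a-z][a-z0-9]*");
    }

    /** Returns which alternative would be entered for the given upcoming tokens. */
    static String classify(String... upcoming) {
        List<String> tokens = Arrays.asList(upcoming);
        // LOOKAHEAD(2): enter Call() only if the next two tokens are ID "(".
        if (isId(peek(tokens, 1)) && peek(tokens, 2).equals("(")) {
            return "call";
        }
        // Last alternative: the default single-token check is enough.
        if (isId(peek(tokens, 1))) {
            return "assign";
        }
        return "error";
    }

    public static void main(String[] args) {
        System.out.println(classify("f", "(", ")"));   // prints call
        System.out.println(classify("x", "=", "y"));   // prints assign
    }
}
```

With only one token of lookahead both inputs look the same (an ID), so the first alternative would always be chosen; the second token is what separates them.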
JavaCC API • Non-Terminals in the Input Grammar • NT is a nonterminal => returntype NT(parameters) throws ParseError; is generated in the parser class • API for Parser Actions • Token token; • variable always holds the last token and can be used in parser actions. • exactly the same as the token returned by getToken(0). • two other methods - getToken(int i) and getNextToken() can also be used in actions to traverse the token list.
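The relation between token, getToken(int) and getNextToken() can be modelled with a list and an index. The sketch below uses plain strings instead of Token objects; the class name TokenBuffer and the placeholder values "<none>" and "<EOF>" are assumptions made for illustration, not part of the generated API.

```java
import java.util.Arrays;
import java.util.List;

/** Index-based model of getNextToken() and getToken(int). */
final class TokenBuffer {
    private final List<String> images;
    private int current = -1;   // index of the last consumed token; -1 before the first one

    TokenBuffer(String... images) {
        this.images = Arrays.asList(images);
    }

    /** Consumes the next token and returns it; in the parser this also updates token. */
    String getNextToken() {
        if (current < images.size()) current++;
        return getToken(0);
    }

    /** getToken(0) is the last consumed token; getToken(i) looks i tokens ahead without consuming. */
    String getToken(int i) {
        int idx = current + i;
        if (idx < 0) return "<none>";
        return idx < images.size() ? images.get(idx) : "<EOF>";
    }

    public static void main(String[] args) {
        TokenBuffer b = new TokenBuffer("a", "+", "b");
        System.out.println(b.getToken(1));     // prints a (nothing consumed yet)
        System.out.println(b.getNextToken());  // prints a
        System.out.println(b.getToken(2));     // prints b
    }
}
```

Looking ahead with getToken(i) never changes what getNextToken() returns next, which is what makes it safe to use inside parser actions.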