This resource provides an in-depth introduction to compilers, outlining their purpose in translating source code from high-level programming languages like Java and C++ to target code in lower-level formats such as machine code or assembly. It covers essential stages in compiler design, including lexical analysis, parsing, semantic analysis, and intermediate code generation. Additionally, it offers insights into syntax and semantics using BNF and EBNF, as well as examples of error checking during semantic analysis. Ideal for students and professionals aiming to grasp compiler functionality.
Introduction to Compilers CMSC 431 Shon Vick 01/28/02
What is a compiler? • Translates source code to target code • Source code is typically written in a high-level programming language (Java, C++, etc.) but does not have to be • Target code is often a low-level language such as assembly or machine code but does not have to be • According to this definition, can you think of other compilers that you have used?
Other Compilers • Javadoc -> HTML • SQL Query output -> Table • PostScript -> PDF • High level description of a circuit -> machine instructions to fabricate circuit
The Analysis Stage • Broken up into four phases • Lexical Analysis (also called scanning or tokenization) • Parsing • Semantic Analysis • Intermediate Code Generation
Lexing Example
  double d1; double d2; d2 = d1 * 2.0;

  Lexeme   Token             Notes
  double   TOK_DOUBLE        reserved word
  d1       TOK_ID            variable name
  ;        TOK_PUNCT         has value of ";"
  double   TOK_DOUBLE        reserved word
  d2       TOK_ID            variable name
  ;        TOK_PUNCT         has value of ";"
  d2       TOK_ID            variable name
  =        TOK_OPER          has value of "="
  d1       TOK_ID            variable name
  *        TOK_OPER          has value of "*"
  2.0      TOK_FLOAT_CONST   has value of 2.0
  ;        TOK_PUNCT         has value of ";"
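The token table above can be reproduced with a small regular-expression scanner. This is a minimal sketch, not the course's lexer: the token names come from the slide, while the patterns, the `TOKEN_SPEC` table, and the `tokenize` function are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
# Order matters: the reserved word must be tried before the identifier rule.
TOKEN_SPEC = [
    ("TOK_FLOAT_CONST", r"\d+\.\d+"),
    ("TOK_DOUBLE",      r"\bdouble\b"),
    ("TOK_ID",          r"[A-Za-z_]\w*"),
    ("TOK_OPER",        r"[=*+\-/]"),
    ("TOK_PUNCT",       r";"),
    ("SKIP",            r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace separates lexemes but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token, lexeme in tokenize("double d1; double d2; d2 = d1 * 2.0;"):
    print(f"{lexeme:8} {token}")
```

Listing the reserved word before the identifier pattern is one simple way to keep `double` from being scanned as a variable name; real scanners often look identifiers up in a keyword table instead.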
Syntax and Semantics • Syntax - the form or structure of the expressions – whether an expression is well formed • Semantics – the meaning of an expression
Syntactic Structure • Syntax is almost always expressed using some variant of a notation called a context-free grammar (CFG) or simply grammar • BNF • EBNF
A CFG has 4 parts • A set of tokens (lexemes), known as terminal symbols • A set of non-terminals • A set of rules (productions) where each production consists of a left-hand side (LHS) and a right-hand side (RHS) The LHS is a non-terminal and the RHS is a sequence of terminals and/or non-terminal symbols. • A special non-terminal symbol designated as the start symbol
An example of BNF syntax for real numbers
  <r>  ::= <ds> . <ds>
  <ds> ::= <d> | <d> <ds>
  <d>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Notation
  < >  encloses non-terminal symbols
  ::=  'is' or 'is made up of' or 'derives' (sometimes denoted with an arrow ->)
  |    or
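Each production of the real-number grammar can be turned into one function that reports how much input it consumed. The sketch below is only an illustration of that idea; the function names `match_d`, `match_ds`, and `is_real` are hypothetical and do not come from the slides.

```python
DIGITS = set("0123456789")


def match_d(text, pos):
    """<d> ::= 0 | 1 | ... | 9. Return the position after one digit, or None."""
    if pos < len(text) and text[pos] in DIGITS:
        return pos + 1
    return None


def match_ds(text, pos):
    """<ds> ::= <d> | <d> <ds>. The recursion consumes a whole run of digits."""
    after_digit = match_d(text, pos)
    if after_digit is None:
        return None
    rest = match_ds(text, after_digit)  # try the longer alternative <d> <ds>
    return rest if rest is not None else after_digit


def is_real(text):
    """<r> ::= <ds> . <ds>. True when the entire string derives from <r>."""
    pos = match_ds(text, 0)
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False
    pos = match_ds(text, pos + 1)
    return pos == len(text)


print(is_real("3.14"), is_real("3."), is_real(".5"))
```

Note how the list of digits is handled by recursion in `match_ds`, mirroring the recursive production rather than using a loop.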
Example • On the example from the previous slide: • What are the tokens? • What are the lexemes? • What are the non-terminals? • What are the productions?
BNF Points • A non-terminal can have more than one RHS, or an OR (|) can be used to combine the alternatives • Lists or sequences are expressed via recursion • A derivation is just a repeated application of productions (rules) • Examples
Example Grammar
  <program> -> <stmts>
  <stmts>   -> <stmt> | <stmt> ; <stmts>
  <stmt>    -> <var> = <expr>
  <var>     -> a | b | c | d
  <expr>    -> <term> + <term> | <term> - <term>
  <term>    -> <var> | const
Example Derivation
  <program> => <stmts>
            => <stmt>
            => <var> = <expr>
            => a = <expr>
            => a = <term> + <term>
            => a = <var> + <term>
            => a = b + <term>
            => a = b + const
Parse Trees • Alternative representation for a derivation • Example parse tree for the previous example
  stmts
    stmt
      var
        a
      =
      expr
        term
          var
            b
        +
        term
          const
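A parse tree like this one can be built mechanically by writing one function per non-terminal of the example grammar. The following is a hedged sketch of a recursive-descent parser under two assumptions: the input is already split into tokens, and trees are represented as nested tuples. All function names are illustrative.

```python
def peek(tokens, pos):
    """Return the token at pos, or None at end of input."""
    return tokens[pos] if pos < len(tokens) else None


def expect(tokens, pos, symbol):
    if peek(tokens, pos) != symbol:
        raise SyntaxError(f"expected {symbol!r}, found {peek(tokens, pos)!r}")
    return pos + 1


def parse_program(tokens):
    # <program> -> <stmts>
    tree, pos = parse_stmts(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return ("program", tree)


def parse_stmts(tokens, pos):
    # <stmts> -> <stmt> | <stmt> ; <stmts>
    stmt, pos = parse_stmt(tokens, pos)
    if peek(tokens, pos) == ";":
        rest, pos = parse_stmts(tokens, pos + 1)
        return ("stmts", stmt, ";", rest), pos
    return ("stmts", stmt), pos


def parse_stmt(tokens, pos):
    # <stmt> -> <var> = <expr>
    var, pos = parse_var(tokens, pos)
    pos = expect(tokens, pos, "=")
    expr, pos = parse_expr(tokens, pos)
    return ("stmt", var, "=", expr), pos


def parse_var(tokens, pos):
    # <var> -> a | b | c | d
    if peek(tokens, pos) in ("a", "b", "c", "d"):
        return ("var", tokens[pos]), pos + 1
    raise SyntaxError(f"expected a variable, found {peek(tokens, pos)!r}")


def parse_expr(tokens, pos):
    # <expr> -> <term> + <term> | <term> - <term>
    left, pos = parse_term(tokens, pos)
    op = peek(tokens, pos)
    if op not in ("+", "-"):
        raise SyntaxError(f"expected '+' or '-', found {op!r}")
    right, pos = parse_term(tokens, pos + 1)
    return ("expr", left, op, right), pos


def parse_term(tokens, pos):
    # <term> -> <var> | const
    if peek(tokens, pos) == "const":
        return ("term", "const"), pos + 1
    var, pos = parse_var(tokens, pos)
    return ("term", var), pos


print(parse_program("a = b + const".split()))
```

The order in which the functions call each other follows the leftmost derivation of `a = b + const`.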
Another Example
  Expression -> Expression + Expression
              | Expression - Expression
              | ...
              | Variable
              | Constant
              | ...
  Variable   -> T_IDENTIFIER
  Constant   -> T_INTCONSTANT | T_DOUBLECONSTANT
The Parse
  Input: a + 2
  Expression -> Expression + Expression
             -> Variable + Expression
             -> T_IDENTIFIER + Expression
             -> T_IDENTIFIER + Constant
             -> T_IDENTIFIER + T_INTCONSTANT
Parse Trees
  PS -> P | P PS
  P  -> ε | '(' PS ')' | '<' PS '>' | '[' PS ']'
  (ε denotes the empty string)
What's the parse tree for this statement?
  < [ ] [ < > ] >
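One way to explore this exercise is to recover the nesting structure of the brackets programmatically. The sketch below assumes that the first alternative for P (written e or ε) is the empty string, and it returns a simplified nesting tree rather than the full parse tree: the chain of PS nodes and the empty P nodes are collapsed into plain lists. The names `parse_ps` and `parse` are hypothetical.

```python
PAIRS = {"(": ")", "<": ">", "[": "]"}


def parse_ps(text, pos=0):
    """Parse a run of bracketed groups starting at pos.

    Returns (children, new_pos), where each child is a pair of the
    bracket kind and the list of groups nested inside it.
    """
    children = []
    while pos < len(text) and text[pos] in PAIRS:
        opener = text[pos]
        closer = PAIRS[opener]
        inner, pos = parse_ps(text, pos + 1)  # the PS between the brackets
        if pos >= len(text) or text[pos] != closer:
            raise SyntaxError(f"expected {closer!r} at offset {pos}")
        children.append((opener + closer, inner))
        pos += 1
    return children, pos


def parse(text):
    text = "".join(text.split())  # spaces are not part of the language
    tree, pos = parse_ps(text)
    if pos != len(text):
        raise SyntaxError(f"unexpected {text[pos]!r} at offset {pos}")
    return tree


print(parse("< [ ] [ < > ] >"))
# [('<>', [('[]', []), ('[]', [('<>', [])])])]
```

Expanding each list back into a PS chain, and each empty list into a P that derives ε, gives one parse tree for the statement under this grammar.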
EBNF - Extended BNF • Like BNF except that • Non-terminals start with an uppercase letter • Parentheses are used for grouping terminals • Braces {} represent zero or more occurrences (iteration) • Brackets [] represent an optional construct, that is, a construct that appears either once or not at all.
EBNF example
  Exp    -> Term { ('+' | '-') Term }
  Term   -> Factor { ('*' | '/') Factor }
  Factor -> '(' Exp ')' | variable | constant
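EBNF maps naturally onto a hand-written recursive-descent parser: each `{ ... }` becomes a `while` loop and each `|` becomes a branch. The sketch below evaluates expressions while parsing them. The tokenizer, the `Parser` class, and the `evaluate` helper are assumptions made for illustration, not part of the lecture.

```python
import re

TOKEN = re.compile(r"\s*(\d+\.\d+|\d+|[A-Za-z_]\w*|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text, variables=None):
        self.tokens = tokenize(text)
        self.pos = 0
        self.variables = variables or {}

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def exp(self):
        # Exp -> Term { ('+' | '-') Term }
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # Term -> Factor { ('*' | '/') Factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # Factor -> '(' Exp ')' | variable | constant
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token[0].isdigit():
            return float(token)
        if token in self.variables:
            return self.variables[token]
        raise SyntaxError(f"unexpected token or unknown variable {token!r}")


def evaluate(text, variables=None):
    parser = Parser(text, variables)
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))
```

Because Term sits below Exp in the grammar, multiplication and division bind tighter than addition and subtraction, and the loops make all four operators left-associative.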
EBNF/BNF • EBNF and BNF are equivalent • How can {} be expressed in BNF? • How can ( ) be expressed? • How can [ ] be expressed?
Semantic Analysis • The syntactically correct parse tree (or derivation) is checked for semantic errors • Check for constructs that, while syntactically valid, do not obey the semantic rules of the source language. • Examples: • Use of an undeclared or uninitialized variable • Function called with improper arguments • Incompatible operands and type mismatches
Examples
  void fun1(int i);
  double d;
  d = fun1(2.1);

  int i;
  int j;
  i = i + 2;

  int arr[2], c;
  c = arr * 10;
Most semantic analysis pertains to the checking of types.
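The type-related checks in these examples can be sketched as a walk over expression trees with a symbol table. Everything below is an assumption made for illustration: the tuple-based expression format, the `VARIABLES` and `FUNCTIONS` tables, and the deliberately strict rules (an actual C compiler would implicitly convert 2.1 to int for a prototyped call). Detecting the uninitialized use of `i` needs data-flow information and is not covered here.

```python
# Hypothetical symbol tables mirroring the declarations on the slide.
VARIABLES = {"d": "double", "i": "int", "j": "int", "c": "int", "arr": "int[2]"}
FUNCTIONS = {"fun1": (("int",), "void")}  # name -> (parameter types, return type)
NUMERIC = {"int", "double"}


def type_of(expr, errors):
    """Return the type of an expression tuple, appending messages to errors."""
    kind = expr[0]
    if kind == "const":                       # ("const", "int") or ("const", "double")
        return expr[1]
    if kind == "var":                         # ("var", name)
        name = expr[1]
        if name not in VARIABLES:
            errors.append(f"undeclared variable '{name}'")
            return "error"
        return VARIABLES[name]
    if kind == "call":                        # ("call", name, [argument expressions])
        name, args = expr[1], expr[2]
        if name not in FUNCTIONS:
            errors.append(f"undeclared function '{name}'")
            return "error"
        params, result = FUNCTIONS[name]
        arg_types = tuple(type_of(arg, errors) for arg in args)
        if arg_types != params:
            errors.append(f"'{name}' expects {params}, got {arg_types}")
        return result
    if kind == "binop":                       # ("binop", op, left, right)
        op = expr[1]
        left = type_of(expr[2], errors)
        right = type_of(expr[3], errors)
        if "error" in (left, right):          # already reported further down
            return "error"
        if left not in NUMERIC or right not in NUMERIC:
            errors.append(f"operator '{op}' cannot combine {left} and {right}")
            return "error"
        return "double" if "double" in (left, right) else "int"
    raise ValueError(f"unknown expression kind {kind!r}")


def check_assignment(target, expr):
    """Return the list of semantic errors found in `target = expr`."""
    errors = []
    type_of(("var", target), errors)
    if type_of(expr, errors) == "void":
        errors.append(f"cannot assign a void result to '{target}'")
    return errors


# d = fun1(2.1);
print(check_assignment("d", ("call", "fun1", [("const", "double")])))
# c = arr * 10;
print(check_assignment("c", ("binop", "*", ("var", "arr"), ("const", "int"))))
```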
Intermediate Code Generation • Where the intermediate representation of the source program is created. • The representation can have a variety of forms, but a common one is called three-address code (TAC) • Like assembly – the TAC is a sequence of simple instructions, each of which can have at most three operands.
Example
  Source:
    a = b * c + b * d
  Three-address code:
    _t1 = b * c
    _t2 = b * d
    _t3 = _t1 + _t2
    a = _t3
Note the temporaries (_t1, _t2, _t3)
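The translation in this example can be sketched as a post-order walk over the expression tree that allocates a fresh temporary for every operator. This is a minimal illustration, assuming expressions are nested `(op, left, right)` tuples with strings as leaves; `generate_tac` is a hypothetical name.

```python
import itertools


def generate_tac(target, expr):
    """Return three-address code lines for the assignment `target = expr`."""
    code = []
    counter = itertools.count(1)

    def emit(node):
        # Leaves (variable names or constants) are used directly as operands.
        if isinstance(node, str):
            return node
        op, left, right = node
        left_name = emit(left)      # operands are computed before the operator
        right_name = emit(right)
        temp = f"_t{next(counter)}"
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    result = emit(expr)
    code.append(f"{target} = {result}")
    return code


# a = b * c + b * d
for line in generate_tac("a", ("+", ("*", "b", "c"), ("*", "b", "d"))):
    print(line)
# _t1 = b * c
# _t2 = b * d
# _t3 = _t1 + _t2
# a = _t3
```

Each emitted instruction has at most three operands (one result and two sources), which is where the name three-address code comes from.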
Another Example
  Source:
    if (a <= b)
        a = a - c;
    c = b * c;
  Three-address code:
    _t1 = a > b
    if _t1 goto L0
    _t2 = a - c
    a = _t2
    L0: _t3 = b * c
    c = _t3
Note the temporaries and the symbolic address (the label L0)
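To see how the conditional jump and the label L0 work, the TAC can be executed by a tiny interpreter. The sketch below is an assumption-laden illustration: it supports only the instruction shapes that appear on this slide, and `run_tac` is a hypothetical name.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, ">": operator.gt}


def run_tac(lines, env):
    """Execute TAC instructions, updating and returning the variable map env."""
    # First pass: strip labels and record the instruction index each one names.
    labels, program = {}, []
    for line in lines:
        line = line.strip()
        if ":" in line:
            label, line = line.split(":", 1)
            labels[label.strip()] = len(program)
        if line.strip():
            program.append(line.split())

    def value(operand):
        return env[operand] if operand in env else float(operand)

    # Second pass: run with a program counter so that goto can change it.
    pc = 0
    while pc < len(program):
        parts = program[pc]
        pc += 1
        if parts[0] == "if":                  # if <cond> goto <label>
            if value(parts[1]):
                pc = labels[parts[3]]
        elif len(parts) == 3:                 # x = y
            env[parts[0]] = value(parts[2])
        else:                                 # x = y op z
            env[parts[0]] = OPS[parts[3]](value(parts[2]), value(parts[4]))
    return env


TAC = [
    "_t1 = a > b",
    "if _t1 goto L0",
    "_t2 = a - c",
    "a = _t2",
    "L0: _t3 = b * c",
    "c = _t3",
]

result = run_tac(TAC, {"a": 1, "b": 5, "c": 2})
print(result["a"], result["c"])
```

The test `a <= b` from the source is compiled as its negation `a > b`, so the jump skips the body of the `if` when the original condition is false.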
Next Time • Finish introduction to compilation stages • Read Aho/Sethi/Ullman Chapter 1
Selected References • Compilers: Principles, Techniques, and Tools, Aho, Sethi, and Ullman • http://www.stanford.edu/class/cs143/