

  1. Chapter 4 Lexical and Syntax Analysis • 4.1 Introduction • Three approaches to implementing programming languages • Compilation (C, C++; uses a compiler) • Pure interpretation (JavaScript; uses an interpreter) • Hybrid implementation (Java) • The compiler front end consists of a Lexical Analyzer and a Syntax Analyzer • The Lexical Analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the syntax analyzer uses for syntax analysis (dealing with names and numeric literals). • The Syntax Analyzer obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar of the source language (dealing with expressions, statements, and so on).

  2. [Diagram] Lexical Analysis feeding Syntax Analysis: the Lexical Analyzer reads characters from the input (pushing a character back when it has read one too far) and passes each token and its attributes to the Parser.

  3. 4.2 Lexical Analysis • The lexical analyzer extracts lexemes from the stream of input characters and produces the corresponding tokens. • 4.2.1 Example: the statement from the source code • sum = oldsum - value / 100; • would be grouped into the following tokens and lexemes: • The identifier (IDENT): sum • The assignment symbol (ASSIGN_OP): = • The identifier (IDENT): oldsum • The minus sign (SUBTRACT_OP): - • The identifier (IDENT): value • The division sign (DIVISION): / • The integer literal (INT_LIT): 100 • The semicolon (SEMICOLON): ;
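As a concrete picture of this output, the short C fragment below simply lists the token/lexeme pairs the analyzer would hand to the parser. The token-code names follow the slide; the numeric values and the struct layout are assumptions made for the sketch.

#include <stdio.h>

/* Illustrative token codes: names from the slide, values chosen arbitrarily. */
enum { INT_LIT = 10, IDENT = 11, ASSIGN_OP = 20, SUBTRACT_OP = 21, DIVISION = 22, SEMICOLON = 23 };

struct token { int code; const char *lexeme; };

int main(void) {
    /* Token stream a lexical analyzer would produce for: sum = oldsum - value / 100; */
    struct token stream[] = {
        { IDENT, "sum" }, { ASSIGN_OP, "=" }, { IDENT, "oldsum" },
        { SUBTRACT_OP, "-" }, { IDENT, "value" }, { DIVISION, "/" },
        { INT_LIT, "100" }, { SEMICOLON, ";" }
    };
    int n = sizeof stream / sizeof stream[0];
    for (int i = 0; i < n; i++)
        printf("token code %d, lexeme %s\n", stream[i].code, stream[i].lexeme);
    return 0;
}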

  4. 4.2.2 Functions of Lexical Analysis • Removing White Space and Comments • "White space" consists of blanks, tabs, and newlines. • Recognizing Constants • Collect a sequence of digits into an integer. • For example, consider the input 31 + 28 + 59 • Let num be the token representing an integer. • The Lexical Analyzer should group 31, 28, and 59 into the token num, pass num to the syntax analyzer, and also pass the values along as attributes of num. • Recognizing Identifiers and Keywords • Identifiers are used as names of variables, arrays, functions, and so on. A grammar for a language treats an identifier as a token (id). • For example, consider the input count = count + increment • The Lexical Analyzer should convert this statement into • id = id + id • The Lexical Analyzer must also be able to recognize the keywords reserved by the programming language.
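A minimal sketch of the first two duties, skipping white space and collecting digits into an integer whose value travels as an attribute. The function name nextToken, the token code NUM, and the string-based input are assumptions of the sketch, not part of the slides.

#include <ctype.h>
#include <stdio.h>

#define NUM 256          /* hypothetical token code for an integer constant */
static int attribute;    /* value of the most recent NUM token */

/* Skip blanks, tabs, and newlines, then collect a run of digits into the token num. */
static int nextToken(const char **p) {
    while (**p == ' ' || **p == '\t' || **p == '\n')
        (*p)++;                              /* remove white space */
    if (**p == '\0')
        return 0;                            /* end of input */
    if (isdigit((unsigned char)**p)) {
        attribute = 0;
        while (isdigit((unsigned char)**p))
            attribute = attribute * 10 + (*(*p)++ - '0');
        return NUM;                          /* the value is passed along as an attribute */
    }
    return *(*p)++;                          /* any other character is its own token */
}

int main(void) {
    const char *p = "31 + 28 + 59";
    int tok;
    while ((tok = nextToken(&p)) != 0) {
        if (tok == NUM) printf("num(%d) ", attribute);
        else            printf("'%c' ", tok);
    }
    printf("\n");                            /* prints: num(31) '+' num(28) '+' num(59) */
    return 0;
}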

  5. 4.2.3 A simple lexical analyzer • It recognizes only names, reserved words, and integer literals. • Token patterns for names, reserved words, and integer literals: • Names: strings of uppercase letters, lowercase letters, and digits that begin with a letter. • Reserved words: a lookup in a table of reserved words determines which names are reserved words. • Integer literals: strings of digits; they can begin with any of the ten digit characters. • Subprograms of the lexical analyzer: • getChar( ) gets the next character from the input program, puts it in the global variable nextChar, determines the character's class, and puts the class in charClass. • addChar( ) appends nextChar to the lexeme buffer (lexeme). • lookup( ) checks whether the current contents of lexeme is a reserved word or an ordinary name.
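A minimal C sketch of these subprograms and globals, assuming the character-class codes, token codes, and the contents of the reserved-word table, none of which are given on the slide:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Character classes and token codes: names follow the slides, values are assumptions. */
enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };
enum { INT_LIT = 10, IDENT = 11, FOR_CODE = 30, IF_CODE = 31, WHILE_CODE = 32 };

static FILE *in_fp;          /* input program */
static int  nextChar;        /* most recently read character */
static int  charClass;       /* class of nextChar */
static char lexeme[100];     /* buffer holding the current lexeme */
static int  lexLen;

/* getChar: read the next character and determine its class. */
void getChar(void) {
    nextChar = getc(in_fp);
    if (nextChar == EOF)         charClass = END_OF_INPUT;
    else if (isalpha(nextChar))  charClass = LETTER;
    else if (isdigit(nextChar))  charClass = DIGIT;
    else                         charClass = UNKNOWN;
}

/* addChar: append nextChar to the lexeme buffer. */
void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = (char)nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup: decide whether lexeme is a reserved word or an ordinary name. */
int lookup(const char *lex) {
    if (strcmp(lex, "for") == 0)   return FOR_CODE;   /* illustrative reserved-word table */
    if (strcmp(lex, "if") == 0)    return IF_CODE;
    if (strcmp(lex, "while") == 0) return WHILE_CODE;
    return IDENT;
}

A lexical analyzer built on these helpers would also reset lexLen to 0 before starting each new lexeme; the slide's lex( ) on the next pages leaves that to its caller.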

  6. Finite automata • State diagrams (a form of finite automaton) are used to describe and implement lexical analyzers. • [Figure] A state diagram to recognize names, reserved words, and integer literals

  7. Implementation

/* lex - a simple lexical analyzer */
int lex( ) {
  getChar( );
  switch (charClass) {
    /* Parse identifiers and reserved words */
    case LETTER:
      addChar( );
      getChar( );
      while (charClass == LETTER || charClass == DIGIT) {
        addChar( );
        getChar( );
      }
      return lookup(lexeme);
    /* Parse integer literals */
    case DIGIT:
      addChar( );
      getChar( );
      while (charClass == DIGIT) {
        addChar( );
        getChar( );
      }
      return INT_LIT;
  } /* End of switch */
} /* End of function lex */
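A hypothetical driver for lex( ) is sketched below. It assumes the globals and helpers from the sketch after slide 5 sit in the same file, that the source file front.in contains only names, integer literals, and white space, and that a character read one position too far can simply be pushed back with ungetc (compare the push-back arrow in the diagram of slide 2). None of these details appear on the slides.

/* Hypothetical driver: request tokens until the input is exhausted. */
int main(void) {
    int c, token;
    if ((in_fp = fopen("front.in", "r")) == NULL) {
        printf("ERROR - cannot open front.in\n");
        return 1;
    }
    while ((c = getc(in_fp)) != EOF) {
        if (isspace(c)) continue;     /* white space only separates lexemes */
        ungetc(c, in_fp);             /* push the character back for lex( ) to read */
        lexLen = 0;                   /* start a fresh lexeme */
        token = lex( );
        printf("Next token is: %d, next lexeme is %s\n", token, lexeme);
    }
    fclose(in_fp);
    return 0;
}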

  8. 4.3 Syntax Analysis (Parsing) • Parsers for programming languages construct parse trees for given programs. • Top-down Parsers: the tree is built from the root downward to the leaves. • Bottom-up Parsers: the tree is built from the leaves upward to the root. • Notation: • Terminal symbols: lowercase letters at the beginning of the alphabet (a, b, …) • Nonterminal symbols: uppercase letters at the beginning of the alphabet (A, B, …) • Terminals or nonterminals: uppercase letters at the end of the alphabet (W, X, Y, Z) • Mixed strings of terminals and/or nonterminals: lowercase Greek letters (α, β, δ, γ)

  9. 4.4 Recursive-Descent Parsing (Top-down Parsing) • Example 1: Consider the grammar • S → cAd • A → ab | a • and the input string cad • [Figure: the successive parse trees built for S] • 1. Create a tree consisting of the root S. • 2. The input pointer points to c; expand the tree by the production for S (S → cAd). The leaf c matches, so advance the pointer to a. • 3. Point to a and expand the tree by the first RHS of the production for A (A → ab). The leaf a matches, so advance the pointer to d. • 4. Point to d; it does not match the leaf b generated by the first RHS of the production for A, so this expansion fails. • 5. Go back to A, reset the pointer to a, and expand the tree by the second RHS of the production for A (A → a). Now a matches, the pointer advances to d, d matches the last leaf, and the parse succeeds.
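A minimal sketch of this backtracking behavior for the toy grammar. The function names, the string-based input pointer, and the choice to try A's longer right-hand side first are assumptions of the sketch, not part of the slide.

/* Recursive-descent recognizer for  S -> c A d ,  A -> a b | a  with backtracking. */
#include <stdio.h>

static const char *input;   /* remaining input; plays the role of the slide's pointer */

/* A -> a b | a : try the first RHS, and reset the pointer if it fails (step 5). */
static int A(void) {
    const char *save = input;                 /* remember the input pointer */
    if (*input == 'a') {                      /* A -> a b : match a ... */
        input++;
        if (*input == 'b') { input++; return 1; }   /* ... then b */
    }
    input = save;                             /* backtrack: reset the pointer */
    if (*input == 'a') { input++; return 1; } /* A -> a */
    return 0;
}

/* S -> c A d */
static int S(void) {
    if (*input == 'c') input++; else return 0;
    if (!A()) return 0;
    if (*input == 'd') { input++; return 1; }
    return 0;
}

int main(void) {
    input = "cad";
    /* Success requires S to succeed and the whole input to be consumed. */
    printf("%s\n", (S() && *input == '\0') ? "accepted" : "rejected");
    return 0;
}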

  10. Step 1: The top-down construction of a parse tree starts with the root, labeled with the starting nonterminal, and repeatedly performs the following two steps. Step 2: At a node n labeled with nonterminal A, select one of the productions for A and construct children of n for the symbols on the right side of that production. Step 3: Find the next node at which a subtree is to be constructed. • Example 2: Consider the grammar • type → simple | id | array [ simple ] of type • simple → integer | char | num dotdot num • and the input string array [ num dotdot num ] of integer
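A possible set of recursive-descent procedures for Example 2's grammar is sketched below; the token codes, the stub lexical analyzer, and the match( )/error( ) helpers are assumptions. Each call uses the single lookahead token to select a production, which is how Step 2 chooses a right-hand side.

#include <stdio.h>
#include <stdlib.h>

/* Token codes for Example 2; names and values are illustrative. */
enum { ARRAY, LBRACKET, RBRACKET, OF, INTEGER, CHAR, NUM, DOTDOT, ID, DONE };

/* Stub lexical analyzer: hands out the tokens of  array [ num dotdot num ] of integer */
static int tokens[] = { ARRAY, LBRACKET, NUM, DOTDOT, NUM, RBRACKET, OF, INTEGER, DONE };
static int pos = 0;
static int nextToken(void) { return tokens[pos++]; }

static int lookahead;                       /* the single token of lookahead */

static void error(const char *msg) { fprintf(stderr, "syntax error: %s\n", msg); exit(1); }

static void match(int t) {                  /* verify the lookahead, then advance */
    if (lookahead == t) lookahead = nextToken();
    else error("unexpected token");
}

static void simple(void);

/* type -> simple | id | array [ simple ] of type */
static void type(void) {
    if (lookahead == INTEGER || lookahead == CHAR || lookahead == NUM)
        simple();
    else if (lookahead == ID)
        match(ID);
    else if (lookahead == ARRAY) {
        match(ARRAY); match(LBRACKET); simple(); match(RBRACKET); match(OF); type();
    } else
        error("expected a type");
}

/* simple -> integer | char | num dotdot num */
static void simple(void) {
    if      (lookahead == INTEGER) match(INTEGER);
    else if (lookahead == CHAR)    match(CHAR);
    else if (lookahead == NUM)     { match(NUM); match(DOTDOT); match(NUM); }
    else error("expected a simple type");
}

int main(void) {
    lookahead = nextToken();
    type();
    if (lookahead == DONE) printf("array [ num dotdot num ] of integer : parsed\n");
    return 0;
}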

  11. Implementation of the recursive-descent parser • Each nonterminal has its own subprogram • For example: • <expr> → <term> { (+ | -) <term> } • <term> → <factor> { (* | /) <factor> } • <factor> → id | ( <expr> )
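A sketch of those three subprograms in C: each nonterminal becomes one function, and each EBNF repetition { ... } becomes a while loop. The token codes and the stub lex( ) that feeds the parser are assumptions of the sketch.

#include <stdio.h>
#include <stdlib.h>

enum { ID_TOK, ADD_OP, SUB_OP, MUL_OP, DIV_OP, LPAREN, RPAREN, END_TOK };

static int nextToken;                 /* current token from the lexical analyzer */
static int lex(void);                 /* assumed interface: returns the next token code */

static void expr(void);
static void error(void) { fprintf(stderr, "syntax error\n"); exit(1); }

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_TOK) nextToken = lex();
    else if (nextToken == LPAREN) {
        nextToken = lex();
        expr();
        if (nextToken == RPAREN) nextToken = lex();
        else error();
    } else error();
}

/* <term> -> <factor> { (* | /) <factor> } */
static void term(void) {
    factor();
    while (nextToken == MUL_OP || nextToken == DIV_OP) {
        nextToken = lex();
        factor();
    }
}

/* <expr> -> <term> { (+ | -) <term> } */
static void expr(void) {
    term();
    while (nextToken == ADD_OP || nextToken == SUB_OP) {
        nextToken = lex();
        term();
    }
}

/* Stub lexical analyzer producing tokens for:  id + id * ( id - id ) */
static int toks[] = { ID_TOK, ADD_OP, ID_TOK, MUL_OP, LPAREN, ID_TOK, SUB_OP, ID_TOK, RPAREN, END_TOK };
static int tp = 0;
static int lex(void) { return toks[tp++]; }

int main(void) {
    nextToken = lex();
    expr();
    printf(nextToken == END_TOK ? "parsed\n" : "trailing input\n");
    return 0;
}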

  12. The LL grammar class • The first L in LL specifies a left-to-right scan of the input; the second L specifies that a leftmost derivation is generated. • Problem 1: Left recursion • Problem 2: Choice of the correct RHS • (The standard example and approach for each problem are sketched after this slide.)
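The slide leaves the example and approach blank; the sketch below gives the standard textbook treatments, so the specific rules shown are illustrative rather than taken from the slides.

• Left recursion: a rule such as A → Aα | β is directly left recursive, so a top-down parser expanding A would call itself forever without consuming any input. The usual transformation rewrites it as A → βA' and A' → αA' | ε. For example, E → E + T | T becomes E → T E' with E' → + T E' | ε.
• Choice of the correct RHS: an LL parser must pick a right-hand side using only the next input token. The pairwise disjointness test requires that, for each nonterminal A with rules A → α1 | α2 | …, the sets FIRST(α1), FIRST(α2), … do not overlap. When two RHSs begin the same way, left factoring restores disjointness: A → ab | ac becomes A → aA' with A' → b | c.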

  13. 4.5 Bottom-up Parsing • Example: Consider the grammar • S → aABe • A → Abc | b • B → d • and the sentence abbcde • S => aABe => aAde => aAbcde => abbcde
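A bottom-up parser discovers this derivation in reverse: it repeatedly reduces a substring of the current sentential form to a nonterminal until only the start symbol remains. Reading the derivation above backwards (the slide shows only the forward derivation, so this reduction trace is added for illustration):

• abbcde : reduce b to A (A → b), giving aAbcde
• aAbcde : reduce Abc to A (A → Abc), giving aAde
• aAde : reduce d to B (B → d), giving aABe
• aABe : reduce aABe to S (S → aABe), giving S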
