Foundations of Software Design

Foundations of Software Design. Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs. Marti Hearst, Fall 2002. Topics: Programming Languages, Compiler, Assembly Language, CPU, Address Space, Circuits, Code vs. Data, Gates, Orders of Magnitude, Boolean Logic, Number Systems.

Presentation Transcript


  1. Foundations of Software Design. Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs. Marti Hearst, Fall 2002.

  2. How Do Computers Work (Revisited)? Diagram labels: Programming Languages, Compiler, Assembly Language, Machine Instructions, CPU, Address Space, Code vs. Data, Circuits, Gates, Boolean Logic, Number Systems, Bits & Bytes, Binary Numbers, Orders of Magnitude.

  3. The Compiler • What is a compiler? • A recognizer (of some source language L). • A translator (of programs written in L into programs written in some object or target language L'). • A compiler is itself a program, written in some host language. • Operates in phases. Diagram labels: Programming Languages, Compiler, Assembly Language, Machine Instructions. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  4. Converting Java to Byte Code • When you compile a Java program, javac produces byte codes (stored in the class file). • javac does not convert the byte codes to machine code. • Instead, the VM interprets them (or compiles them just in time) when you run the program called java.

  5. Diagram with two translation paths. • C code: translated by the C compiler (gcc or cc) into Assembly Language, then Machine Code. • Java code: translated by the Java compiler (javac or jit) into byte code (class file), which runs in the Java Virtual Machine. • Labels: "Creates the JVM once"; "Individual program is loaded & run in JVM".

  6. Compiler Compilers • Which came first: the compiler or the program? • The very first one has to be written in assembly language! • This is why most programming languages today start with the C code generator • After you have created the first compiler for a given language, say Java, then you … • Use that compiler to compile itself!!

  7. Compiling Your Compiler • Write the first Java compiler using C; compile it using gcc (javac in C). • Write the second Java compiler using Java; compile it using javac (javac in Java). • Write other Java programs; compile them using javac.

  8. Compiler in more detail. Phases, in order: lexical analyzer (scanner) → syntax analyzer (parser) → semantic analyzer → intermediate code generator → optimizer → code generator. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  9. The Scanner • Task: • Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes). • How it’s done • Specify lexemes using Regular Expressions • Convert these Regular Expressions into Finite Automata Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  10. Lexemes and Tokens Here are some Java lexemes and the corresponding tokens:

          Lexeme    Token
          ;         SEMI-COLON
          =         ASSIGN
          index     IDENT
          tmp       IDENT
          37        INT-LIT
          102       INT-LIT

      Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers). Given the source code:

          position = initial + rate * 60 ;

      a Java scanner would return the following sequence of tokens:

          IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON

      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
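
A minimal sketch of such a scanner, assuming only the six token kinds shown on this slide. The class name `TinyScanner`, the method `tokenize`, and the simplified ASCII-only regular expressions are illustrative assumptions, not part of the lecture or of any real Java scanner. Each token kind is specified by a regular expression, and the scanner tries them in order at the current position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyScanner {
    // Token name and the regular expression describing its lexemes.
    // Hypothetical, simplified specification: ASCII identifiers and decimal digits only.
    private static final String[][] SPEC = {
        {"IDENT", "[A-Za-z_][A-Za-z0-9_]*"},
        {"INT-LIT", "[0-9]+"},
        {"ASSIGN", "="},
        {"PLUS", "\\+"},
        {"TIMES", "\\*"},
        {"SEMI-COLON", ";"},
    };

    private static final Pattern[] PATTERNS = new Pattern[SPEC.length];
    static {
        for (int i = 0; i < SPEC.length; i++) {
            PATTERNS[i] = Pattern.compile(SPEC[i][1]);
        }
    }

    /** Returns the token names for the given source text, in order. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Whitespace separates lexemes but produces no token.
            while (pos < source.length() && Character.isWhitespace(source.charAt(pos))) {
                pos++;
            }
            if (pos == source.length()) {
                return tokens;
            }
            boolean matched = false;
            for (int i = 0; i < PATTERNS.length && !matched; i++) {
                Matcher m = PATTERNS[i].matcher(source);
                m.region(pos, source.length());
                if (m.lookingAt()) {          // match must start exactly at pos
                    tokens.add(SPEC[i][0]);
                    pos = m.end();
                    matched = true;
                }
            }
            if (!matched) {
                // A character that starts no lexeme is a lexical error.
                throw new IllegalArgumentException("Lexical error at offset " + pos);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("position = initial + rate * 60 ;"));
    }
}
```

A generated scanner would normally compile all the expressions into a single automaton rather than trying them one at a time; the loop here trades speed for readability.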

  11. The Scanner • Also called the Lexer • How it works: • Reads characters from the source program. • Groups the characters into lexemes (sequences of characters that "go together"). • Each lexeme corresponds to a token; • the scanner returns the next token (plus maybe some additional information) to the parser. • The scanner may also discover lexical errors (e.g., erroneous characters). • The definitions of what is a lexeme, token, or bad character all depend on the source language. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  12. Two kinds of Automata Deterministic (DFA): • No state has more than one outgoing edge with the same label. Non-Deterministic (NFA): • States may have more than one outgoing edge with same label. • Edges may be labeled with ε (epsilon), the empty string. • The automaton can take an ε (epsilon) transition without looking at the current input character. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  13. Regular Expressions to Finite Automata • Generating a scanner: Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  14. BNF • Backus-Naur form, Backus-Normal form • A set of rules (or productions) • Each of which expresses the ways symbols of the language can be grouped together • Non-terminals are written upper-case • Terminals are written lower-case • The start symbol is the left-hand side of the first production • The rules for a CFG are often referred to as its BNF

  15. Java Identifier Definition Described in the Java specification: • http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591 • “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. • An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).” Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
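
The quoted definition can be sketched as a small check, assuming the standard library methods `Character.isJavaIdentifierStart` and `Character.isJavaIdentifierPart` stand in for "Java letter" and "Java letter or digit". The class name `IdentifierCheck` and the deliberately partial `RESERVED` set are illustrative assumptions; a real scanner would list every keyword and would also handle supplementary Unicode characters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IdentifierCheck {
    // Partial list of spellings an identifier may not have (a few keywords,
    // the Boolean literals, and the null literal). Illustration only.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "class", "int", "while", "true", "false", "null"));

    static boolean isIdentifier(String s) {
        if (s.isEmpty() || RESERVED.contains(s)) {
            return false;
        }
        // The first character must be a Java letter.
        if (!Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // Every later character must be a Java letter or Java digit.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("tmp"));     // a legal identifier
        System.out.println(isIdentifier("2index"));  // starts with a digit
    }
}
```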

  16. Java Identifier Definition Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  17. Java Integer Literals • An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8) • Examples: 0 2 0372 0xDadaCafe 1996 0x00FF00FF (opt means optional) Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  18. Defining Java Decimal Numerals A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer: Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
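
As a rough illustration of a table-driven DFA, the sketch here recognizes exactly the decimal numerals described on this slide (either `0` alone, or a digit 1 to 9 followed by any digits). The class name `DecimalNumeralDfa`, the state names, and the three character classes are assumptions made for the example; suffixes such as `L` are out of scope.

```java
class DecimalNumeralDfa {
    // States of the automaton.
    private static final int START = 0, ZERO = 1, NUM = 2, DEAD = 3;

    // Transition table: NEXT[state][characterClass].
    // Character classes: 0 = the digit '0', 1 = a digit '1'..'9', 2 = anything else.
    private static final int[][] NEXT = {
        /* START */ {ZERO, NUM,  DEAD},
        /* ZERO  */ {DEAD, DEAD, DEAD},   // nothing may follow a lone 0
        /* NUM   */ {NUM,  NUM,  DEAD},
        /* DEAD  */ {DEAD, DEAD, DEAD},
    };

    private static int charClass(char c) {
        if (c == '0') return 0;
        if (c >= '1' && c <= '9') return 1;
        return 2;
    }

    /** Runs the DFA over the whole input; accepts in state ZERO or NUM. */
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][charClass(input.charAt(i))];
        }
        return state == ZERO || state == NUM;
    }

    public static void main(String[] args) {
        System.out.println(accepts("1996"));  // decimal numeral
        System.out.println(accepts("0372"));  // octal in Java, so not a decimal numeral
    }
}
```

Because every state has exactly one successor per character class, the loop never has to backtrack, which is the property that distinguishes a DFA from an NFA.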

  19. Defining Floating-Point Literals A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
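
One way to write this definition down is as a regular expression. The sketch assumes the usual reading of the rule: at least one digit in the whole-number or fractional part, and at least one of a decimal point, an exponent, or a type suffix (`f`, `F`, `d`, `D`). The class name `FloatLiteral` is hypothetical, and hexadecimal floating-point forms are not covered.

```java
import java.util.regex.Pattern;

class FloatLiteral {
    private static final String EXP = "[eE][+-]?[0-9]+";   // optionally signed exponent
    private static final String SUFFIX = "[fFdD]";         // type suffix

    // Four alternatives, one per way of satisfying the rule.
    private static final Pattern FLOAT = Pattern.compile(
          "[0-9]+\\.[0-9]*(" + EXP + ")?" + SUFFIX + "?"   // digits, point, optional rest
        + "|\\.[0-9]+(" + EXP + ")?" + SUFFIX + "?"        // point first, then digits
        + "|[0-9]+" + EXP + SUFFIX + "?"                   // no point, but an exponent
        + "|[0-9]+(" + EXP + ")?" + SUFFIX);               // no point, but a suffix

    static boolean isFloatLiteral(String s) {
        return FLOAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFloatLiteral("6.022137e+23f"));
        System.out.println(isFloatLiteral("1"));  // an integer literal, not floating-point
    }
}
```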

  20. From the Lucene HTML Scanner

  21. The Functionality of the Parser • Input: sequence of tokens from lexical analysis • Output: parse tree of the program • parse tree is generated if the input is a legal program • if input is an illegal program, syntax errors are issued • Note: • Instead of parse tree, some parsers produce directly: • abstract syntax tree (AST) + symbol table, or • intermediate code, or • object code Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  22. Parser vs. Scanner Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  23. The Parser • Groups tokens into "grammatical phrases", discovering the underlying structure of the source program. • Finds syntax errors. • Example • position = * 5 ; • corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON • All are legal tokens, but that sequence of tokens is erroneous. • Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared. • Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree. Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  24. What must the parser do? • Recognizer: not all strings of tokens are programs • must distinguish between valid and invalid strings of tokens • Translator: must expose program structure • e.g., associativity and precedence • must return the parse tree We need: • A language for describing valid strings of tokens • context-free grammars • (analogous to regular expressions in the scanner) • A method for distinguishing valid from invalid strings of tokens (and for building the parse tree) • the parser • (analogous to the state machine in the scanner) Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  25. Parser Example

          position = initial + rate * 60 ;

      Resulting tree (children indented under their parent):

          =
            position
            +
              initial
              *
                rate
                60

      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
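
A minimal recursive-descent sketch of a parser for this one statement form. The grammar it assumes is `stmt → IDENT = expr ;`, `expr → term (+ term)*`, `term → factor (* factor)*`, `factor → IDENT | INT-LIT`; that grammar, the class name `TinyParser`, the whitespace-separated input, and the parenthesized prefix output are all assumptions for illustration, not the lecture's parser. Putting `*` one level below `+` in the grammar is what gives multiplication the higher precedence in the tree.

```java
class TinyParser {
    private final String[] tokens;
    private int pos = 0;

    private TinyParser(String source) {
        // Assumption: lexemes are separated by whitespace, as on the slide.
        this.tokens = source.trim().split("\\s+");
    }

    /** Parses one assignment statement and returns its tree in prefix form. */
    static String parse(String source) {
        TinyParser p = new TinyParser(source);
        String tree = p.statement();
        if (p.pos != p.tokens.length) {
            throw p.error("end of input");
        }
        return tree;
    }

    // stmt -> IDENT = expr ;
    private String statement() {
        String target = match("[A-Za-z_]\\w*", "identifier");
        match("=", "'='");
        String value = expression();
        match(";", "';'");
        return "(= " + target + " " + value + ")";
    }

    // expr -> term (+ term)*
    private String expression() {
        String left = term();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++;
            left = "(+ " + left + " " + term() + ")";
        }
        return left;
    }

    // term -> factor (* factor)*
    private String term() {
        String left = factor();
        while (pos < tokens.length && tokens[pos].equals("*")) {
            pos++;
            left = "(* " + left + " " + factor() + ")";
        }
        return left;
    }

    // factor -> IDENT | INT-LIT
    private String factor() {
        return match("[A-Za-z_]\\w*|[0-9]+", "identifier or integer literal");
    }

    private String match(String regex, String expected) {
        if (pos < tokens.length && tokens[pos].matches(regex)) {
            return tokens[pos++];
        }
        throw error(expected);
    }

    private IllegalArgumentException error(String expected) {
        String found = pos < tokens.length ? "'" + tokens[pos] + "'" : "end of input";
        return new IllegalArgumentException("Syntax error: expected " + expected + ", found " + found);
    }

    public static void main(String[] args) {
        System.out.println(parse("position = initial + rate * 60 ;"));
    }
}
```

Given the erroneous input `position = * 5 ;` from the earlier slide on the parser, this sketch throws at the `*`, since every token is legal but no `factor` can start there.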

  26. The Semantic Analyzer • The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors. • Annotates and/or changes the abstract syntax tree • (e.g., it might annotate each node that represents an expression with its type). • Example with before and after (children indented under their parent):

      Before:

          =
            position
            +
              initial
              *
                rate
                60

      After:

          = (float)
            position (float)
            + (float)
              initial (float)
              * (float)
                rate (float)
                int-to-float() (float)
                  60 (int)

      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
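
The annotation step can be sketched as a bottom-up pass over an expression tree. This is a simplified illustration under stated assumptions: only the types `int` and `float` exist, the variables are declared `float` in a symbol table passed in by the caller, and a mismatched `int` operand is wrapped in an `int-to-float` node. The names `TypeChecker`, `Expr`, and `analyze` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TypeChecker {
    /** Expression node: a leaf has text only; an operator node has op and children. */
    static final class Expr {
        final String op;     // null for a leaf
        final String text;   // variable name or literal, null for an operator node
        Expr left, right;
        String type;         // filled in by analyze: "int" or "float"

        Expr(String text) { this.op = null; this.text = text; }

        Expr(String op, Expr left, Expr right) {
            this.op = op; this.text = null; this.left = left; this.right = right;
        }

        @Override public String toString() {
            if (op == null) return text + ":" + type;
            if (right == null) return op + "(" + left + "):" + type;
            return "(" + left + " " + op + " " + right + "):" + type;
        }
    }

    /** Annotates every node with its type, inserting conversions where needed. */
    static Expr analyze(Expr e, Map<String, String> symbols) {
        if (e.op == null) {
            if (e.text.matches("[0-9]+")) {
                e.type = "int";
            } else {
                String declared = symbols.get(e.text);
                if (declared == null) {
                    // A "static semantic" error: use of an undeclared variable.
                    throw new IllegalStateException("undeclared variable: " + e.text);
                }
                e.type = declared;
            }
            return e;
        }
        if (e.right == null) {
            return e;  // conversion nodes are created already typed
        }
        e.left = analyze(e.left, symbols);
        e.right = analyze(e.right, symbols);
        if (!e.left.type.equals(e.right.type)) {
            // Mixed int/float operands: widen the int side.
            if (e.left.type.equals("int")) e.left = toFloat(e.left);
            else e.right = toFloat(e.right);
        }
        e.type = e.left.type;
        return e;
    }

    private static Expr toFloat(Expr operand) {
        Expr conversion = new Expr("int-to-float", operand, null);
        conversion.type = "float";
        return conversion;
    }

    public static void main(String[] args) {
        Map<String, String> symbols = new HashMap<>();
        symbols.put("initial", "float");
        symbols.put("rate", "float");
        Expr sum = new Expr("+", new Expr("initial"),
                            new Expr("*", new Expr("rate"), new Expr("60")));
        System.out.println(analyze(sum, symbols));
    }
}
```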

  27. Intermediate Code Generator The intermediate code generator translates from abstract-syntax tree to intermediate code. • One possibility is 3-address code. • Here's an example of 3-address code for the abstract-syntax tree shown above:

          temp1 = int-to-float(60)
          temp2 = rate * temp1
          temp3 = initial + temp2
          position = temp3

      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
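
A small sketch of how such code could be produced: a post-order walk that emits one instruction per interior node and returns the name holding that node's value. The class `ThreeAddress`, its `Node` type, and the `tempN` naming scheme are assumptions chosen to mirror the slide, not the lecture's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeAddress {
    /** Minimal AST node: a leaf has a name only; other nodes have an operator. */
    static final class Node {
        final String op;     // "+", "*", "int-to-float", or null for a leaf
        final String name;   // variable or literal text for a leaf
        final Node left, right;

        private Node(String op, String name, Node left, Node right) {
            this.op = op; this.name = name; this.left = left; this.right = right;
        }
        static Node leaf(String name) { return new Node(null, name, null, null); }
        static Node unary(String op, Node operand) { return new Node(op, null, operand, null); }
        static Node binary(String op, Node l, Node r) { return new Node(op, null, l, r); }
    }

    private final List<String> code = new ArrayList<>();
    private int tempCount = 0;

    /** Emits code for n and returns the name that holds its value. */
    private String gen(Node n) {
        if (n.op == null) {
            return n.name;                       // leaves need no instruction
        }
        String left = gen(n.left);               // operands first (post-order)
        String rhs;
        if (n.right == null) {
            rhs = n.op + "(" + left + ")";
        } else {
            String right = gen(n.right);
            rhs = left + " " + n.op + " " + right;
        }
        String temp = "temp" + (++tempCount);    // fresh temporary for this node
        code.add(temp + " = " + rhs);
        return temp;
    }

    /** Translates "target = value" into a list of 3-address instructions. */
    static List<String> translateAssignment(String target, Node value) {
        ThreeAddress generator = new ThreeAddress();
        String result = generator.gen(value);
        generator.code.add(target + " = " + result);
        return generator.code;
    }

    public static void main(String[] args) {
        Node value = Node.binary("+", Node.leaf("initial"),
            Node.binary("*", Node.leaf("rate"), Node.unary("int-to-float", Node.leaf("60"))));
        for (String line : translateAssignment("position", value)) {
            System.out.println(line);
        }
    }
}
```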

  28. The Optimizer • Examine the program and rewrite it in ways that preserve the meaning but are more efficient. • Incredibly complex programs and algorithms • Example:

          int count = 0;
          for (int j = 0; j < 2*5; j++) {
              int temp = j + 1;
              count += 3;
          }

      • Move the declaration of temp outside the loop so it isn’t re-declared every time the loop is executed • Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time) • If we removed the line with temp (its value is never used), the optimizer might even skip the loop altogether • You can see in advance that count ends up = 30 (10 iterations, adding 3 each time) Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
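
To make the claim that these rewrites preserve meaning concrete, here are three hand-written versions of the loop, each one step further along. These are manual rewrites for illustration only; the class and method names are hypothetical, and no claim is made about which of these steps any particular compiler performs.

```java
class LoopOptimization {
    /** The loop as written on the slide. */
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1;   // computed but never used
            count += 3;
        }
        return count;
    }

    /** After constant folding (2*5 becomes 10) and removing the unused temp. */
    static int foldedAndPruned() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    /** After evaluating the whole loop ahead of time: 10 iterations times 3. */
    static int fullyReduced() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + foldedAndPruned() + " " + fullyReduced());
    }
}
```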

  29. The Code Generator • The code generator generates object code from (optimized) intermediate code.

          LOADF  rate,R1
          MULF   #60.0,R1
          LOADF  initial,R2
          ADDF   R2,R1
          STOREF R1,position

      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  30. Tools • Scanner Generator • Used to create a scanner automatically • Input: • a regular expression for each token to be recognized • Output: • a finite state machine • Examples: • lex or flex (produce C code), or JLex (produces Java) • Compiler Compilers • yacc (produces C) or JavaCC (produces Java, also has a scanner generator). Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  31. From the Lucene HTML Parser

  32. From the Lucene HTML Parser

  33. Graphs / Networks

  34. What is a Graph? Slide adapted from Goodrich & Tamassia

  47. Next Time • Graph Traversal • Directed Graphs (digraphs) • DAGs • Weighted Graphs Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
