
Introduction to Compilers 66416 Computer Engineering Spring Semester 2005

Dr. Raed Alqadi


Presentation Transcript


  1. Introduction to Compilers 66416, Computer Engineering, Spring Semester 2005. Dr. Raed Alqadi

  2. Why Take This Course? • Interested in writing compilers for programming languages. • Curious about compilers. • Requirement. • Why shouldn’t you take this course? • Think it will be easy.

  3. Why are Compilers Interesting? • Mixture of formal techniques and ad-hoc programming. • Earliest non-trivial programs. • Useful and prevalent tools. • Special purpose languages are part of most programs. • Techniques are widely applicable. • Good field to find job or do research. • New computers have produced new problems.

  4. Pieces of a Compiler • Will discuss scanning, parsing, (optimization), code generation. • Input language can be any programming language. • Output is usually machine/assembly language. • Could be another language (e.g. C). • Read Dragon Ch1.

  5. Why This Organization? • Traditional (and effective). • Functional decomposition. • Separate theory for each area. • Large size requires abstractions. • What is the alternative?

  6. Example: Scanner • Input is sequence of characters. • If x>100 then y :=1 else y:=2; • Output is tokens (lexemes).

  7. Example: Parser • Input is tokens. • Output is syntax tree.

  8. Example: Optimizer • Input is syntax tree. • Output is modified syntax tree. Assume y has the value 2 before the conditional statement.

  9. Example: Code Generation • Input is syntax tree. • Output is assembly (machine) language program:
            ble x, 100, L505
            mov y, 1
            j L506
      L505: mov y, 2
      L506:

  10. Extended example to illustrate compiler • Light bulb language (LBL) • Language for controlling a light bulb. • Syntax of the two statements: • on <time>. • off <time>. • Where <time> is an integer.

  11. Syntax of Language • Written in Backus-Naur Form (BNF): • <stmt> := on_stmnt | off_stmnt • on_stmnt := on <int> ; • off_stmnt := off <int> ; • For example: • on 33; off -89; off 0;

  12. Semantics of Language • What is the ‘meaning’ of a program written in a language? • Statements processed in order in which they appear. • on N turns on light bulb for N ticks. • off N turns off light bulb for N ticks. • Time must be non-negative. • For example: • on 3; off 2; off 6; on 0; • on -1; illegal program.

  13. Light bulb Computer • Light bulb controlled by machine with two instructions. • lt_on turns on light for 1 tick. • lt_off turns off light for 1 tick. • Semantic mismatch between language and machine. • N ticks vs. 1 tick. • Compiler bridges the gap by translating the program. • on 3; off 2; on 1; translates to lt_on lt_on lt_on lt_off lt_off lt_on

  14. Pieces of a Compiler

  15. Scanning LBL • Recognize keywords (on, off), integers, and ' ' (white space). • Ad-hoc scanner:
      token scan (FILE *input) {
        int c;
        while (1) {
          c = read_char (input);
          if (c == 'o') {
            c = read_char (input);
            if (c == 'n') return ON_TOKEN;
            else if (c == 'f') . . .
          }
          else if (c >= '0' && c <= '9') return scan_integer (input);
          else if (c == ';') return SEMI_TOKEN;
          else if (c == ' ') continue;
          else error ("Unknown character");
        }
      }
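The ad-hoc scanner on slide 15 can be turned into something compilable along the following lines. This is a minimal sketch under stated assumptions, not the course's scanner: it reads from a string rather than a FILE, the names `scan_string`, `EOF_TOKEN`, and `ERROR_TOKEN` are hypothetical additions, and a leading `-` is accepted so that inputs such as `off -89;` from slide 11 can be tokenized and rejected later.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token kinds for LBL. ON/OFF/INT/SEMI follow the slides;
   EOF_TOKEN and ERROR_TOKEN are assumptions made for this sketch. */
typedef enum {
    ON_TOKEN, OFF_TOKEN, INT_TOKEN, SEMI_TOKEN, EOF_TOKEN, ERROR_TOKEN
} token_kind;

typedef struct {
    token_kind kind;
    int value;              /* meaningful only for INT_TOKEN */
} token;

/* Scan one token starting at src[*pos] and advance *pos past it.
   On ERROR_TOKEN the position is left at the offending character.
   Integer overflow is not checked. */
token scan_string(const char *src, size_t *pos)
{
    token t = { ERROR_TOKEN, 0 };
    assert(src != NULL && pos != NULL);

    /* Skip white space between tokens. */
    while (isspace((unsigned char)src[*pos]))
        (*pos)++;

    char c = src[*pos];
    if (c == '\0') {
        t.kind = EOF_TOKEN;
    } else if (c == ';') {
        (*pos)++;
        t.kind = SEMI_TOKEN;
    } else if (c == 'o' && src[*pos + 1] == 'n') {
        *pos += 2;
        t.kind = ON_TOKEN;
    } else if (c == 'o' && src[*pos + 1] == 'f' && src[*pos + 2] == 'f') {
        *pos += 3;
        t.kind = OFF_TOKEN;
    } else if (isdigit((unsigned char)c) ||
               (c == '-' && isdigit((unsigned char)src[*pos + 1]))) {
        int sign = 1, v = 0;
        if (c == '-') {
            sign = -1;
            (*pos)++;
        }
        while (isdigit((unsigned char)src[*pos])) {
            v = v * 10 + (src[*pos] - '0');
            (*pos)++;
        }
        t.kind = INT_TOKEN;
        t.value = sign * v;
    }
    return t;
}
```

A caller would initialize `size_t pos = 0;` and call `scan_string(src, &pos)` repeatedly until it returns `EOF_TOKEN` or `ERROR_TOKEN`.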

  16. Parsing and Compiling LBL • Parser gets a stream of tokens from the scanner. • ON_TOKEN, OFF_TOKEN, INT_TOKEN, SEMI_TOKEN •
      parse (FILE *input) {
        token t;
        int i;
        t = scan (input);
        if (t == ON_TOKEN) {
          t = scan (input);
          if (t != INT_TOKEN) syntax_error ("Expected int");
          else for (i = 0; i < token_value (t); i++) generate ("lt_on");
        }
        else if (t == OFF_TOKEN) { . . . }
        else syntax_error ("Unknown keyword");
      }
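The parse-and-generate idea on slide 16 can be sketched as a function over an already scanned token array. Everything named here (`compile_lbl`, `emit`, the output-buffer convention, the -1 error code) is an assumption for illustration. Unlike the slide, the sketch loops over all statements, requires the terminating `;`, and enforces the rule from slide 12 that times must be non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { ON_TOKEN, OFF_TOKEN, INT_TOKEN, SEMI_TOKEN } token_kind;
typedef struct { token_kind kind; int value; } token;

/* Append instr plus a newline to the NUL-terminated buffer out.
   Returns 0 on success, -1 if the buffer is too small. */
static int emit(char *out, size_t cap, const char *instr)
{
    size_t used = strlen(out), len = strlen(instr);
    if (used + len + 2 > cap)       /* instr + '\n' + '\0' */
        return -1;
    memcpy(out + used, instr, len);
    out[used + len] = '\n';
    out[used + len + 1] = '\0';
    return 0;
}

/* Translate statements of the form  on <int> ;  or  off <int> ;
   into one lt_on / lt_off instruction per tick.
   Returns 0 on success, -1 on a syntax or semantic error. */
int compile_lbl(const token *toks, size_t n, char *out, size_t cap)
{
    assert(out != NULL && (toks != NULL || n == 0));
    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < n; i += 3) {
        if (toks[i].kind != ON_TOKEN && toks[i].kind != OFF_TOKEN)
            return -1;                          /* unknown keyword */
        if (i + 2 >= n)
            return -1;                          /* statement cut short */
        if (toks[i + 1].kind != INT_TOKEN)
            return -1;                          /* expected int */
        if (toks[i + 1].value < 0)
            return -1;                          /* time must be non-negative */
        if (toks[i + 2].kind != SEMI_TOKEN)
            return -1;                          /* expected ';' */

        const char *instr = toks[i].kind == ON_TOKEN ? "lt_on" : "lt_off";
        for (int k = 0; k < toks[i + 1].value; k++)
            if (emit(out, cap, instr) != 0)
                return -1;
    }
    return 0;
}
```

For the tokens of `on 2; off 1;` the buffer ends up holding `lt_on`, `lt_on`, `lt_off`, one per line.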

  17. Optimization • New instruction • lt_toggle = lt_on; lt_off; or lt_off; lt_on; • Compiler must recognize transitions between states. • Difficult in the parser, since it doesn’t know the next state when generating code for the current state. • Build a representation of the program and examine the whole sequence before optimizing or generating code. • Keep an array of values • a[i] > 0 means light is on for a[i] ticks. • a[i] < 0 means light is off for -a[i] ticks.

  18. Optimization, cont’d •
      parse (FILE *input) {
        token t;
        t = scan (input);
        if (t == ON_TOKEN) {
          t = scan (input);
          if (t != INT_TOKEN) syntax_error ("Expected int");
          else a[i++] = token_value (t);
        }
        else if (t == OFF_TOKEN) { . . . }
        else syntax_error ("Unknown keyword");
      }

  19. Code Generation After Optimization •
      Generate (int *a) {
        int i, j;
        for (i = 0; i < length(a); i++) {
          if (i + 1 < length(a) && a[i] == a[i+1]) generate("lt_toggle");
          else if (a[i] > 0) for (j = 0; j < a[i]; j++) generate("lt_on");
          else for (j = 0; j < -a[i]; j++) generate("lt_off");
        }
      }
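One possible reading of the toggle optimization, taken from the definition on slide 17 (lt_toggle stands for one on-tick immediately followed by one off-tick, or the reverse), is a peephole pass over the tick sequence implied by the array. The sketch below follows that reading only; it does not reproduce the loop on slide 19, and the names `generate_with_toggle` and `put` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Store instr at code[count] if there is room; return the new count. */
static size_t put(const char *code[], size_t cap, size_t count,
                  const char *instr)
{
    if (count < cap)
        code[count] = instr;
    return count + 1;
}

/* a[i] > 0: light on for a[i] ticks; a[i] < 0: light off for -a[i] ticks.
   Walks the ticks in order, holding back one tick at a time. When the
   held tick and the next tick differ, the pair becomes one lt_toggle.
   Returns the number of instructions needed; at most cap are stored.
   Assumes no entry equals INT_MIN. */
size_t generate_with_toggle(const int *a, size_t n,
                            const char *code[], size_t cap)
{
    int pending = -1;               /* -1: none held, 0: off tick, 1: on tick */
    size_t count = 0;
    assert(a != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int state = a[i] > 0;
        int ticks = a[i] > 0 ? a[i] : -a[i];
        for (int t = 0; t < ticks; t++) {
            if (pending < 0) {
                pending = state;                        /* hold this tick */
            } else if (pending != state) {
                count = put(code, cap, count, "lt_toggle");
                pending = -1;                           /* both ticks consumed */
            } else {
                /* Same state: flush the held tick, hold the new one. */
                count = put(code, cap, count, state ? "lt_on" : "lt_off");
            }
        }
    }
    if (pending >= 0)
        count = put(code, cap, count, pending ? "lt_on" : "lt_off");
    return count;
}
```

Under this reading, the array {3, -2, 1} (on 3; off 2; on 1;) needs four instructions, lt_on lt_on lt_toggle lt_toggle, instead of the six produced without the optimization.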

  20. Constructing a Scanner • Need a formal way to describe the items recognized by the scanner. • Items are called tokens or lexemes. • Regular expressions (REs). • Automatic techniques for constructing scanners from REs. • Dragon: Sec 3.1-3.7.

  21. How to Describe Tokens • English. • e.g. “an identifier is a sequence of letters and digits that does not begin with a digit” • What about A_3? • Verbose, but not precise. • e.g. “a floating point number is a sequence of one or more digits followed by a decimal point followed by a sequence of one or more digits”. • Too complex and verbose. • What about -1.4, 1.0e9? • Regular Expressions (REs) • Concise and precise notation. • Efficient translation to a finite state machine that mechanically recognizes tokens.

  22. What is a Regular Expression (RE)? • Language for describing a set of strings. • A string S is in the language of RE R if R matches S. • Built from a simple set of rules. • Start with an alphabet A of symbols (i.e., characters). • A = {a, …, z, A, …, Z, 0, …, 9}. • Also allow the empty string λ. • Concatenate REs with ‘.’ • If R1 and R2 are REs, so is R1.R2. • a.b is an ‘a’ followed by a ‘b’. • 1.0.1 is a ‘1’, a ‘0’, and finally a ‘1’.

  23. REs, cont’d • Alternate REs with ‘|’ • If R1 and R2 are REs, so is R1|R2. • 0|1 is a ‘0’ or ‘1’. • (a.b)|c is ‘ab’ or ‘c’. • 0|1|2|3|4|5|6|7|8|9 matches a digit. • Repeat REs with ‘*’ • If R1 is an RE, then R1* is an RE. • Means zero or more repetitions of R1. • a* matches λ, a, aa, aaa, aaaa, …. • (0|1|2|3|4|5|6|7|8|9)* matches any string of digits, including the empty string λ.

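  24. REs, Precisely • $L(R)$ is the set of strings matched by RE $R$. • $x \in A \Rightarrow L(x) = \{x\}$. • $R = R_1 . R_2 \Rightarrow L(R) = \{ab : a \in L(R_1),\ b \in L(R_2)\}$. • $R = R_1 \mid R_2 \Rightarrow L(R) = L(R_1) \cup L(R_2)$. • $R = R_1^{*} \Rightarrow L(R) = \bigcup_{s=0}^{\infty} \{a_1 a_2 \cdots a_s : a_1, \ldots, a_s \in L(R_1)\}$, where $s = 0$ contributes the empty string $\lambda$.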

  25. Examples of REs • Letter := a|b . . . y|z|A|B . . . Y|Z • Digit := 0|1|2|3|4|5|6|7|8|9 • IdChar := Letter | Digit = a|b . . . y|z|A|B . . . Y|Z|0| . . . |9 • Id := Letter . IdChar* • FPNum := Digit . Digit* . '.' . Digit . Digit* • Note that the decimal point must be quoted.

  26. More Concise Notation (Unix) • Square brackets delimit a set of alternatives. • [a-z] = a|b … y|z • [0-9] = 0|1 … 8|9 • Allow more than one range in a bracket. • [a-zA-Z]. • Plus superscript means one or more repetitions. • R+ = R.R*. • e.g.: Id := [a-zA-Z].[a-zA-Z0-9]* • FPNum := [0-9]+'.'[0-9]+
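The bracket and plus notation on slide 26 is close to what POSIX extended regular expressions accept, so the Id and FPNum definitions can be tried out with the standard `regcomp`/`regexec` calls from `<regex.h>`. This is a sketch assuming a POSIX system; the helper names `matches`, `is_id`, and `is_fpnum` are hypothetical, concatenation is written by juxtaposition rather than with '.', and the decimal point is escaped with a backslash instead of being quoted.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole of text matches the POSIX extended RE pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Id := a letter followed by zero or more letters or digits.
   ^ and $ anchor the match to the whole string. */
int is_id(const char *s)
{
    return matches("^[a-zA-Z][a-zA-Z0-9]*$", s);
}

/* FPNum := one or more digits, a decimal point, one or more digits. */
int is_fpnum(const char *s)
{
    return matches("^[0-9]+\\.[0-9]+$", s);
}
```

With these definitions `is_id("x1")` accepts and `is_id("A_3")` rejects, which matches the question raised on slide 21 about identifiers containing an underscore.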

  27. Why REs? • Good notation for describing lexemes. • Directly translatable to a program that recognizes the RE. • Produce an abstract machine that recognizes strings. • Translate to an efficient program that does the same.
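To make the "abstract machine" idea concrete, here is a hand-written finite state machine for the FPNum expression (one or more digits, a decimal point, one or more digits). It is a sketch written by hand, not the output of an automatic RE-to-automaton construction, and the state and function names are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States of a deterministic finite automaton for [0-9]+ '.' [0-9]+ */
typedef enum { START, INT_PART, DOT, FRAC_PART, REJECT } dfa_state;

/* Transition function: next state for the current state and character. */
static dfa_state step(dfa_state s, char c)
{
    int digit = isdigit((unsigned char)c) != 0;
    switch (s) {
    case START:     return digit ? INT_PART : REJECT;
    case INT_PART:  return digit ? INT_PART : (c == '.' ? DOT : REJECT);
    case DOT:       return digit ? FRAC_PART : REJECT;
    case FRAC_PART: return digit ? FRAC_PART : REJECT;
    default:        return REJECT;
    }
}

/* Run the automaton over s; accept only if it ends in FRAC_PART. */
int dfa_accepts_fpnum(const char *s)
{
    dfa_state state = START;
    assert(s != NULL);
    for (; *s != '\0' && state != REJECT; s++)
        state = step(state, *s);
    return state == FRAC_PART;
}
```

The recognizer reads each character once and keeps only the current state, which is why the translation from REs to such machines yields efficient scanners.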

  28. PA1: Symbol Tables and Modularity • Goal is to build the first part of a compiler: the symbol table. • Another goal is to introduce a good programming style: modularity and abstract data types. • The symbol table holds information on the identifiers in a program. • Identifiers are the names of variables, functions, constants, etc. • The compiler hands the symbol table a string containing the identifier’s name (e.g. “x”) and gets back a record containing information on the identifier.

  29. Interface • Two data types: • symbol_table – a symbol table. • symbol_table_entry – entry in a symbol table with information on an identifier. • Five functions: • symbol_table *make_symbol_table (int fold_case). • symbol_table_entry *get_symbol (symbol_table tbl, char *str). • symbol_table_entry *put_symbol (symbol_table tbl, char *str). • symbol_table *clear_symbol_table (symbol_table tbl). • void print_statistics (symbol_table tbl).

  30. Digression: Abstract Datatypes (ADTs) • An abstract data type (ADT) is a building block. • An object with a carefully specified interface. • The interface isolates clients from implementation details. • Change the way that ADT is built without modifying clients. • Implications: • Cannot let clients outside ADT see internal details. • Export an interface with a complete set of operations. • Only the functions exported by ADT know the details. • Clients hold instances of object and invoke operations on them.

  31. ADTs in C++/C • C++ classes support ADTs. • C does not support this program development methodology. • Must reveal full definition of a struct to clients. • Need opaque datatypes. • Can support ADTs by convention (not compiler checking). • In C, break a datatype into two parts: • Interface definition (.h file). • Define the datatypes exported by the ADT and the operations on them. • Implementation (.c file). • Actually implement the object and operations. • Make everything not in the interface static.
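One common C convention for the interface/implementation split is to declare only an incomplete struct type in the header, so clients can hold pointers to the object without seeing its fields. The sketch below shows that convention for a hypothetical `counter` ADT, with both parts in one listing for brevity; it is not the course's symbol_table.h, and every name in it is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- Interface: would live in a hypothetical counter.h ---- */
typedef struct counter counter;     /* incomplete type: fields are hidden */
counter *counter_make(void);
void counter_increment(counter *c);
int counter_value(const counter *c);
void counter_free(counter *c);

/* ---- Implementation: would live in a hypothetical counter.c ---- */
struct counter {
    int n;                          /* visible only inside the .c file */
};

/* Helper that is not part of the interface, so it is static. */
static void check(const counter *c)
{
    assert(c != NULL);
}

/* Allocate a counter starting at zero; returns NULL if malloc fails. */
counter *counter_make(void)
{
    counter *c = malloc(sizeof *c);
    if (c != NULL)
        c->n = 0;
    return c;
}

void counter_increment(counter *c)
{
    check(c);
    c->n++;
}

int counter_value(const counter *c)
{
    check(c);
    return c->n;
}

void counter_free(counter *c)
{
    free(c);
}
```

A client that includes only the interface part can call the four functions but cannot write `c->n`, so the representation can change without modifying clients.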

  32. Symbol Table ADT • I have provided the interface definition • (symbol_table.h) • You need to write the 5 symbol table routines described earlier. • I have also written a test driver routine that uses the interface • test_symbol.c • You should not have to look at the code in this file.

  33. Implementing Symbol Tables • Probably the best (and simplest) implementation is a hash table. • Have a large table holding entries. • Want to find the entry associated with a particular string. • Where do you start looking in the table? • “Hash” the string by computing a function from strings to integers. • Function should be chosen so entries are evenly distributed. • What if two entries fall in the same slot? • Search nearby entries. • Chain together entries into a linked list.
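The chaining option can be sketched as follows. This is an illustration under stated assumptions, not a solution to the assignment: the types `sym_table` and `sym_entry`, the functions `sym_get` and `sym_put`, the fixed bucket count, and the multiply-by-31 hash are all hypothetical choices that do not come from symbol_table.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64                 /* assumed table size */

typedef struct sym_entry {
    char *name;                     /* private copy of the identifier */
    struct sym_entry *next;         /* next entry in the same bucket */
} sym_entry;

typedef struct {
    sym_entry *buckets[NBUCKETS];   /* heads of the collision chains */
} sym_table;

/* Stand-in hash: multiply-and-add over the characters. */
static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Look a name up; returns NULL if it is not in the table. */
sym_entry *sym_get(sym_table *t, const char *name)
{
    assert(t != NULL && name != NULL);
    for (sym_entry *e = t->buckets[hash_name(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Return the existing entry for name, or insert a new one at the head
   of its chain. Returns NULL only if memory allocation fails. */
sym_entry *sym_put(sym_table *t, const char *name)
{
    sym_entry *e = sym_get(t, name);
    if (e != NULL)
        return e;
    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->name, name);
    unsigned h = hash_name(name);
    e->next = t->buckets[h];
    t->buckets[h] = e;
    return e;
}
```

A table must start with all buckets set to NULL, for example `sym_table t = { { NULL } };`. Two names that hash to the same slot simply end up on the same linked list.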

  34. Hash functions • Two desirable properties. • Fast to compute. • Spread the expected collection of strings over the table evenly. • A reasonable function is:
      x = strlen (str);
      while (*str != '\0') {
        x = x ^ *str;
        x = x << 1;
        str++;
      }
      • Worry about common cases: i, j, x1, x2, x3, … .
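The hash loop from slide 34 can be wrapped in a function to see where the common short names land. This sketch keeps the slide's xor-then-shift computation; the function name `hash_string`, the unsigned arithmetic, and the final reduction modulo a caller-supplied `table_size` are assumptions added here.

```c
#include <assert.h>
#include <string.h>

/* Hash from slide 34: start from the string length, then for each
   character xor it in and shift left by one bit. The result is
   reduced modulo table_size to give a slot index. */
unsigned hash_string(const char *str, unsigned table_size)
{
    assert(str != NULL && table_size > 0);
    unsigned x = (unsigned)strlen(str);
    while (*str != '\0') {
        x ^= (unsigned char)*str;
        x <<= 1;
        str++;
    }
    return x % table_size;
}
```

With a table of 64 slots (ASCII assumed), `i` and `j` land in slots 16 and 22, and `x1`, `x2`, `x3` in slots 10, 12, and 14. Because the last step is always a left shift, the value before the modulo is even for any non-empty string, so with an even table size only even slots are ever used; an odd or prime table size is one way to avoid that.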
