
CS 3304 Comparative Languages






Presentation Transcript


  1. CS 3304 Comparative Languages • Lecture 15: Midterm Exam Review • 13 March 2012

  2. Why Study Programming Languages? (Ch. 1) • Help you choose a language: • Make it easier to learn new languages: some languages are similar; it is easy to walk down the family tree. • Help you make better use of whatever language you use: • Understand obscure features. • Understand implementation costs: choose between alternative ways of doing things, based on knowledge of what will be done underneath. • Figure out how to do things in languages that don't support them explicitly.

  3. Language Groups • Imperative: • von Neumann (Fortran, Pascal, Basic, C). • Object-oriented (Smalltalk, Eiffel, C++?). • Scripting languages (Perl, Python, JavaScript, PHP). • Declarative: • Functional (Scheme, ML, pure Lisp, FP). • Logic, constraint-based (Prolog, VisiCalc, RPG). • Imperative languages, particularly the von Neumann languages, predominate: • They will occupy the bulk of our attention. • We also plan to spend a lot of time on functional and logic languages.

  4. The von Neumann Architecture (Ch. 2) • Fetch-execute cycle (on a von Neumann architecture computer):

    initialize the program counter
    repeat forever
        fetch the instruction pointed to by the counter
        increment the counter
        decode the instruction
        execute the instruction
    end repeat
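
A minimal sketch of the same loop in C, for a hypothetical two-instruction machine; the opcodes, memory layout, and accumulator below are invented purely for illustration, not taken from the slides:

    #include <stdio.h>

    /* Hypothetical opcodes, for illustration only. */
    enum { OP_HALT = 0, OP_ADD = 1 };

    /* A tiny von Neumann machine: program and data share one memory.
       The program below adds memory[4] and memory[5], then halts.    */
    int memory[16] = { OP_ADD, 4, 5, OP_HALT, 2, 3 };
    int pc  = 0;    /* program counter */
    int acc = 0;    /* accumulator     */

    int main(void) {
        for (;;) {                       /* repeat forever              */
            int instr = memory[pc];      /* fetch the instruction at pc */
            pc++;                        /* increment the counter       */
            switch (instr) {             /* decode ...                  */
            case OP_ADD:                 /* ... and execute             */
                acc = memory[memory[pc]] + memory[memory[pc + 1]];
                pc += 2;
                break;
            case OP_HALT:
                printf("acc = %d\n", acc);   /* prints acc = 5 */
                return 0;
            }
        }
    }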

  5. Programming Methodologies • 1950s and early 1960s - Simple applications; worry about machine efficiency. • Late 1960s - People efficiency became important; readability, better control structures: • Structured programming. • Top-down design and step-wise refinement. • Late 1970s - Process-oriented to data-oriented: • Data abstraction. • Middle 1980s - Object-oriented programming: • Data abstraction + inheritance + polymorphism.

  6. Compilation vs. Interpretation • Not opposites. • Not a clear-cut distinction. • Interpretation: • Greater flexibility. • Better diagnostics (error messages). • Compilation: • Better performance.

  7. Compilation Phases

  8. Defining Languages • Recognizers: • A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language. • Example: the syntax analysis part of a compiler (scanning). • Generators: • A device that generates sentences of a language. • One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator.

  9. Regular Expressions • A regular expression is one of the following: • A character. • The empty string, denoted by ε. • Two regular expressions concatenated. • Two regular expressions separated by | (i.e., or). • A regular expression followed by the Kleene star (concatenation of zero or more strings). • Numerical literals in Pascal may be generated by the following:
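
The definitions the slide refers to did not survive the transcript; the usual textbook formulation of Pascal numeric literals as regular expressions looks roughly like this (ε denotes the empty string, and minor details may differ from the original slide):

    digit            → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    unsigned_integer → digit digit*
    unsigned_number  → unsigned_integer ( . unsigned_integer | ε )
                           ( ( e | E ) ( + | - | ε ) unsigned_integer | ε )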

  10. Context-Free Grammars • Context-Free Grammars: • Developed by Noam Chomsky in the mid-1950s. • Language generators, meant to describe the syntax of natural languages. • Define a class of languages called context-free languages. • Backus-Naur Form (1959): • Invented by John Backus to describe Algol 58. • BNF is equivalent to context-free grammars (CFGs). • A CFG consists of: • A set of terminals T. • A set of non-terminals N. • A start symbol S (a non-terminal). • A set of productions.

  11. BNF Fundamentals • In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). • Terminals are lexemes or tokens. • A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. • Nonterminals are often written in italics or enclosed in angle brackets. • Examples of BNF rules:

    <ident_list> → identifier | identifier, <ident_list>
    <if_stmt> → if <logic_expr> then <stmt>

  • Grammar: a finite non-empty set of rules. • A start symbol is a special element of the nonterminals of a grammar.

  12. Scanner Responsibilities • Tokenizing source. • Removing comments. • (Often) dealing with pragmas (i.e., significant comments). • Saving text of identifiers, numbers, strings. • Saving source locations (file, line, column) for error messages.

  13. Deterministic Finite Automaton • Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. • This is a deterministic finite automaton (DFA): • Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. • Specifically, they construct a machine that accepts the language.

  14. Building Scanners • Scanners tend to be built three ways: • Ad-hoc. • Semi-mechanical pure DFA (usually as nested case statements). • Table-driven DFA. • Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically-generated scanners come very close. • Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1): • It is often easier to use perl, awk, sed or similar tools. • Table-driven DFA is what lex and scangen produce: • lex (flex): C code • scangen: numeric tables and a separate driver (Figure 2.12). • ANTLR: Java code.
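
A sketch in C of the kind of driver loop a table-driven scanner uses; the tiny tables here recognize only unsigned integers and are purely illustrative – a generated scanner (lex, scangen) has many more states and character classes:

    #include <ctype.h>
    #include <stdio.h>

    /* States: 0 = start, 1 = in a number (accepting).  Character
       classes: 0 = digit, 1 = anything else.  -1 means "no move". */
    static const int next_state[2][2] = {
        /* digit  other */
        {  1,     -1 },   /* state 0 */
        {  1,     -1 },   /* state 1 */
    };
    static const int accepting[2] = { 0, 1 };

    /* Return 1 and the length of the longest match if s starts with a token. */
    int scan_unsigned(const char *s, int *len) {
        int state = 0, i = 0, last_accept = 0;
        while (s[i] != '\0') {
            int cls = isdigit((unsigned char)s[i]) ? 0 : 1;
            int nxt = next_state[state][cls];
            if (nxt < 0) break;                      /* no transition: stop scanning */
            state = nxt;
            i++;
            if (accepting[state]) last_accept = i;   /* remember the longest match   */
        }
        *len = last_accept;
        return last_accept > 0;
    }

    int main(void) {
        int len;
        if (scan_unsigned("42+x", &len))
            printf("matched %d characters\n", len);  /* prints: matched 2 characters */
        return 0;
    }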

  15. Parsing • By analogy to regular expressions and DFAs, a context-free grammar (CFG) is a generator for a context-free language (CFL): • A parser is a language recognizer. • There are infinitely many grammars for every context-free language: • Not all grammars are created equal, however. • It turns out that for any CFG we can create a parser that runs in O(n³) time. • There are two well-known parsing algorithms that permit this: • Earley's algorithm. • Cocke-Younger-Kasami (CYK) algorithm. • O(n³) is unacceptable for a parser in a compiler: too slow.

  16. Faster Parsing • Fortunately, there are large classes of grammars for which we can build parsers that run in linear time: • The two most important classes are called LL and LR. • LL stands for ‘Left-to-right, Leftmost derivation’. • LR stands for ‘Left-to-right, Rightmost derivation’. • LL parsers are also called ‘top-down’ or 'predictive' parsers; LR parsers are also called ‘bottom-up’ or 'shift-reduce' parsers. • There are several important sub-classes of LR parsers: • Simple LR parser (SLR). • Look-ahead LR parser (LALR). • We won't be going into detail on the differences between them.

  17. LL Parsing • Like the bottom-up grammar, the LL(1) expression grammar (sketched below) captures associativity and precedence, but most people don't find it as pretty: • For one thing, the operands of a given operator aren't in a RHS together! • However, the simplicity of the parsing algorithm makes up for this weakness. • How do we parse a string with this grammar? • By building the parse tree incrementally.
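
The grammar on the original slide is missing from the transcript; the LL(1) expression grammar it describes is usually written roughly like this (note how the two operands of an operator such as + end up in different productions, expr and term_tail):

    expr        → term term_tail
    term_tail   → add_op term term_tail | ε
    term        → factor factor_tail
    factor_tail → mult_op factor factor_tail | ε
    factor      → ( expr ) | identifier | number
    add_op      → + | -
    mult_op     → * | /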

  18. LR Parsing • LR parsers are almost always table-driven: • Like a table-driven LL parser, an LR parser uses a big loop in which it repeatedly inspects a two-dimensional table to find out what action to take. • Unlike the LL parser, the LR driver has non-trivial state (like a DFA), and the table is indexed by current input token and current state. • The stack contains a record of what has been seen SO FAR (NOT what is expected).

  19. Core Issues (Ch. 3) • The early development of programming languages was driven by two complementary goals, machine independence and ease of programming. • Machine independence: a programming language should not rely on the features of any particular instruction set for its efficient implementation (e.g., Java). • Ease of programming: more elusive, and more a matter of aesthetics and trial and error than of science. • Core issues for the midterm: • Names, scopes, and bindings: Chapter 3. • Control-flow constructs: Chapter 6. • Data types: Chapter 7.

  20. Name, Scope, and Binding • A name is a mnemonic character string used to represent something else: • Most names are identifiers. • Symbols (like '+') can also be names. • A binding is an association between two things, such as a name and the thing it names. • The scope of a binding is the part of the program (textually) in which the binding is active. • Binding Time is the point at which a binding is created or, more generally, the point at which any implementation decision is made. • The terms “static” and “dynamic” are generally used to refer to things bound before run time and at run time, respectively.

  21. Storage Allocation Mechanisms • Static: objects are given an absolute address that is retained throughout the program’s execution. • Stack: objects are allocated and deallocated in last-in, first-out order, usually in conjunction with subroutine calls and returns. • Heap: objects may be allocated and deallocated at arbitrary times. They require a more general (and expensive) storage management algorithm.

  22. Static Allocation Examples • Global variables: accessible throughout the program. • Code: the machine instructions. • Static local variables: retain their values from one invocation to the next. • Explicit constants (including strings, sets, etc.): • Small constants may be stored in the instructions. • Tables: most compilers produce a variety of tables used by runtime support routines (debugging, dynamic-type checking, garbage collection, exception handling).
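
A small C example of the "static local variables" bullet above: the static local lives in statically allocated storage and keeps its value across calls, while the ordinary local is re-created in each call's stack frame.

    #include <stdio.h>

    int next_id(void) {
        static int counter = 0;   /* static local: one copy, statically allocated,
                                     retains its value from one invocation to the next */
        int temp = counter + 1;   /* ordinary local: lives in this call's stack frame  */
        counter = temp;
        return counter;
    }

    int main(void) {
        printf("%d\n", next_id());   /* 1 */
        printf("%d\n", next_id());   /* 2 */
        printf("%d\n", next_id());   /* 3 */
        return 0;
    }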

  23. Stack • Central stack for parameters, local variables, and temporaries. • Why a stack? • Allocate space for recursive routines: not necessary if no recursion. • Reuse space: in all programming languages. • Contents of a stack frame (Figure 3.1): arguments and returns, local variables, temporaries, bookkeeping (saved registers, line number, static link, etc.). • Local variables and arguments are assigned fixed offsets from the stack pointer or frame pointer at compile time. • Maintenance of the stack is the responsibility of the calling sequence and the subroutine prolog and epilog: • Saving space: putting as much in the prolog and epilog as possible. • Time may be saved by putting stuff in the caller instead, or by combining what's known in both places (interprocedural optimization).

  24. Heap-Based Allocation • Heap: a region of storage in which subblocks can be allocated and deallocated at arbitrary times. • Heap space management: speed vs. space tradeoffs. • Space concerns: • Internal fragmentation: allocating a block larger than required. • External fragmentation: unused space is fragmented, so the ability to meet allocation requests degrades over time.

  25. Static Scoping • Static (lexical) scope rules: a scope is defined in terms of the physical (lexical) structure of the program: • The determination of scopes can be made by the compiler. • All bindings for identifiers can be resolved by examining the program. • Typically, we choose the most recent, active binding made at compile time. • Most compiled languages, C and Pascal included, employ static scope rules. • The classical example of static scope rules is the most closely nested rule used in block structured languages such as Algol 60 and Pascal (nested subroutines): • An identifier is known in the scope in which it is declared and in each enclosed scope, unless it is re-declared in an enclosed scope. • To resolve a reference to an identifier, we examine the local scope and statically enclosing scopes until a binding is found.

  26. Dynamic Scoping • The key idea in static scope rules is that bindings are defined by the physical (lexical) structure of the program. • With dynamic scope rules, bindings depend on the current state of program execution: • They cannot always be resolved by examining the program because they are dependent on calling sequences. • To resolve a reference, we use the most recent, active binding made at run time. • Dynamic scope rules are usually encountered in interpreted languages: • Early LISP dialects assumed dynamic scope rules. • Such languages do not normally have type checking at compile time because type determination isn’t always possible when dynamic scope rules are in effect.
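
A small illustration of the difference, written in C syntax (C itself is statically scoped; the dynamic-scope outcome described in the comments is what an early LISP-style language would do):

    #include <stdio.h>

    int x = 10;                 /* outer ("global") binding of x          */

    void print_x(void) {
        printf("%d\n", x);      /* which binding of x does this refer to? */
    }

    void f(void) {
        int x = 20;             /* a new, local binding of x              */
        print_x();
    }

    int main(void) {
        f();
        /* Static scoping (what C does): print_x resolves x in the lexically
           enclosing (global) scope and prints 10.
           Dynamic scoping (early LISP dialects): print_x would use the most
           recent active binding, the x local to f, and print 20.            */
        return 0;
    }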

  27. Binding of Referencing Environments • The referencing environment of a statement at run time is the set of active bindings. • A referencing environment corresponds to a collection of scopes that are examined (in order) to find a binding. • Scope rules determine that collection and its order. • Binding rules determine which instance of a scope should be used to resolve references when calling a procedure that was passed as a parameter: • They govern the binding of referencing environments to formal procedures. • Shallow binding: the referencing environment is the one in effect when the subroutine is actually called. • Deep binding: the referencing environment is the one in effect when the subroutine was passed as a parameter.

  28. Semantic Analyzer (Ch. 4) • The principal job of the semantic analyzer is to enforce static semantic rules: • It constructs a syntax tree (usually first). • Information gathered is needed by the code generator. • This interface is a boundary between the front end and the back end. • There is considerable variety in the extent to which parsing, semantic analysis, and intermediate code generation are interleaved. • Fully separated phases: a full parse tree, a syntax tree, and semantic checks. • Fully interleaved phases: no need to build both parse and syntax trees. • A common approach interleaves construction of a syntax tree with parsing (no explicit parse tree), and follows with separate, sequential phases for semantic analysis and code generation.

  29. Static Analysis • Compile-time algorithms that predict run-time behavior. • Analysis is precise if it allows the compiler to determine whether a given program will always follow the rules: type checking. • Also useful when not precise: a combination of compile-time checks and code for run-time checking. • Static analysis is also used for code improvement: • Alias analysis: determine when values can be safely cached in registers. • Escape analysis: determine whether all references to a value are confined to a given context. • Subtype analysis: determine whether an OO variable is of a certain subtype. • Unsafe and speculative optimization. • Conservative and optimistic compilers. • Some languages have tighter semantic rules to avoid dynamic checking.

  30. Attribute Grammars • Both semantic analysis and (intermediate) code generation can be described in terms of annotation, or “decoration” of a parse or syntax tree. • Attribute grammars provide a formal framework for decorating such a tree. • The attribute grammar serves to define the semantics of the input program. • Attribute rules are best thought of as definitions, not assignments. • They are not necessarily meant to be evaluated at any particular time, or in any particular order, though they do define their left-hand side in terms of the right-hand side.

  31. Synthesized Attributes • The S-attributed grammar uses only synthesized attributes. • Its attribute flow (attribute dependence graph) is purely bottom-up. • The arguments to semantic functions in an S-attributed grammar are always attributes of symbols on the right-hand side of the current production. • The return value is always placed into an attribute of the left hand side of the production. • The intrinsic properties of tokens are synthesized attributes initialized by the scanner.
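
A small S-attributed fragment for constant expressions, written in the same arrow notation as the earlier BNF examples; each production computes the val attribute of its left-hand side purely from attributes of its right-hand side (the exact notation for semantic rules varies from tool to tool):

    expr → expr1 + term        expr.val := expr1.val + term.val
    expr → term                expr.val := term.val
    term → term1 * factor      term.val := term1.val * factor.val
    term → factor              term.val := factor.val
    factor → number            factor.val := number.val   (number.val is set by the scanner)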

  32. Inherited Attributes • Inherited attributes: values are calculated when their symbol is on the right-hand side of the current production. • Contextual information flows into a symbol from above or from the side: it provides context. • Symbol table information is commonly passed by means of inherited attributes. • Inherited attributes of the root of the parse tree can be used to represent the external environment. • Example: left-to-right associativity may create a situation where an S-attributed grammar would be cumbersome to use. By passing attribute values left-to-right in the tree, things are much simpler.

  33. Parsers and Attribute Grammars • Each synthesized attribute of an LHS symbol (by definition of synthesized) depends only on attributes of its RHS symbols. • A bottom-up parser: in general paired with an S-attributed grammar. • Each inherited attribute of an RHS symbol (by definition of L-attributed) depends only on: • Inherited attributes of the LHS symbol, or • Synthesized or inherited attributes of symbols to its left in the RHS. • A top-down parser: in general paired with an L-attributed grammar. • There are certain tasks, such as generation of code for short-circuit Boolean expression evaluation, that are easiest to express with non-L-attributed attribute grammars. • Because of the potential cost of complex traversal schemes, however, most real-world compilers insist that the grammar be L-attributed.

  34. Translation Scheme • There are automatic tools that construct a semantic analyzer (attribute evaluator) for a given attribute grammar. In other words, they generate translation schemes for context-free grammars or tree grammars (which describe the possible structure of a syntax tree): • These tools are heavily used in syntax-based editors and incremental compilers. • Most ordinary compilers, however, use ad-hoc techniques. • Most production compilers use an ad hoc, handwritten translation scheme: • Interleave parsing with at least the initial construction of a syntax tree. • Possibly all of semantic analysis and intermediate code generation. • Since the attributes of each production are evaluated as the production is parsed, there is no need for the full parse tree.

  35. Action Routines • An ad-hoc translation scheme that is interleaved with parsing takes the form of a set of action routines: • An action routine is a semantic function that we tell the compiler to execute at a particular point in the parse. • Semantic analysis & code generation interleaved with parsing: action routines can be used to perform semantic checks and generate code. • LL parser generator: an action routine can appear anywhere within a right-hand side. • Implementation: when the parser predicts a production, the parser pushes all of the right-hand side onto the stack. • If semantic analysis & code generation are separate phases, then action routines can be used to build a syntax tree: • A parse tree could be built completely automatically. • We wouldn't need action routines for that purpose.

  36. Space Management for Attributes • If there is a parse tree, the attributes can be stored in nodes. • For a bottom-up parser with an S-attributed grammar, maintain an attribute stack mirroring the parse stack: • Next to every state number is an attribute record for the symbol shifted when entering the state. • Entries are pushed and popped automatically. • For a top-down parser with an L-attributed grammar: • Automatic: an attribute stack that does not mirror the parse stack. • Short-cutting copy rules: action routines allocate and deallocate space for attributes explicitly. • Contextual information: • Symbol table that always represents the current referencing environment.

  37. Chapter 5 • Not included in the exam.

  38. Control Flow (Ch. 6) • Control flow (or ordering) in program execution. • Eight principal categories of language mechanisms used to specify ordering: • Sequencing. • Selection. • Iteration. • Procedural abstraction. • Recursion. • Concurrency. • Exception handling and speculation. • Nondeterminacy. • The relative importance of different categories of control flow varies significantly among the different classes of programming languages.

  39. Expression Evaluation • Expression: • A simple object: e.g., a literal constant, a named variable or constant. • An operator or function applied to a collection of operands or arguments, each of which in turn is an expression. • Function calls: a function name followed by a parenthesized, comma-separated list of arguments. • Operator: built-in function that uses special, simple syntax – one or two arguments, no parentheses or commas. • Sometimes they are “syntactic sugar” for more “normal” looking functions (in C++, a+b is short for a.operator+(b)). • Operand: an argument of an operator.

  40. Precedence and Associativity • Infix notation requires the use of parentheses to avoid ambiguity. • The choice among alternative evaluation orders depends on the precedence and associativity of the operators: • C has a very rich precedence structure: problems with remembering all the precedence levels (15 levels). • Pascal has a relatively flat precedence hierarchy (3 levels). • APL and Smalltalk: all operators are of equal precedence. • Associativity rules specify whether sequences of operators of equal precedence group to the right or to the left: • Usually the operators associate left-to-right. • Fortran: the exponentiation operator ** associates right-to-left. • C: the assignment operator associates right-to-left.
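
A short C illustration of both rules; the comments show the grouping the precedence and associativity rules imply:

    #include <stdio.h>

    int main(void) {
        int a = 8, b = 4, c = 2;
        printf("%d\n", a + b * c);      /* * binds tighter than +:   a + (b * c) = 16 */
        printf("%d\n", a - b - c);      /* - associates left:        (a - b) - c = 2  */
        a = b = c;                      /* = associates right:       a = (b = c)      */
        printf("%d %d %d\n", a, b, c);  /* prints 2 2 2                               */
        return 0;
    }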

  41. References and Values • Subtle but important differences in the semantics of assignment in different imperative languages. • Based on the context, a variable may refer to the value of the variable (r-value) or to its location (l-value) – a named container for a value. • Value model of variables: an expression can be either an l-value or an r-value, based on the context in which it appears. • Built-in types can’t be passed uniformly to methods expecting class-type parameters: wrapper classes, automatic boxing/unboxing. • Reference model of variables: a variable is a named reference for a value – every variable is an l-value. • E.g., integer values (like 2) are immutable. • A variable has to be dereferenced to obtain its value. • (Figure: for b := 2; c := b; a := b + c, the value model gives a, b, and c their own values 4, 2, and 2, while under the reference model b and c both refer to the single object 2 and a refers to 4.)

  42. Structured and Unstructured Flow • Assembly language: conditional and unconditional branches. • Early Fortran: relied heavily on goto statements (and labels):

        IF (A .LT. B) GOTO 10
        ...
     10 ...

  • Late 1960s: the move away from goto statements began. • Move to structured programming in the 1970s: • Top-down design (progressive refinement). • Modularization of code. • Descriptive variable names. • Within a subroutine, a well-designed imperative algorithm can be expressed with only sequencing, selection, and iteration. • Most of the structured control-flow constructs were introduced by Algol 60.

  43. Sequencing • The principal means of controlling the order in which side effects occur. • Compound statement: a delimited list of statements. • Block: a compound statement optionally preceded by a set of declarations. • The value of a list of statements: • The value of its final element (Algol 68). • Programmer's choice (Common Lisp – not purely functional). • Can have side effects; very imperative, von Neumann. • There are situations where side effects in functions are desirable: random number generators. • Euclid and Turing: functions are not permitted to have side effects.

  44. Selection • Selection statement: mostly some variant of if…then…else. • Languages differ in the details of the syntax. • Short-circuited conditions: • The Boolean expression is not used to compute a value but to cause control to branch to various locations. • Provides a way to generate efficient (jump) code. • Parse tree: inherited attributes of the root inform it of the address to which control should branch:

    if ((A > B) and (C > D)) or (E ≠ F) then
        then_clause
    else
        else_clause

  Jump code:

        r1 := A
        r2 := B
        if r1 <= r2 goto L4
        r1 := C
        r2 := D
        if r1 > r2 goto L1
    L4: r1 := E
        r2 := F
        if r1 = r2 goto L2
    L1: then_clause
        goto L3
    L2: else_clause
    L3:

  45. Iteration • Iteration: a mechanism that allows a computer to perform similar operations repeatedly. • Favored in imperative languages. • Mostly some form of loops executed for their side effects: • Enumeration-controlled loops: executed once for every value in a given finite set. • Logically controlled loops: executed until some Boolean condition changes value. • Combination loops: combine the properties of enumeration-controlled and logically controlled loops (Algol 60). • Iterators: executed over the elements of a well-defined set (often called containers or collections in object-oriented code).
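
A minimal C contrast between the first two kinds of loop; the counts and bounds are arbitrary:

    #include <stdio.h>

    int main(void) {
        int i, n = 5;
        for (i = 0; i < n; i++)     /* enumeration-controlled: one iteration per value of i */
            printf("%d ", i);       /* prints 0 1 2 3 4                                     */

        while (n > 0)               /* logically controlled: repeats until the test fails   */
            n = n / 2;

        printf("\n%d\n", n);        /* prints 0                                             */
        return 0;
    }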

  46. Recursion • Recursion requires no special syntax: why? • Recursion and iteration are equally powerful. • Most languages provide both iteration (more “imperative”) and recursion (more “functional”). • Tail-recursive function: additional computation never follows a recursive call. The compiler can reuse the space, i.e., there is no need for dynamic allocation of stack space:

    int gcd(int a, int b) {
        if (a == b) return a;
        else if (a > b) return gcd(a - b, b);
        else return gcd(a, b - a);
    }

  • Sometimes simple transformations are sufficient to produce tail-recursive code: continuation-passing style.
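
Because those recursive calls are in tail position, the compiler can reuse the current stack frame; the effect is the same as this hand-written loop (a sketch of the transformation, not what any particular compiler emits):

    #include <stdio.h>

    int gcd_iter(int a, int b) {
        while (a != b) {            /* same logic as the tail-recursive gcd */
            if (a > b) a = a - b;
            else       b = b - a;
        }
        return a;
    }

    int main(void) {
        printf("%d\n", gcd_iter(12, 18));   /* prints 6 */
        return 0;
    }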

  47. Data Types (Ch. 7) • Implicit context for many operations: • The programmer does not have to specify the context explicitly. • Example: in C, the expression a+b will use integer addition if a and b are integers, floating-point addition if a and b are floating-point values. • Limit the set of operations that may be performed in a semantically valid program: • Example: prevent adding a character and a record. • Type checking cannot prevent all meaningless operations. • It catches enough of them to be useful.

  48. Classification of Types • Discrete (ordinal) types – countable: integer, boolean, char, enumeration, and subrange. • Scalar (simple) types - one-dimensional: discrete, rational, real, and complex. • Composite types: • Records (structures). • Variant records (unions). • Arrays; strings are arrays of characters. • Sets: the mathematical powerset of their base types. • Pointers: l-values. • Lists: no notion of mapping or indexing. • Files.

  49. Type Systems • A type system consists of: • A mechanism to define types and associate them with certain language constructs. • A set of rules for type equivalence, type compatibility, and type inference: • Type equivalence: when are the types of two values the same? • Type compatibility: when can a value of type A be used in a context that expects type B? • Type inference: what is the type of an expression, given the types of the operands? • Compatibility is the more useful concept, because it tells you what you can do. • Polymorphism results when the compiler finds that it doesn't need to know certain things. • Subroutines need to have types if they are first- or second-class values.

  50. Type Checking • Type checking is the process of ensuring that a program obeys the language’s type compatibility rules. • Type clash: a violation of these rules. • Strong typing means that the language prevents you from applying an operation to data on which it is not appropriate. • Static typing: the compiler can check everything at compile time. • Examples: • Common Lisp is strongly typed, but not statically typed. • Ada is statically typed; Pascal is almost statically typed. • C is less strongly typed than Pascal, e.g., unions and subroutines with variable numbers of parameters (see the sketch below). • Java is strongly typed, with a non-trivial mix of things that can be checked statically and things that have to be checked dynamically. • Scripting languages are generally dynamically typed.
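
A sketch of the union case mentioned above: nothing stops a C program from writing one member of a union and reading another, so the same bits are reinterpreted without any type clash being reported (what comes back is implementation-dependent):

    #include <stdio.h>

    union pun {
        int   i;
        float f;
    };

    int main(void) {
        union pun p;
        p.f = 1.0f;            /* store a float ...                                */
        printf("%d\n", p.i);   /* ... read the same bits back as an int: no type
                                  clash is reported; the value printed is whatever
                                  the float's bit pattern looks like as an integer */
        return 0;
    }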
