
Week 2 – Lecture 1



  1. Week 2 – Lecture 1 Compiler Construction • Lexical Analysis • The language of Lexical Analysis • Regular Expressions • DFAs and NFAs • Errors in Lexical Analysis Reading: section 2.6, Chapter 3

  2. Lexical Analysis • Why split it from parsing? • Simplifies design • A parser that must also handle whitespace and comments is more awkward • Efficiency • Only use the most powerful technique that works • And nothing more • No parsing sledgehammers for lexical nuts • Portability • More modular code • More code re-use

  3. Source Code Characteristics • Code • Identifiers • Count, max, get_num • Language keywords • switch, if .. then.. else, printf, return, void • Mathematical operators • +, *, >> …. • <=, =, != … • Literals • “Hello World” • Comments • Whitespace

  4. Language of Lexical Analysis • Tokens – the categories of lexical units (e.g. identifier, number) • Patterns – the rules describing the set of strings a token can match • Lexemes – the actual character sequences matched in the source

  5. Tokens are not enough… • Clearly, if we replaced every occurrence of a variable with a token then we would lose other valuable information • Other data items are attributes of the tokens • Stored in the symbol table
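A minimal Python sketch of this idea (representation and names are illustrative, not from the slides): an identifier token carries only an index into the symbol table, and the lexeme itself becomes an attribute stored in the table entry.

```python
# Hypothetical sketch: an ID token keeps a pointer into the symbol table;
# the lexeme (and any later attributes) live in the table entry.
symbol_table = []      # entries hold token attributes (here just the lexeme)
symbol_index = {}      # lexeme -> position in symbol_table

def id_token(lexeme):
    """Return an ('ID', n) token, interning the lexeme in the symbol table."""
    if lexeme not in symbol_index:
        symbol_index[lexeme] = len(symbol_table)
        symbol_table.append({"lexeme": lexeme})
    return ("ID", symbol_index[lexeme])
```

A second occurrence of the same identifier yields the same index, so nothing is lost by replacing the variable with its token.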

  6. Token delimiters • When does a token/lexeme end? e.g. xtemp=ytemp
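One common answer, sketched here in Python: keep consuming characters while they could still extend the lexeme, and stop at the first character that cannot — that delimiter then starts the next token.

```python
def scan_identifier(text, pos):
    """Consume identifier characters from pos; the token ends at the first
    character that cannot extend it - that delimiter starts the next token."""
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
        pos += 1
    return text[start:pos], pos   # the lexeme and where scanning resumes
```

On `xtemp=ytemp` the scan stops at `=`, so `xtemp` is one token and `=` begins the next.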

  7. Ambiguity in identifying tokens • A programming language definition will state how to resolve uncertain token assignment • <> Is it 1 or 2 tokens? • Disambiguating rules state what to do • Reserved keywords (e.g. if) take precedence over identifiers • ‘Principle of longest substring’
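Both disambiguating rules can be sketched in a few lines of Python (token names and the tiny operator set are illustrative): the alternatives in the pattern are ordered so the longest substring wins (`<=` and `<>` before `<`), and reserved keywords are checked before a lexeme is classified as an identifier.

```python
import re

# Longest-substring rule: longer operators listed before their prefixes.
KEYWORDS = {"if", "then", "else"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|<=|<>|<|=")

def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        lexeme = m.group()
        if lexeme in KEYWORDS:                 # keywords beat identifiers
            tokens.append(("KEYWORD", lexeme))
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append(("ID", lexeme))
        else:
            tokens.append(("OP", lexeme))
    return tokens
```

So `<>` is recognised as one token, not two, and `if` never comes back as an identifier.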

  8. Regular Expressions • To represent patterns of strings of characters • REs • Alphabet – set of legal symbols • Meta-characters – characters with special meanings • ε is the empty string • 3 basic operations • Choice – choice1|choice2, • a|b matches either a or b • Concatenation – firstthing secondthing • (a|b)c matches the strings { ac, bc } • Repetition (Kleene closure) – repeatme* • a* matches { ε, a, aa, aaa, aaaa, … } • Precedence: * is highest, | is lowest • Thus a|bc* is a|(b(c*))

  9. Regular Expressions (2) • We can add in regular definitions • digit = 0|1|2 …|9 • And then use them: • digit digit* • A sequence of 1 or more digits • One or more repetitions: • (a|b)(a|b)* ≡ (a|b)+ • Any character in the alphabet . • .*b.* – strings containing at least one b • Ranges [a-z], [a-zA-Z], [0-9] (assume character set ordering) • Not: ~a or [^a]
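The notations above carry over almost directly into Python's `re` module, which uses the same `|`, concatenation, `*`, `+`, `.` and `[^…]` with these meanings:

```python
import re

# The slide's patterns written as Python regular expressions.
digits = re.compile(r"[0-9][0-9]*")   # digit digit*: one or more digits
plus   = re.compile(r"(a|b)+")        # same language as (a|b)(a|b)*
has_b  = re.compile(r".*b.*")         # strings containing at least one b
not_a  = re.compile(r"[^a]")          # any single character except a
```

For example, `digits.fullmatch("2024")` succeeds while `has_b.fullmatch("aaa")` fails.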

  10. Limitations of REs • REs can describe many language constructs but not all • For example Alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b’s S= {a, bab, bbabb, bbbabbb, …}
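The set S is still easy to decide by counting — it is only regular expressions, with their finite memory, that cannot express it. A small Python sketch makes the contrast concrete:

```python
def in_language(s):
    """Membership in { b^n a b^n : n >= 0 }: decidable by counting, but not
    describable by any RE, since matching counts needs unbounded memory."""
    i = s.find("a")
    if i == -1 or "a" in s[i + 1:]:
        return False                    # exactly one 'a' is required
    # as many b's before the 'a' as after it
    return s[:i] == "b" * i and s[i + 1:] == "b" * i
```

The checker accepts `a`, `bab`, `bbabb`, … and rejects everything else over {a, b}.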

  11. Transition Diagrams • An algorithm to match REs • [Diagram: start → state 1 –digit→ state 2] • Double lines mean an accepting state • Matches all single digits • Anything else goes to an ‘error state’, not usually shown
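The two-state diagram translates directly into code (state numbers follow the slide; returning `False` plays the role of the unshown error state):

```python
def matches_single_digit(s):
    """Run the diagram: start in state 1, accept only in state 2."""
    state = 1
    for ch in s:
        if state == 1 and ch.isdigit():
            state = 2                # the double-lined accepting state
        else:
            return False             # anything else: the error state
    return state == 2
```

It accepts exactly the one-character strings `"0"` … `"9"`.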

  12. Lookahead • <=, <>, < • When we read a token delimiter to establish a token we need to make sure that it is still available • It is the start of the next token! • This is lookahead • Decide what to do based on the character we ‘haven’t read’ • Sometimes implemented by reading from a buffer and then pushing the input back into the buffer • And then starting with recognizing the next token
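The buffer-and-pushback idea can be sketched as a small Python class (a hypothetical helper, not from the slides): read the delimiter to finish one token, then unread it so it is still available to start the next.

```python
class Reader:
    """Lookahead via a pushback buffer."""
    def __init__(self, text):
        self.text, self.pos, self.pushed = text, 0, []

    def read(self):
        if self.pushed:                  # serve pushed-back input first
            return self.pushed.pop()
        if self.pos >= len(self.text):
            return ""                    # end of input
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def unread(self, ch):
        self.pushed.append(ch)           # push the input back into the buffer
```

After deciding on the character we ‘haven't read’, `unread` returns it, and the next `read` sees it again.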

  13. Classic Fortran example • DO 99 I=1,10 becomes DO99I=1,10 versus the assignment DO99I=1.10 • Fortran ignores blanks, so the lexer cannot decide whether DO is a keyword or part of the identifier DO99I until it sees the ‘,’ or ‘.’ • When can the lexical analyzer assign a token? • Push back into input buffer • or ‘backtracking’

  14. Transition Diagrams (2) • Attach return values to accepting states • [Diagram: start → state 1 –‘<’→ state 2; state 2 –‘=’→ state 3, return less_eq; state 2 –‘>’→ state 4, return not_eq; state 2 –[other]→ state 5*, return less_than] • States marked * need lookahead: the [other] character is pushed back • ‘other’ is context sensitive
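The ‘<’ portion of this diagram can be sketched in Python: one character of lookahead picks the return value, and on [other] the position is not advanced past the lookahead character, so it starts the next token.

```python
def scan_relop(text, pos):
    """Scan a relational operator beginning with '<' at pos.
    Returns (token_name, position where the next token starts)."""
    assert text[pos] == "<"
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if nxt == "=":
        return "less_eq", pos + 2
    if nxt == ">":
        return "not_eq", pos + 2
    return "less_than", pos + 1      # lookahead char pushed back
```

Note that for `less_than` the returned position still points at the unconsumed lookahead character.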

  15. Transition Diagrams (3) • Recognizing signed integers: 3, +2, -45, +379, 1001 … • [Diagram: start → state 1; state 1 –‘+’ or ‘-’→ state 2; states 1 and 2 –digit→ state 3 (accepting), which loops on digit] • (+|-)? digit digit* = + digit digit* | - digit digit* | digit digit*
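The same language, written as the slide's RE in Python's `re` syntax:

```python
import re

# (+|-)? digit digit* : an optional sign followed by one or more digits.
SIGNED_INT = re.compile(r"(\+|-)?[0-9][0-9]*")

def is_signed_int(s):
    return SIGNED_INT.fullmatch(s) is not None
```

This accepts all of the slide's examples and rejects a bare sign or `1.10`.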

  16. DFAs • [Diagram: from state 1, three moves all beginning with ‘<’: ‘<’ ‘=’ → LE, ‘<’ ‘>’ → NE, ‘<’ → LT] • This is not a DFA because we have 3 different possible moves on ‘<’ from state 1 • In a DFA each state has at most one move per input symbol

  17. NFAs • ε-transitions • ε-transitions can ‘glue together’ automata, enabling us to build large automata easily from lots of small ones

  18. RE -> NFA (Thompson’s Construction) • a: start –a→ accept • ab: the machines for a and b, joined by an ε-transition from a’s accepting state to b’s start state • a|b: a new start state with ε-transitions into the a and b machines, whose accepting states have ε-transitions into a new accepting state • a*: ε-transitions allow the a machine to be bypassed (zero repetitions) or re-entered (many repetitions)
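A compact Python sketch of the construction (the (start, accept, edges) representation and function names are my own; `""` stands for an ε-transition), together with the standard ε-closure simulation to run the resulting NFA:

```python
from collections import defaultdict
from itertools import count

fresh = count()                      # generator of new state numbers

def merged(e1, e2):
    edges = defaultdict(list)
    for e in (e1, e2):
        for q, outs in e.items():
            edges[q].extend(outs)
    return edges

def symbol(a):                       # a: start -a-> accept
    s, f = next(fresh), next(fresh)
    return s, f, {s: [(a, f)]}

def concat(n1, n2):                  # ab: eps glues a's accept to b's start
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    edges = merged(e1, e2)
    edges[f1].append(("", s2))
    return s1, f2, edges

def choice(n1, n2):                  # a|b: new start/accept, four eps-edges
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    s, f = next(fresh), next(fresh)
    edges = merged(e1, e2)
    edges[s] += [("", s1), ("", s2)]
    edges[f1].append(("", f))
    edges[f2].append(("", f))
    return s, f, edges

def star(n):                         # a*: eps allows zero or many passes
    s1, f1, e1 = n
    s, f = next(fresh), next(fresh)
    edges = merged(e1, {})
    edges[s] += [("", s1), ("", f)]
    edges[f1] += [("", s1), ("", f)]
    return s, f, edges

def eps_closure(states, edges):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for label, r in edges.get(q, []):
            if label == "" and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(nfa, text):              # simulate the NFA on text
    s, f, edges = nfa
    current = eps_closure({s}, edges)
    for ch in text:
        moved = {r for q in current for lbl, r in edges.get(q, []) if lbl == ch}
        current = eps_closure(moved, edges)
    return f in current
```

For instance, `concat(choice(symbol("a"), symbol("b")), star(symbol("c")))` builds the machine for (a|b)c*.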

  19. Overall Picture • Regular Expression → NFA (Thompson’s construction, Algorithm 3.3, pg 122) → DFA (subset construction, Algorithm 3.2, pg 118) → Program • Or write an ad-hoc Lexical Analyzer (Fig 3.22, pg 116) • Or go directly RE -> DFA (pg 135) • Or use Flex
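The NFA → DFA step can be sketched with the subset construction: each DFA state is the set of NFA states the input could be in. The hand-written ε-free NFA below (a textbook-style machine for (a|b)*abb; state numbers are illustrative) keeps the example small.

```python
# NFA for (a|b)*abb: state 0 loops on a,b; 0-a->1, 1-b->2, 2-b->3 (accept).
NFA = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}, 2: {"b": {3}}}

def subset_construction(nfa, start, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa, work = {}, [start_set]
    while work:
        S = work.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = frozenset(r for q in S for r in nfa.get(q, {}).get(a, ()))
            dfa[S][a] = T
            work.append(T)
    return dfa, start_set

def run(dfa, start_set, accepting, text):
    """Run the constructed DFA; accept if any NFA state in the final set accepts."""
    S = start_set
    for ch in text:
        S = dfa[S][ch]
    return bool(S & accepting)
```

Running it on `"babb"` reaches a DFA state containing NFA state 3, so the string is accepted.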

  20. Lexical Errors • Only a small percentage of errors can be recognised during Lexical Analysis • Consider fi (good == bad) … if = good; • Both are lexically fine: fi scans as a legal identifier and if as a keyword, so the mistakes only surface during parsing or later

  21. Examples from the Oberon language (QUT) • Line ends inside literal string • Illegal character in input file • Input file ends inside a comment • Invalid exponent in REAL constant • Number too long • Illegal use of underscore in identifier

  22. In general • What does a lexical error mean? • Strategies: • “Panic-mode” • Delete chars from input until something matches • Inserting characters • Re-ordering characters • Replacing characters • An error like “illegal character” should be reported sensibly
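The panic-mode strategy can be sketched in a few lines of Python (the `can_start` predicate is a hypothetical hook deciding which characters can begin a token):

```python
def panic_skip(text, pos, can_start):
    """Panic-mode recovery: delete characters from the input until one
    could start a valid token, and report what was skipped."""
    start = pos
    while pos < len(text) and not can_start(text[pos]):
        pos += 1
    return text[start:pos], pos      # skipped characters, resume position
```

The returned skipped characters give the material for a sensible error message rather than a silent discard.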
