LANGUAGE TRANSLATORS: WEEK 14

LECTURE:
• REGULAR EXPRESSIONS
• FINITE STATE MACHINES
• LEXICAL ANALYSERS
• INTRO TO GRAMMAR THEORY
TUTORIAL: CAPTURING LANGUAGES USING REGULAR EXPRESSIONS

LEXICAL ANALYSIS
• Is the first step in the translation/compilation process: input language ====> output language
• Means grouping the raw characters of the input into TOKENS.

LEXICAL ANALYSIS PHASE
• The language of TOKENS (e.g. identifiers) is normally a regular language.
• REGULAR EXPRESSIONS generate regular languages (as do Regular Grammars). The tokens of languages are often specified by regular expressions.
• Finite State Machines recognise regular languages.

REGULAR EXPRESSIONS
• A one-line method of specifying a language
• Equivalent to 'type 3' or regular grammars
• Used to parameterize UNIX/LINUX file processing commands

REGULAR EXPRESSIONS - DEFINITION
EXAMPLE              DEFINITION
a | b                '|' means choice
a | b | c = [abc]    '[..]' is shorthand for multiple choice
e                    'e' means the empty word
(abc)*               '*' means repetition 0, 1 or more times
(abcd)+              '+' means repetition 1 or more times

REGULAR EXPRESSIONS - EXAMPLES
• [a-zA-Z][a-zA-Z0-9]* defines the language of IDENTIFIERS in some programming languages
• (xyz)* defines the language {e, xyz, xyzxyz, xyzxyzxyz, ..}
• [abcd]+ defines the language {a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, ..}
Putting choice and repetition together produces complicated regular languages.
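The three example expressions can be tried out directly with Java's standard java.util.regex package. This is only an illustrative sketch: the class name RegexExamples and the helper inLanguage are invented for this example, and the empty word (written e on the slide) is simply the empty string "" in Java.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // Identifier: one letter followed by any number of letters or digits.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // Zero or more repetitions of the word "xyz" (so the empty word is included).
    static final Pattern XYZ_STAR = Pattern.compile("(xyz)*");
    // One or more symbols, each chosen from a, b, c, d (the empty word is excluded).
    static final Pattern ABCD_PLUS = Pattern.compile("[abcd]+");

    // True when the WHOLE string belongs to the language defined by the pattern.
    static boolean inLanguage(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLanguage(IDENTIFIER, "count2")); // a letter, then letters/digits
        System.out.println(inLanguage(IDENTIFIER, "2count")); // starts with a digit
        System.out.println(inLanguage(XYZ_STAR, ""));         // the empty word
        System.out.println(inLanguage(ABCD_PLUS, "ca"));      // two symbols from [abcd]
    }
}
```

Note that matches() tests the whole string, which corresponds to asking whether a word is in the language, rather than whether it merely contains a matching substring.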

Finite State Machines
• Can be defined by annotated nodes and arcs.
• Regular expressions can be translated into FSMs, but ERROR STATES must be added to the FSMs.
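As a rough illustration of an FSM with an explicit error state, here is a hand-written deterministic machine for the identifier expression [a-zA-Z][a-zA-Z0-9]*. The class name IdentifierFsm, the state names and the method names are assumptions made for this sketch, not part of the lecture.

```java
class IdentifierFsm {
    // START: nothing read yet. IN_ID: a valid identifier has been read so far.
    // ERROR: the added error state; once entered, the machine never leaves it.
    enum State { START, IN_ID, ERROR }

    // One arc of the machine: the next state for the current state and input character.
    static State step(State s, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:
                return letter ? State.IN_ID : State.ERROR;
            case IN_ID:
                return (letter || digit) ? State.IN_ID : State.ERROR;
            default:
                return State.ERROR;
        }
    }

    // Runs the machine over the whole input; IN_ID is the only accepting state.
    static boolean accepts(String input) {
        State s = State.START;
        for (int i = 0; i < input.length(); i++) {
            s = step(s, input.charAt(i));
        }
        return s == State.IN_ID;
    }
}
```

Each input character causes exactly one transition, so the running time grows linearly with the length of the input.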

Regular Expression ==> NDFSM, then NDFSM ==> FSM
[Diagrams: NDFSMs for the regular expressions ab, [ab] and a*, with arcs labelled a and b]
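The NDFSM ==> FSM step is commonly done with the subset construction, in which each state of the deterministic machine is a set of NDFSM states. The sketch below assumes a small hypothetical NDFSM, not taken from the slides, for the expression [ab]*ab (states 0, 1, 2 with 2 accepting) and assumes there are no empty-word transitions. All class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    static final char[] ALPHABET = {'a', 'b'};
    static final int NFA_START = 0;
    static final int NFA_ACCEPT = 2;

    // Hypothetical NDFSM for [ab]*ab: in state 0, reading 'a' may stay in 0 or move to 1.
    static Set<Integer> nfaNext(int state, char c) {
        if (state == 0 && c == 'a') return Set.of(0, 1);
        if (state == 0 && c == 'b') return Set.of(0);
        if (state == 1 && c == 'b') return Set.of(2);
        return Set.of();
    }

    // All NDFSM states reachable from any state in 'states' by reading c.
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> next = new TreeSet<>();
        for (int s : states) {
            next.addAll(nfaNext(s, c));
        }
        return next;
    }

    // Builds the deterministic transition table; each key is a set of NDFSM states.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> buildDfa() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>();
        start.add(NFA_START);
        dfa.put(start, new HashMap<>());
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : ALPHABET) {
                Set<Integer> target = move(current, c);
                dfa.get(current).put(c, target);
                if (!dfa.containsKey(target)) {   // a new deterministic state was discovered
                    dfa.put(target, new HashMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    // Runs the deterministic machine; symbols outside the alphabet reject the input.
    static boolean accepts(String input) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = buildDfa();
        Set<Integer> state = new TreeSet<>();
        state.add(NFA_START);
        for (char c : input.toCharArray()) {
            Map<Character, Set<Integer>> row = dfa.get(state);
            if (!row.containsKey(c)) return false;
            state = row.get(c);
        }
        return state.contains(NFA_ACCEPT);
    }
}
```

For this particular NDFSM the construction yields three deterministic states, {0}, {0,1} and {0,2}. In general the number of deterministic states can be much larger than the number of NDFSM states.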

Example
• Specify a language over the alphabet {w, x, y, z} with the only restrictions being that
• 1. no string contains both x and y, and
• 2. if there is a y and a w in a string, then the first w ALWAYS occurs before the first y
SOLUTION:
• 1. Write down examples and counter-examples
• 2. Decide on any ambiguities
• 3. Use Case Analysis to sub-divide the problem:
language = (a) strings of {w, x, z} UNION (b) strings of {w, y, z} with restriction 2.
- Part (a): = [w x z]+
- Part (b): can assume y is always in a string = [y z]+ | z* w [w z]* y [w y z]*
  (After the first y the remaining symbols may be w, y or z: an x there would break restriction 1, while further w's are still allowed by restriction 2.)
- Put together: answer = [w x z]+ | [y z]+ | z* w [w z]* y [w y z]*
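One way to gain confidence in an answer like this is to compare it, on every short string, against a direct statement of the two restrictions. The sketch below checks the expression [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*, where the tail after the first y is taken to be [wyz]* because an x after a y would break restriction 1. It assumes the empty string is not in the language, since the answer is built with +. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RestrictedLanguageCheck {
    // The candidate answer, written without spaces for java.util.regex.
    static final Pattern ANSWER = Pattern.compile("[wxz]+|[yz]+|z*w[wz]*y[wyz]*");

    // Direct statement of the two restrictions (assumption: empty string excluded).
    static boolean satisfiesRestrictions(String s) {
        if (s.isEmpty()) return false;
        int firstX = s.indexOf('x');
        int firstY = s.indexOf('y');
        int firstW = s.indexOf('w');
        if (firstX >= 0 && firstY >= 0) return false;                     // restriction 1
        if (firstY >= 0 && firstW >= 0 && firstW > firstY) return false;  // restriction 2
        return true;
    }

    // Counts strings of length 1..maxLen on which the regex and the predicate disagree.
    static int countMismatches(int maxLen) {
        char[] alphabet = {'w', 'x', 'y', 'z'};
        int mismatches = 0;
        List<String> level = new ArrayList<>();
        level.add("");
        for (int len = 1; len <= maxLen; len++) {
            List<String> next = new ArrayList<>();
            for (String prefix : level) {
                for (char c : alphabet) {
                    String s = prefix + c;
                    next.add(s);
                    if (ANSWER.matcher(s).matches() != satisfiesRestrictions(s)) {
                        mismatches++;
                    }
                }
            }
            level = next;
        }
        return mismatches;
    }
}
```

An exhaustive check over short strings is not a proof, but it quickly exposes slips such as allowing x after y (for example the string wyx) or forbidding a w after the first y (for example wyw).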

A LEXICAL ANALYSER-GENERATOR (e.g. LEX, JLEX) - how they work
• INPUT REGULAR EXPRESSIONS
• TRANSLATE EACH REGULAR EXPRESSION INTO A NON-DETERMINISTIC FSM
• TRANSLATE THE NON-DETERMINISTIC FSM INTO A DETERMINISTIC FSM (which is easily described as a simple program)

EXAMPLE INPUT TO A LEXICAL ANALYSER-GENERATOR
%%
";" { return new Symbol(sym.SEMI); }
"+" { return new Symbol(sym.PLUS); }
"*" { return new Symbol(sym.TIMES); }
"(" { return new Symbol(sym.LPAREN); }
")" { return new Symbol(sym.RPAREN); }
[0-9]+ { return new Symbol(sym.NUMBER, new Integer(yytext())); }
[ \t\r\n\f] { /* ignore white space. */ }
. { System.err.println("Illegal character: "+yytext()); }

Example: if the string (231+3)*3 was input to the generated lexical analyser, the output would be:
LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
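To make the behaviour of the generated analyser concrete, here is a small hand-written Java approximation of the same rules. It is not what JLEX generates (the generated code is table-driven); the class name TinyLexer and the choice to return token names as strings are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class TinyLexer {
    // Returns the tokens of 'input' as strings such as "LPAREN" or "(NUMBER,231)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                // Rule [0-9]+ : consume the longest run of digits.
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("(NUMBER," + Integer.parseInt(input.substring(start, i)) + ")");
                continue;
            }
            switch (c) {
                case ';': tokens.add("SEMI"); break;
                case '+': tokens.add("PLUS"); break;
                case '*': tokens.add("TIMES"); break;
                case '(': tokens.add("LPAREN"); break;
                case ')': tokens.add("RPAREN"); break;
                case ' ': case '\t': case '\r': case '\n': case '\f':
                    break; // ignore white space
                default:
                    System.err.println("Illegal character: " + c);
            }
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the tokens for the slide's example input, one per line.
        for (String token : tokenize("(231+3)*3")) {
            System.out.println(token);
        }
    }
}
```

For the input (231+3)*3 this returns the same token sequence as listed above: LPAREN, (NUMBER,231), PLUS, (NUMBER,3), RPAREN, TIMES, (NUMBER,3).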

Simple Lexical Analyser

// Symbol and sym are assumed to come from the CUP parser generator
// (java_cup.runtime.Symbol and the generated sym class).
import java_cup.runtime.Symbol;

public class scanner {
  /* single lookahead character */
  protected static int next_char;

  /* advance the input by one character */
  protected static void advance() throws java.io.IOException {
    next_char = System.in.read();
  }

  /* initialise the scanner by reading the first character */
  public static void init() throws java.io.IOException {
    advance();
  }

  /* recognise and return the next complete token */
  public static Symbol next_token() throws java.io.IOException {
    for (;;)
      switch (next_char) {
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
          /* parse a decimal integer */
          int i_val = 0;
          do {
            i_val = i_val * 10 + (next_char - '0');
            advance();
          } while (next_char >= '0' && next_char <= '9');
          return new Symbol(sym.INT, Integer.valueOf(i_val));
        case 'p': advance(); return new Symbol(sym.PRINT);
        case 'r': advance(); return new Symbol(sym.REPEAT);
        case 'u': advance(); return new Symbol(sym.UNTIL);
        case '=': advance(); return new Symbol(sym.ASSIGNS);
        case ';': advance(); return new Symbol(sym.SEMI);
        case '+': advance(); return new Symbol(sym.PLUS);
        case '-': advance(); return new Symbol(sym.MINUS);
        case '(': advance(); return new Symbol(sym.LPAREN);
        case ')': advance(); return new Symbol(sym.RPAREN);
        case 'x': advance(); return new Symbol(sym.ID, "x");
        case 'y': advance(); return new Symbol(sym.ID, "y");
        case 'z': advance(); return new Symbol(sym.ID, "z");
        case -1: return new Symbol(sym.EOF);   /* end of input */
        default:
          /* skip any other character */
          advance();
          break;
      }
  }
}

Introduction to Grammar Theory
• Grammars can be used to generate the syntax of all formal languages – the structural complexity of a language is determined by the simplest grammar that can generate it.
• In order to create parsers, we are interested in “properties of grammars”. For example, the “first set” of a string w of terminals and non-terminals is the set of TERMINAL symbols (tokens) that can appear at the front of some string derived from w using the grammar rules.
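The first set can be computed by repeatedly applying the grammar rules until nothing new is added. The sketch below uses a small hypothetical grammar, not taken from the lecture: E ::= T PLUS E | T and T ::= NUMBER | LPAREN E RPAREN. It also assumes that no rule derives the empty word, which keeps the computation to the leading symbol only. All names are invented for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    // Hypothetical grammar: each non-terminal maps to its alternative right-hand sides.
    static final Map<String, List<List<String>>> GRAMMAR = new LinkedHashMap<>();
    static {
        GRAMMAR.put("E", List.of(List.of("T", "PLUS", "E"), List.of("T")));
        GRAMMAR.put("T", List.of(List.of("NUMBER"), List.of("LPAREN", "E", "RPAREN")));
    }

    // A symbol is a non-terminal exactly when the grammar has rules for it.
    static boolean isNonTerminal(String symbol) {
        return GRAMMAR.containsKey(symbol);
    }

    // First sets of all non-terminals, found by iterating until no set grows.
    static Map<String, Set<String>> firstOfNonTerminals() {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : GRAMMAR.keySet()) {
            first.put(nt, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : GRAMMAR.entrySet()) {
                for (List<String> rhs : rule.getValue()) {
                    // With no empty-word rules, only the leading symbol contributes.
                    String head = rhs.get(0);
                    Set<String> toAdd = isNonTerminal(head) ? first.get(head) : Set.of(head);
                    if (first.get(rule.getKey()).addAll(toAdd)) {
                        changed = true;
                    }
                }
            }
        }
        return first;
    }

    // First set of a non-empty string w of terminals and non-terminals.
    static Set<String> firstOf(List<String> w) {
        String head = w.get(0);
        return isNonTerminal(head) ? firstOfNonTerminals().get(head) : Set.of(head);
    }
}
```

For this grammar, the first set of E is {LPAREN, NUMBER}, because every string derived from E starts with a number or an opening parenthesis. Grammars with empty-word rules need an extra step that also looks past symbols able to derive the empty word.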

Summary:
• Regular expressions are a quick and easy way to specify simple forms of language. They can be easily translated into FSMs, which have nice properties (e.g. a deterministic FSM processes its input in time linear in the input's length).
• There are tools (e.g. JLEX) which take regular expressions as input and output a lexical analyser that recognises the language they define.
