Regular Expressions (RE's) are a means of describing languages in finite terms. The aim is to convert a RE into a Deterministic Finite State Automaton (DFA) that recognises the valid words of a language. JLex, the Java version of Lex, generates such a recogniser automatically and classifies the tokens it finds. RE's have limitations, however: they cannot recognise balanced parentheses, so more complex structures call for Context Free Grammars (CFG's). CFG's define syntactic structure declaratively using productions. This text explains the transition from RE's to DFA's and the role of CFG's in language recognition.
Regular Expressions (RE's) - Review
• A means of describing a possibly infinite language in finite terms.
• We aim to turn a RE into a Deterministic Finite State Automaton (DFA).
• Steps:
  • 1: RE -> Non-Deterministic Finite State Automaton (FSA)
  • 2: FSA -> DFA
  • 3: DFA -> minDFA (the equivalent DFA with the fewest states)
• The aim is to create a mechanism to recognise valid words in a language.
• In our course this means recognising words like int, float, public etc.
• These are called Tokens.
• NB: The mechanism also classifies the Tokens!
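To make the end product of the steps above concrete, here is a minimal sketch of a table-driven DFA in Java. It is hand-written rather than derived from a RE, and everything in it is an assumption made for illustration: the class name TinyDfa, the three states, the token class names, and the choice to look keywords up in a set after the automaton accepts a word (a common shortcut, not something the slides prescribe).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative table-driven DFA that recognises and classifies one word. */
final class TinyDfa {
    // States: START, inside a word, inside an integer; DEAD means "reject".
    private static final int START = 0, IN_WORD = 1, IN_INT = 2, DEAD = -1;

    // Hypothetical keyword set, using the example words from the slides.
    private static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("int", "float", "public"));

    // TRANSITION[state][charClass] gives the next state.
    // Character classes: 0 = letter, 1 = digit, 2 = anything else.
    private static final int[][] TRANSITION = {
        /* START   */ { IN_WORD, IN_INT,  DEAD },
        /* IN_WORD */ { IN_WORD, IN_WORD, DEAD },
        /* IN_INT  */ { DEAD,    IN_INT,  DEAD },
    };

    private static int charClass(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    /** Returns "KEYWORD", "IDENTIFIER", "INTEGER" or "ERROR" for the whole input. */
    static String classify(String word) {
        int state = START;
        for (int i = 0; i < word.length() && state != DEAD; i++) {
            state = TRANSITION[state][charClass(word.charAt(i))];
        }
        switch (state) {
            case IN_WORD: return KEYWORDS.contains(word) ? "KEYWORD" : "IDENTIFIER";
            case IN_INT:  return "INTEGER";
            default:      return "ERROR"; // empty input (still in START) or DEAD
        }
    }
}
```

For example, TinyDfa.classify("public") yields "KEYWORD" and TinyDfa.classify("42") yields "INTEGER". The accepting state reached at the end of the input is what classifies the word, which is the point of the NB above.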
JLex
• Java version of Lex.
• Input: a file containing RE's and JLex macros (the .lex file).
• We run JLex over this .lex file and a .java file is produced.
• We then obtain Tokens one at a time by calling next_token() on the generated scanner.
• No need to code the DFA ourselves: it is generated automatically, which saves time.
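The calling pattern can be sketched with a hand-written stand-in. This is not JLex output and does not reproduce the generated class or its real signatures; only the method name next_token() comes from the slides. The class name SketchScanner, the token kinds, and the "KIND:text" return format are assumptions. The sketch also uses java.util.regex, whose alternation picks the first matching alternative rather than the longest match that Lex-style tools use, so a word boundary is added after the keywords to compensate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hand-written stand-in for a generated scanner; NOT actual JLex output. */
final class SketchScanner {
    // Hypothetical token kinds; each is also a named group in TOKEN below.
    private static final String[] KINDS = { "KEYWORD", "IDENT", "NUMBER", "PLUS" };

    // Alternatives are tried in order; \b stops "int" matching inside "integer".
    private static final Pattern TOKEN = Pattern.compile(
          "(?<KEYWORD>(?:int|float|public)\\b)"
        + "|(?<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<NUMBER>[0-9]+)"
        + "|(?<PLUS>\\+)");

    private final String input;
    private int pos = 0;

    SketchScanner(String input) {
        this.input = input;
    }

    /** Returns "KIND:text" for the next token, or null at end of input. */
    String next_token() {
        // Skip whitespace between tokens.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return null;
        }
        Matcher m = TOKEN.matcher(input);
        m.region(pos, input.length());
        if (!m.lookingAt()) {
            throw new IllegalStateException("Illegal character at position " + pos);
        }
        pos = m.end();
        // Report which named group matched, i.e. the token's class.
        for (String kind : KINDS) {
            if (m.group(kind) != null) {
                return kind + ":" + m.group(kind);
            }
        }
        throw new AssertionError("no token kind matched");
    }
}
```

A driver would create the scanner over its input and call next_token() in a loop until it returns null. On the input "public int x1 + 42" the stand-in returns the keyword public, the keyword int, the identifier x1, the plus sign and the number 42, in that order.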
Limitations of RE's
• Say we define the following RE's:
  • digits = [0-9]+
  • sum = (digits "+")* digits
• With these we can define sums like 3+78+9 etc.
• If instead we have:
  • digits = [0-9]+
  • sum = expr "+" expr
  • expr = "(" sum ")" | digits
• we can define (1+(5+8)) etc.
• But sum and expr now refer to each other recursively, so they are no longer RE's.
• It is impossible for a RE to recognise balanced parentheses.
• A machine with only N states can only recognise up to N levels of parenthesis nesting.
• Therefore we need a new notation to represent the language above.
• We move on to Context Free Grammars.
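The contrast between the two definitions can be sketched in Java. The flat sum maps directly onto a java.util.regex pattern, while the nested form needs recursion; here that is a small recursive-descent recogniser, one method per rule. The class and method names (SumRecognizer, isFlatSum, isExpr) are made up for this illustration, and the recogniser accepts no whitespace.

```java
import java.util.regex.Pattern;

/** Illustrative recognisers for the two "sum" definitions in the slides. */
final class SumRecognizer {
    // Flat form: sum = (digits "+")* digits, directly expressible as a regex.
    private static final Pattern FLAT_SUM = Pattern.compile("(?:[0-9]+\\+)*[0-9]+");

    static boolean isFlatSum(String s) {
        return FLAT_SUM.matcher(s).matches();
    }

    // Nested form: sum = expr "+" expr ; expr = "(" sum ")" | digits
    private final String s;
    private int pos = 0;

    private SumRecognizer(String s) {
        this.s = s;
    }

    /** True if the whole input is an expr of the recursive definition. */
    static boolean isExpr(String input) {
        SumRecognizer r = new SumRecognizer(input);
        return r.expr() && r.pos == input.length();
    }

    // expr = "(" sum ")" | digits
    private boolean expr() {
        if (eat('(')) {
            return sum() && eat(')');
        }
        return digits();
    }

    // sum = expr "+" expr
    private boolean sum() {
        return expr() && eat('+') && expr();
    }

    // digits = [0-9]+
    private boolean digits() {
        int start = pos;
        while (pos < s.length() && s.charAt(pos) >= '0' && s.charAt(pos) <= '9') {
            pos++;
        }
        return pos > start;
    }

    // Consumes c if it is the next character.
    private boolean eat(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }
}
```

SumRecognizer.isFlatSum("3+78+9") and SumRecognizer.isExpr("(1+(5+8))") both return true, while isExpr("(1+(5+8)") returns false because one parenthesis is left open. Each open parenthesis costs one frame on the call stack, and that unbounded memory is exactly what a machine with a fixed number of states lacks.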
Context Free Grammars (CFG's)
• RE's define lexical structure declaratively.
• Similarly, CFG's define syntactic structure declaratively.
• Definitions:
  • A language is a set of strings.
  • Each string is a finite sequence of symbols.
  • Symbols come from a finite alphabet.
• A CFG describes a language and is formed of productions.
• E.g. symbol -> sym1 sym2 sym3 ... sym(N)
• Symbols are either:
  • 1: Terminal: corresponds to a Token.
  • 2: Non-Terminal: a variable that denotes a set of strings.
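One way to picture these definitions is to hold a grammar as plain data. The sketch below writes the sum/expr definitions from the previous slide as three productions. The class names Production and Grammar and the token names PLUS, LPAREN, RPAREN and DIGITS are hypothetical, and it assumes that a symbol is a non-terminal exactly when it appears on the left of some production, with every other symbol a terminal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** One production: lhs -> rhs[0] rhs[1] ... */
final class Production {
    final String lhs;        // always a non-terminal
    final List<String> rhs;  // sequence of terminals and non-terminals

    Production(String lhs, String... rhs) {
        this.lhs = lhs;
        this.rhs = Arrays.asList(rhs);
    }

    @Override
    public String toString() {
        return lhs + " -> " + String.join(" ", rhs);
    }
}

/** A grammar is just a list of productions. */
final class Grammar {
    final List<Production> productions = new ArrayList<>();

    Grammar add(String lhs, String... rhs) {
        productions.add(new Production(lhs, rhs));
        return this;
    }

    /** Symbols that appear on the left-hand side of some production. */
    Set<String> nonTerminals() {
        Set<String> result = new LinkedHashSet<>();
        for (Production p : productions) {
            result.add(p.lhs);
        }
        return result;
    }

    /** Right-hand-side symbols that are never defined: these are the Tokens. */
    Set<String> terminals() {
        Set<String> defined = nonTerminals();
        Set<String> result = new LinkedHashSet<>();
        for (Production p : productions) {
            for (String symbol : p.rhs) {
                if (!defined.contains(symbol)) {
                    result.add(symbol);
                }
            }
        }
        return result;
    }

    /** The sum/expr language, with hypothetical token names. */
    static Grammar sumGrammar() {
        return new Grammar()
            .add("sum", "expr", "PLUS", "expr")
            .add("expr", "LPAREN", "sum", "RPAREN")
            .add("expr", "DIGITS");
    }
}
```

Calling Grammar.sumGrammar().nonTerminals() gives sum and expr, and terminals() gives PLUS, LPAREN, RPAREN and DIGITS. The alternation written with | in the earlier definition becomes two separate productions for expr.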