CS 3813: Introduction to Formal Languages and Automata

Chapter 11: A Hierarchy of Formal Languages and Automata. These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata, 3rd ed., by Peter Linz, published by Jones and Bartlett Publishers, Inc., Sudbury, MA, 2001. They are intended for classroom use only and are not a substitute for reading the textbook.

Diagrams from some slides are from a previous year’s textbook: Martin, John C., Introduction to Languages and the Theory of Computation. Boston: WCG McGraw-Hill, 1991. Slides are for use of this class only.

Functions • A function is a mapping from a set of elements (called the domain) to another set of elements (called the range). • If the domain and range are the set of strings over an alphabet, we call it a string function. If the domain and range are the set of natural numbers, we call it a number-theoretic function. • Any natural number can be represented by a string. We will see later that any string can be represented by a natural number. (This will turn out to be important.)

Computability • A function is partial Turing computable if there is a TM that computes it and the TM stops on all inputs in the domain of the function (which may be a subset of all possible inputs) • A function is Turing computable if there is a TM that computes it and the TM stops on all inputs

Languages • Given the set of all possible strings over an alphabet, a language is a subset of this set. • A language can be represented by a characteristic function that has the set of all strings as its domain and {0, 1} as its range. It maps a string to 1 if it is in the language, and otherwise maps it to 0. • When we extend the concept of computability to languages, we usually call it “decidability.”

Decidability • A Turing computable language has a characteristic function that is Turing computable. A Turing computable language is also called a decidable language. • A semi-decidable language has a TM that outputs 1 (or equivalently, halts) for every input string in the language, and does not halt for any input string that is not in the language. • So, we talk about computability for functions, and decidability for languages. But it’s the same idea.
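The difference between deciding and semi-deciding can be sketched in Python, with ordinary functions standing in for Turing machines. The example language {a^n b^n : n >= 0} and the names chi_anbn and semi_decide are assumptions made for illustration only; they do not come from the textbook.

```python
def chi_anbn(s: str) -> int:
    """Characteristic function of the example language {a^n b^n : n >= 0}.

    It returns 0 or 1 on every input, so it plays the role of a decider.
    """
    n = len(s) // 2
    return 1 if len(s) % 2 == 0 and s == "a" * n + "b" * n else 0


def semi_decide(s: str) -> int:
    """Semi-decider for the same language.

    Halts with output 1 on members and deliberately never halts on
    non-members, which models a TM that rejects by looping forever.
    """
    if chi_anbn(s) == 1:
        return 1
    while True:  # never returns for strings outside the language
        pass
```

Calling semi_decide on a non-member never returns, which is exactly why a semi-decider alone does not give a membership test.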

Review of definitions • A function can be Turing computable, partial Turing computable, or uncomputable. What are the differences? • A language can be decidable, semi-decidable, or undecidable. What are the differences?

Enumerability • A language is said to be Turing enumerable if there is a TM that lists all the strings of the language. (Note that the TM never terminates if the language is infinite.) • Some facts: • A language is Turing enumerable if and only if it is semi-decidable. • If a language and its complement are Turing enumerable, then the language is decidable. • If a language is decidable, then its complement is decidable.
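The second fact can be illustrated with a small sketch: if both a language and its complement can be listed, alternating between the two listings must eventually turn up any given input. Python generators stand in for enumerating TMs here, and the example language (the even natural numbers) and all function names are assumptions for illustration. The sketch also assumes both listings are infinite.

```python
from itertools import count


def enumerate_evens():
    """Hypothetical enumerator for L: the even natural numbers."""
    for n in count(0):
        yield 2 * n


def enumerate_odds():
    """Hypothetical enumerator for the complement of L."""
    for n in count(0):
        yield 2 * n + 1


def decide(x, enum_l, enum_complement) -> bool:
    """Alternate between the two listings until x appears in one of them.

    Every x is in L or in its complement, so one of the two tests
    eventually succeeds and the loop always terminates.
    """
    in_l, in_complement = enum_l(), enum_complement()
    while True:
        if next(in_l) == x:
            return True
        if next(in_complement) == x:
            return False
```

For example, decide(7, enumerate_evens, enumerate_odds) inspects 0, 1, 2, 3, … in turn and stops when it reaches 7 in the complement's listing.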

Church-Turing thesis • This thesis (not theorem!) holds that any algorithmic procedure that can be performed by a human or computer can be performed by a TM. • It can’t be proved, but is widely believed. • First implication: instead of describing a TM in detail, we can describe a high-level algorithm and assume a TM can be described that computes it. • Second implication: if we can show that a problem cannot be solved by any TM, we may conclude that it can’t be solved by any computer.

Universal Turing machine • A TM that takes as input the description of a TM (a “program”) and an input string, simulates (“runs”) the TM on the input, and returns the result. • Can be viewed as a programmable TM. • Equivalently, can be viewed as an “interpreter” for a TM programming language. Just as you can write an interpreter for C in C, you can construct a universal TM that is an interpreter for TM programs. • Although Turing developed the concept of a universal TM for theoretical reasons, it helped stimulate the development of stored-program computers.

“Programming” a universal Turing machine • We can encode any TM as a unique string (or program) over some fixed alphabet, say {0,1}. • We can encode any input to the TM as a string over the same alphabet. • There are many ways to do this and it doesn’t matter what method we use … what matters is that we can do this at all.
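As one possible illustration, here is a minimal sketch of such an encoding. The scheme is an assumption chosen for simplicity, not necessarily the one used in the textbook: states, tape symbols, and head moves are numbered from 1, the number n is written as n ones, fields are separated by a single 0, and transitions are separated by 00.

```python
def encode_tm(transitions):
    """Encode a non-empty list of transitions as a string over {0, 1}.

    Each transition is a 5-tuple of positive integers
    (state, symbol, next_state, written_symbol, move); how states, symbols,
    and moves are numbered is an assumption of this sketch.
    """
    return "00".join("0".join("1" * n for n in t) for t in transitions)


def decode_tm(code):
    """Invert encode_tm, showing that the encoding loses no information."""
    return [tuple(len(field) for field in t.split("0"))
            for t in code.split("00")]
```

Because every field is a non-empty run of ones, the separator 00 can only occur between transitions, so decoding is unambiguous.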

Important questions • How many Turing machines are there? • How many functions are there? • How many computable functions are there? • How many languages are there? • How many decidable languages are there? • We’ll come back to these questions later. To answer them, we first need to discuss what it means for a set to be countably or uncountably infinite. And for that, we begin with a review of set theory.

Review of set theory The cardinality of a set is the number of elements in the set. For example, let S = {2, 4, 6}. Then |S| = 3. The powerset of S is the set of all subsets of S. For example, 2^S = {{}, {2}, {4}, {6}, {2,4}, {2,6}, {4,6}, {2,4,6}}

The cardinality of powersets • We can use mathematical induction to prove that the cardinality of the powerset of a finite set S is 2^|S|. What about a more difficult question: what is the cardinality of the powerset of an infinite set?
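For finite sets, the claim can be checked directly. The sketch below builds the powerset with the standard itertools module; the function name powerset is chosen here for illustration.

```python
from itertools import chain, combinations


def powerset(s):
    """Return all subsets of the finite set s as a list of frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]
```

For S = {2, 4, 6} this produces 8 subsets, matching 2^3. Of course, checking small cases is not a proof; the induction argument is still needed.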

Countable sets • Two sets have the same cardinality if their elements can be put in 1-1 correspondence with each other • An infinite set is countable if its elements can be placed in 1-1 correspondence with the natural numbers, that is, if its elements can be listed sequentially. Basically, this amounts to being able to specify what the first element of the set is, what the second is, etc.

The even natural numbers are countable
2 4 6 8 10 12 14 … 2n …
1 2 3 4 5  6  7  … n  …
The set of even natural numbers has the same cardinality as the set of natural numbers, although it is a strict subset of the set of natural numbers.

The integers are countable
integer:  … -4 -3 -2 -1 0 1 2 3 4 …
position: …  9  7  5  3 1 2 4 6 8 …
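The pairing on this slide can be written as an explicit formula: a positive integer z sits at position 2z, and zero or a negative integer z sits at position 1 - 2z. The function names below are chosen for illustration.

```python
def position_of(z: int) -> int:
    """Counting number paired with the integer z (0, 1, -1, 2, -2, ...)."""
    return 2 * z if z > 0 else 1 - 2 * z


def integer_at(n: int) -> int:
    """Inverse map: the integer found at position n, for n >= 1."""
    return n // 2 if n % 2 == 0 else -(n // 2)
```

Since each function undoes the other, the pairing is a 1-1 correspondence between the integers and the counting numbers.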

The rational numbers are countable
Here are the rational numbers:
1/1 1/2 1/3 1/4 1/5 1/6 1/7 ...
2/1 2/2 2/3 2/4 2/5 2/6 ...
3/1 3/2 3/3 3/4 3/5 ...
4/1 4/2 4/3 4/4 …
5/1 5/2 5/3 …
6/1 6/2 …
7/1 ...
What is the first rational number? 1/1
What is the second rational number? 2/1
What is the third rational number? 1/2
What is the fourth rational number? 3/1
etc.
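The listing order on this slide walks the grid one anti-diagonal at a time, that is, in order of increasing numerator + denominator. A minimal sketch follows; the generator names are chosen for illustration, and the second generator, which skips repeated values such as 2/2 = 1/1, is an addition not shown on the slide.

```python
from fractions import Fraction
from itertools import count, islice


def positive_rationals():
    """Yield (numerator, denominator) pairs one anti-diagonal at a time."""
    for total in count(2):                   # total = numerator + denominator
        for num in range(total - 1, 0, -1):  # walk the diagonal from p/1 up to 1/p
            yield num, total - num


def distinct_rationals():
    """Same walk, but skip values that were already listed."""
    seen = set()
    for num, den in positive_rationals():
        q = Fraction(num, den)
        if q not in seen:
            seen.add(q)
            yield q
```

Every pair (p, q) lies on the diagonal p + q, and each diagonal is finite, so every positive rational is reached after finitely many steps.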

The real numbers are uncountable (Cantor’s diagonal argument)
Assume the real numbers between 0 and 1 can be listed in order as infinite decimals.
f_0: 0. f_0(0) f_0(1) f_0(2) f_0(3) …
f_1: 0. f_1(0) f_1(1) f_1(2) f_1(3) …
f_2: 0. f_2(0) f_2(1) f_2(2) f_2(3) …
f_3: 0. f_3(0) f_3(1) f_3(2) f_3(3) …
...
Consider the real number f whose n-th digit is f(n) = f_n(n) + 1 (taken mod 10, so that the result is still a digit). Note that for every i, f(i) ≠ f_i(i). Therefore f is not in the list. This contradiction disproves the assumption that the real numbers between 0 and 1 are countable.

The real numbers are uncountable • Didn’t get that? OK; let’s try again. • We can define the first real number. Let’s arbitrarily make 0.0 the first real number. That means that we can put it into one-to-one correspondence with the number 1. • real # 0.0 • counting # 1 • But now what is the second real number? • real # 0.0 X • counting # 1 2 • No matter what number we pick for X, we can always find another real number in between the previous real number and X. For example, we can divide X by 2. That gives us another real number in between 0.0 and X. • Caution: this is intuition only. The same observation holds for the rational numbers, which are countable, so it shows only that the reals cannot be listed in increasing order. The diagonal argument is the actual proof.

The real numbers are uncountable Since we cannot specify what the second, third, fourth, etc. elements of the set of real numbers are, the set of real numbers is uncountable, or uncountably infinite. Definition: A set is uncountably infinite if it is impossible to sequentially list its elements. Georg Cantor used diagonalization to distinguish between different levels of infinity. ℵ0 (aleph null) = cardinality of the integers. 2^ℵ0 = cardinality of the real numbers (this equals ℵ1, aleph one, if the continuum hypothesis is assumed).

The powerset of an infinite set S is uncountable
The proof is by contradiction using diagonalization. Let S be countably infinite and assume that its powerset is countable; this means that the subsets of S can be listed in sequence. (If S itself is uncountable, its powerset is uncountable too, since it contains every one-element subset of S.) Order the elements of S sequentially. Represent each subset of S by an infinite row of 0’s and 1’s, where 1 indicates that the corresponding element of S occurs in it.
Element # of elements in the original set
     1 2 3 4 5 6 ...
S1:  1 0 1 1 0 1 …
S2:  0 0 1 1 0 0 …
S3:  1 1 1 0 0 1 …
S4:  1 0 1 0 1 1 …
…

The powerset of an infinite set S is uncountable
Element # of elements in the original set
     1 2 3 4 5 6 ...
S1:  1 0 1 1 0 1 …
S2:  0 0 1 1 0 0 …
S3:  1 1 1 0 0 1 …
S4:  1 0 1 0 1 1 …
…
Consider Sx, a subset of S that differs from the i-th listed subset in its i-th position, that is, along the diagonal. It will be represented by:
Sx:  0 1 0 1 …
Note that Sx is a valid subset of S, but it is not identical to any of the subsets already listed. Its existence contradicts the assumption that the powerset of an infinite set is countable.
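The construction of Sx can be tried out on the finite corner of the table shown above. This is only a sketch of the diagonal step on four rows, not the infinite argument itself; the names rows and diagonal_complement are chosen for illustration.

```python
def diagonal_complement(rows):
    """Flip the i-th bit of the i-th row (0-based), one bit per row."""
    return [1 - row[i] for i, row in enumerate(rows)]


# The first four rows of the table, truncated to six columns.
rows = [
    [1, 0, 1, 1, 0, 1],  # S1
    [0, 0, 1, 1, 0, 0],  # S2
    [1, 1, 1, 0, 0, 1],  # S3
    [1, 0, 1, 0, 1, 1],  # S4
]
s_x = diagonal_complement(rows)  # begins 0 1 0 1, as on the slide
```

By construction, s_x disagrees with row i in column i, so it cannot equal any row in the list.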

The powerset of an infinite set S is uncountable Don’t worry if you don’t get this right away; we will see this in more detail a few slides later on.

Formal languages and countability • A formal language is a set of strings over an alphabet. Is this set countably or uncountably infinite? • If the symbols of an alphabet are arranged in order, we can define a lexicographical ordering over the strings in any language over that alphabet. • “alphabetical order” is an example of a lexicographic ordering • What does this imply about the countability of the strings in any language?

Formal languages and countability • Answer: The set of strings in a language is countable (finite or countably infinite). • Proof: • Divide the strings of the language into subsets based on their length; i.e., put all strings of length 1 together, all strings of length 2 together, etc. • Within each set, put the strings in lexicographical order • Merge the subsets, preserving their order • Now put the strings into one-to-one correspondence with the counting numbers

Formal languages and countability • Example: L = {ww : w is a nonempty string over Σ}, where Σ = {a, b}
1 aa
2 bb
3 aaaa
4 abab
5 baba
6 bbbb
7 aaaaaa
. . .
(the strings are listed in canonical order)
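The listing above can be generated mechanically. The sketch below enumerates all strings over an alphabet by length and then alphabetically, and builds ww from each nonempty w; the generator names are chosen for illustration.

```python
from itertools import count, islice, product


def canonical_strings(alphabet, min_len=0):
    """Yield every string over the alphabet, shortest first,
    and alphabetically within each length."""
    for n in count(min_len):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)


def ww_language(alphabet):
    """Yield the strings ww for nonempty w, in canonical order."""
    for w in canonical_strings(alphabet, min_len=1):
        yield w + w
```

The position of a string in this stream is the counting number it is paired with, which is the one-to-one correspondence the proof asks for.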

How many TMs are there? • Because we can encode each TM as a string over an alphabet, the number of possible TMs is countably infinite. • From this we may also conclude that the number of possible programs in any programming language is countably infinite.

Formal languages and countability (continued) • Any language over Σ is a subset of Σ*. • How many possible languages over Σ are there? (In other words, how many subsets of Σ* are there?)

How many languages are there? • Answer: There are an uncountably infinite number of languages • Proof: • Any language over Σ is a subset of Σ* • Σ* is an infinite set • The powerset of Σ* is the set of all subsets of Σ*, that is, the set of all languages over Σ • The powerset of an infinite set is uncountable

How many TMs are there? • What does this imply about whether all languages are decidable?

How many TMs are there? • We have shown that: • The set of strings in a language is countable • We can represent any Turing Machine as a string over the alphabet Σ = {0, 1} • Therefore, the number of TMs is countably infinite • But there are an uncountably infinite number of languages • Consequently, we cannot put the TMs into one-to-one correspondence with the languages

How many TMs are there? • This means that there are more languages than there are TMs. • Every TM accepts all and only the strings of one specific language. • Therefore, there must be some languages that cannot be recognized by any TM. • Next chapter will talk about specific languages that are not decidable (and specific functions that are not computable).

11.1: Recursive and recursively enumerable languages Remember that the strings that a TM accepts constitute the language of the Turing machine. We represent this as L(T). A Turing machine always accepts the words of its language by stopping in the halting state. However, it is allowed to reject strings that don’t belong to its language either by crashing (in a finite number of steps), or by looping forever.

Recursive and recursively enumerable languages Infinite loops are bad for us, because if we are trying to decide whether a string belongs to the language of a TM or not, we can’t tell after waiting a finite amount of time whether the TM is going to halt on the very next step, or whether it is going to go on forever. We would prefer to have our TMs crash to reject a string.

Recursive and recursively enumerable It turns out that these distinctions exactly correspond to the last two major classes of languages that we want to discuss in this course: Recursively enumerable = accepted by a TM that may loop (or may crash) to reject Recursive = accepted by a TM that always crashes to reject

Definition 11.1: If L ⊆ Σ* is a language, then a Turing machine T with input alphabet Σ is said to accept L if L(T) = L. The Turing machine T recognizes or decides L if T computes the characteristic function χ_L: Σ* → {0, 1}. In other words, T halts for every string x in Σ*, outputting a 1 if x ∈ L, and outputting a 0 otherwise.

Definitions: Definition 11.1: A language L is recursively enumerable if there is a TM that accepts L. Definition 11.2: A language L is recursive if there is a TM that recognizes L. This means that a language is recursive iff there exists a membership algorithm for it. A language that is accepted by some TM but has no membership algorithm is recursively enumerable but not recursive.

We also know: The set of recursive languages is a proper subset of the set of recursively enumerable languages.

Theorem: If L1 and L2 are recursively enumerable languages over Σ, then L1 ∪ L2 and L1 ∩ L2 are also recursively enumerable languages.

Theorem: If L1 and L2 are recursive languages over Σ, then L1 ∪ L2 and L1 ∩ L2 are also recursive languages. If L is a recursive language, then its complement L' is also a recursive language (the prime means “complement”). (Proof: Take the TM that decides L and swap its outputs, so that it writes 0 where it wrote 1 and 1 where it wrote 0.)
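A sketch of why these closure properties hold for recursive languages, with Python functions that always return 0 or 1 standing in for deciding TMs. The two example languages and all names are assumptions made for illustration.

```python
def complement(decide):
    """Decider for the complement: swap the outputs 0 and 1."""
    return lambda x: 1 - decide(x)


def union(d1, d2):
    """Decider for L1 ∪ L2: run both deciders, output 1 if either does."""
    return lambda x: max(d1(x), d2(x))


def intersection(d1, d2):
    """Decider for L1 ∩ L2: run both deciders, output 1 only if both do."""
    return lambda x: min(d1(x), d2(x))


# Two hypothetical decidable languages over {a, b}.
def starts_with_a(x):
    return 1 if x.startswith("a") else 0


def even_length(x):
    return 1 if len(x) % 2 == 0 else 0
```

Running the deciders one after the other is safe only because each is guaranteed to halt. For languages that are merely recursively enumerable, the two machines would have to be run in alternation, since the first might never halt.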

Theorem: If L is a recursively enumerable language, and its complement L' is also recursively enumerable, then L must be recursive. Another way to say this is that the only way that a language L and its complement L' can both be recursively enumerable is if both are recursive. Think about this. This implies that the complement of a non-recursive recursively enumerable language is . . . what?

Theorem: The complement of a non-recursive recursively enumerable language is a language that is not recursively enumerable. This means that the language cannot be accepted by any Turing machine, which means that NO automaton can accept the language.

11.1: Enumerating a language Putting a set of strings in canonical order means listing the shortest strings first, and listing the strings of the same length alphabetically. So the set of strings {abb, a, ba, aa, b} would look like this in canonical order: {a, b, aa, ba, abb}. Enumerating a set means to list the elements of the set one at a time – to put them into one-to-one correspondence with the positive integers.
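For a finite set of strings, canonical order is a sort by length first and alphabetically second. A minimal sketch, assuming that Python's default string comparison matches alphabetical order (true for lowercase letters); the function name is chosen for illustration.

```python
def canonical_order(strings):
    """Sort strings shortest first, and alphabetically within each length."""
    return sorted(strings, key=lambda s: (len(s), s))
```

Applied to the set {abb, a, ba, aa, b}, this returns the list a, b, aa, ba, abb given above.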

Theorem: A language L ⊆ Σ* is recursively enumerable (that is, can be accepted by some TM) if and only if L can be enumerated by some TM. How would the TM do this?

Theorem: One way is to list every possible string in canonical order: {λ, a, b, aa, ab, ba, bb, aaa, …} Next, construct a universal TM that contains within it a simulation of the TM that accepts L. Have it write 0 on its own tape to start off. Now run the UTM on the strings. To avoid infinite loops, we make a series of passes:

Theorem: 1st pass: The UTM generates the string λ and simulates one move of the TM on that input. 2nd pass: The UTM simulates two moves of the TM on the string λ, then generates the string a and simulates one move of the TM on that input. 3rd pass: The UTM simulates three moves of the TM on the string λ, two moves of the TM on the string a, then generates the string b and simulates one move of the TM on that input. . . . and so on.

Theorem: Whenever the TM accepts a string, the UTM writes the next integer on its tape. Every string that is accepted by the TM is accepted after a finite number of moves. You can see that each string belonging to L will therefore be accepted on some pass, after a finite number of moves, and the UTM will have written the corresponding integer on its tape. (If L is infinite, the process as a whole never terminates.)
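The pass structure can be sketched in Python. This is a simplified model under stated assumptions, not a real universal TM: a Python generator stands in for a simulated TM (one yield per move, returning means accepting), the toy machine is invented for illustration, and each pass resumes the unfinished simulations for one more move instead of restarting them, which reaches the same configurations. The sketch also yields the accepted strings themselves instead of writing integers.

```python
from itertools import count, islice, product


def canonical_strings(alphabet):
    """Yield every string over the alphabet in canonical order."""
    for n in count(0):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)


def toy_acceptor(s):
    """Hypothetical stand-in for a TM, with one yield per move.

    It scans its input, then accepts (returns) if the input contains "ab"
    and loops forever otherwise.
    """
    for _ in s:
        yield                  # one move per scanned symbol
    while "ab" not in s:
        yield                  # reject by never halting


def enumerate_language(acceptor, alphabet):
    """On each pass, start a simulation on the next string and advance
    every unfinished simulation by one move."""
    strings = canonical_strings(alphabet)
    running = []               # (string, simulation) pairs still in progress
    while True:
        new = next(strings)
        running.append((new, acceptor(new)))
        unfinished = []
        for s, sim in running:
            try:
                next(sim)      # simulate one more move
                unfinished.append((s, sim))
            except StopIteration:
                yield s        # the simulated machine halted and accepted
        running = unfinished
```

No single looping simulation can block the others, because each one receives only one move per pass. With this toy machine the first strings produced are ab, aab, aba, abb, bab.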

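Another observation: Note that, for some languages, the TM may accept a longer string in fewer passes than a shorter string, so the enumeration need not come out in canonical order. However, if the language is recursive (not just recursively enumerable), there is a TM that halts on every input, so each string can be tested to completion before moving on to the next, and the strings of L can then be listed in canonical order.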

Theorem: L is recursive if and only if there is a TM that enumerates L in canonical order.
