Nonregular Languages: Distinguishable Strings

It is easy to see that not all languages can be regular. Given an alphabet, there are uncountably many possible formal languages (possible sets of strings), but there are countably many regular expressions and hence countably many regular languages. That is, the number of possible languages is infinitely larger than the number of regular languages, and many (and in fact most) languages are not regular.

Intuitively, any problem where an automaton has to “count arbitrarily high” is not feasible with bounded memory, e.g., for $L = \{\ a^nb^n\ |\ n\geq 0\ \}$ , the automaton has to count arbitrarily many a’s to make sure there are the right number of b’s following.

Similarly, any problem where an automaton has to “remember arbitrarily long parts of the input” is not feasible with bounded memory, e.g., for $L$ being the set of palindromic strings of a’s and b’s, we must remember the entire first half in order to match it against the second half.

But intuition is not a proof that these languages aren’t regular. Maybe there’s some clever trick! After all, the language of strings whose length is a multiple of 3 might initially seem like we need to count the arbitrarily high number of input characters (in which case the language would not be regular), but a bit more thought shows we only need to keep track of the number of characters modulo 3, and the language is regular.

We now turn to the first of two methods for conclusively proving that a language is not regular, and no DFA for the langauge can possibly exist.

Definition

Given an arbitrary language $L \subseteq \Sigma^*$ , we say that strings $w, v \in \Sigma^*$ are distinguishable if there’s a suffix $x \in \Sigma^*$ (the distinguishing suffix) such that $wx \in L$ and $vx \not\in L$ , or vice-versa.

Example

When $L = \{\ a^nb^n\ |\ n\geq 0\ \ \} = \{\epsilon, ab, aabb, aaabbb, \ldots\}$ ,

a is distinguishable from aa, because $\mathtt{a\underline{bb}}\not\in L$ but $\mathtt{aa\underline{bb}} \in L$ (or because $\mathtt{a\underline{b}}\in L$ but $\mathtt{aa\underline{b}} \not\in L$ )
a is distinguishable from aab, because $\mathtt{a\underline{abb}} \in L$ but $\mathtt{aab\underline{abb}} \not\in L$ .
aab isn’t distinguishable from aaabb, because $\mathtt{aab\underline{~}} \not\in L$ and $\mathtt{aaabb\underline{~}} \not\in L$ ; $\mathtt{aab\underline{a}} \not\in L$ and $\mathtt{aaabb\underline{a}} \not\in L$ ; $\mathtt{aab\underline{b}} \in L$ and $\mathtt{aaabb\underline{b}} \in L$ ; $\mathtt{aab\underline{aa}} \not\in L$ and $\mathtt{aaabb\underline{aa}} \not\in L$ ; etc.

Example

If $L = \{\ 1^{3i}\ |\ i\geq 0\ \ \}$ , i.e., sequences of 1’s whose length is a multiple of 3, we have:

1 is distinguishable from 11 (e.g., we can use the suffix 1);
11 is distinguishable from 111 (e.g., we can use the suffix $\epsilon$ );
1 is distinguishable from 111 (e.g., we can use the suffix $\epsilon$ );

Theorem

Let $L$ be a language, and let $D$ be a deterministic state machine for $L$ (finite or infinite).

If $w$ and $v$ take us to the same state in $D$ ), they aren’t distinguishable. (because for any suffix $x$ , $wx$ and $vx$ lead to the same state in this deterministic machine and so they both accept or both reject.

Conversely, if $w$ and $v$ can be distinguished by at least one suffix $x$ , then it must be that $w$ and $v$ take us to different states in the machine.

Corollary

Let $L$ be a language.

If you can find a set of $n$ strings all distinguishable from each other,
then in any deterministic machine for $L$ these strings lead from start to $n$ different states (since otherwise they wouldn’t all be distinguishable); in this case, no deterministic machine for $L$ can have fewer than $n$ states.

And if you can find an infinite set of strings all distinguishable from each other,
then in any deterministic machine these strings all lead to different states,
so no finite deterministic machine (DFA) for $L$ can exist, so $L$ is not regular.

Note that the set of distinguishable strings can consist of arbitrary strings; they won’t necessarily be in the language $L$ , but they will typically be strings that are prefixes (initial segments) of strings in $L$ .

Example

The set $L = \{\ w\in\{a,b\}^{*}\ |\ \ w \mathrm{\ is\ a\ palindrome}\ \}$ is not regular.

Proof:
Consider the infinite set of strings $S := \{\ \ a^nb^n\ |\ n \geq 0\ \}$ .
To show: for all pairs of strings in $S$ , the strings are distinguishable.
Let $a^jb^j$ and $a^kb^k$ (with $j\not=k$ ) be two arbitrary strings in $S$ .
Then $a^jb^j\underline{b^ja^j}\in L$ but $a^kb^k\underline{b^ja^j} \not\in L$ , so they’re distinguishable.
Thus $S$ is an infinite set of pairwise-distinguishable strings, and $L$ is not regular.

Example

The set $L = \{\ a^nb^n\ |\ n\geq 0\ \ \}$ is not regular.

Proof:
Consider the infinite set of strings $S := \{\ \ a^n\ |\ n \geq 0\ \}$ .
To show: for all pairs of strings in $S$ , the strings are distinguishable.
Let $a^j$ and $a^k$ (with $j\not=k$ ) be two arbitrary strings in $S$ .
Then $a^j\underline{b^j}\in L$ but $a^k\underline{b^j} \not\in L$ , so they’re distinguishable.
Thus $S$ is an infinite set of pairwise-distinguishable strings, and $L$ is not regular.

Conclusion

The Distinguishable-Strings method has the nice property that it (in principle) will always work; if a language is not regular, there exists an infinite set of pairwise-distinguishable strings that demonstrates the language is not regular.
When searching for a suitable infinite set of distinguishable strings, the elements of the set $S$ do not need to be members of the language $L$ (and quite often they’re not), but they do need to be prefixes of strings that appear in $L$ .