Nonregular Languages: Distinguishable Strings

It is easy to see that not all languages can be regular. Given an alphabet, there are uncountably many possible formal languages (possible sets of strings), but there are countably many regular expressions and hence countably many regular languages. That is, the number of possible languages is infinitely larger than the number of regular languages, and many (and in fact most) languages are not regular.

Intuitively, any problem where an automaton has to “count arbitrarily high” is not feasible with bounded memory, e.g., for L={ anbn  n0 }L = \{\ a^nb^n\ |\ n\geq 0\ \}, the automaton has to count arbitrarily many a’s to make sure there are the right number of b’s following.

Similarly, any problem where an automaton has to “remember arbitrarily long parts of the input” is not feasible with bounded memory, e.g., for LL being the set of palindromic strings of a’s and b’s, we must remember the entire first half in order to match it against the second half.

But intuition is not a proof that these languages aren’t regular. Maybe there’s some clever trick! After all, the language of strings whose length is a multiple of 3 might initially seem like we need to count the arbitrarily high number of input characters (in which case the language would not be regular), but a bit more thought shows we only need to keep track of the number of characters modulo 3, and the language is regular.

We now turn to the first of two methods for conclusively proving that a language is not regular, and no DFA for the langauge can possibly exist.

Definition

Given an arbitrary language LΣL \subseteq \Sigma^*, we say that strings w,vΣw, v \in \Sigma^* are distinguishable if there’s a suffix xΣx \in \Sigma^* (the distinguishing suffix) such that wxLwx \in L and vx∉Lvx \not\in L, or vice-versa.

Example

When L={ anbn  n0  }={ϵ,ab,aabb,aaabbb,}L = \{\ a^nb^n\ |\ n\geq 0\ \ \} = \{\epsilon, ab, aabb, aaabbb, \ldots\} ,

  • a is distinguishable from aa, because abb∉L\mathtt{a\underline{bb}}\not\in L but aabbL\mathtt{aa\underline{bb}} \in L (or because abL\mathtt{a\underline{b}}\in L but aab∉L\mathtt{aa\underline{b}} \not\in L)
  • a is distinguishable from aab, because aabbL\mathtt{a\underline{abb}} \in L but aababb∉L\mathtt{aab\underline{abb}} \not\in L .
  • aab isn’t distinguishable from aaabb, because aab ∉L\mathtt{aab\underline{~}} \not\in L and aaabb ∉L\mathtt{aaabb\underline{~}} \not\in L ; aaba∉L\mathtt{aab\underline{a}} \not\in L and aaabba∉L\mathtt{aaabb\underline{a}} \not\in L ; aabbL\mathtt{aab\underline{b}} \in L and aaabbbL\mathtt{aaabb\underline{b}} \in L ; aabaa∉L\mathtt{aab\underline{aa}} \not\in L and aaabbaa∉L\mathtt{aaabb\underline{aa}} \not\in L ; etc.

Example

If L={ 13i  i0  }L = \{\ 1^{3i}\ |\ i\geq 0\ \ \}, i.e., sequences of 1’s whose length is a multiple of 3, we have:

  • 1 is distinguishable from 11 (e.g., we can use the suffix 1);
  • 11 is distinguishable from 111 (e.g., we can use the suffix ϵ\epsilon);
  • 1 is distinguishable from 111 (e.g., we can use the suffix ϵ\epsilon);

Theorem

Let LL be a language, and let DD be a deterministic state machine for LL (finite or infinite).

If ww and vv take us to the same state in DD), they aren’t distinguishable. (because for any suffix xx, wxwx and vxvx lead to the same state in this deterministic machine and so they both accept or both reject.

Conversely, if ww and vv can be distinguished by at least one suffix xx, then it must be that ww and vv take us to different states in the machine.

Corollary

Let LL be a language.

If you can find a set of nn strings all distinguishable from each other,
then in any deterministic machine for LL these strings lead from start to nn different states (since otherwise they wouldn’t all be distinguishable); in this case, no deterministic machine for LL can have fewer than nn states.

And if you can find an infinite set of strings all distinguishable from each other,
then in any deterministic machine these strings all lead to different states,
so no finite deterministic machine (DFA) for LL can exist, so LL is not regular.

Note that the set of distinguishable strings can consist of arbitrary strings; they won’t necessarily be in the language LL, but they will typically be strings that are prefixes (initial segments) of strings in LL.

Example

The set L={ w{a,b}   w is a palindrome }L = \{\ w\in\{a,b\}^{*}\ |\ \ w \mathrm{\ is\ a\ palindrome}\ \} is not regular.

Proof:
Consider the infinite set of strings S:={  anbn  n0 }S := \{\ \ a^nb^n\ |\ n \geq 0\ \}.
To show: for all pairs of strings in SS, the strings are distinguishable.
Let ajbja^jb^j and akbka^kb^k (with jkj\not=k) be two arbitrary strings in SS.
Then ajbjbjajLa^jb^j\underline{b^ja^j}\in L but akbkbjaj∉La^kb^k\underline{b^ja^j} \not\in L, so they’re distinguishable.
Thus SS is an infinite set of pairwise-distinguishable strings, and LL is not regular.

Example

The set L={ anbn  n0  }L = \{\ a^nb^n\ |\ n\geq 0\ \ \} is not regular.

Proof:
Consider the infinite set of strings S:={  an  n0 }S := \{\ \ a^n\ |\ n \geq 0\ \}.
To show: for all pairs of strings in SS, the strings are distinguishable.
Let aja^j and aka^k (with jkj\not=k) be two arbitrary strings in SS.
Then ajbjLa^j\underline{b^j}\in L but akbj∉La^k\underline{b^j} \not\in L, so they’re distinguishable.
Thus SS is an infinite set of pairwise-distinguishable strings, and LL is not regular.

Conclusion

  • The Distinguishable-Strings method has the nice property that it (in principle) will always work; if a language is not regular, there exists an infinite set of pairwise-distinguishable strings that demonstrates the language is not regular.
  • When searching for a suitable infinite set of distinguishable strings, the elements of the set SS do not need to be members of the language LL (and quite often they’re not), but they do need to be prefixes of strings that appear in LL.