Regular Expression is Accepted by Finite State Machine

Theorem
Let $R$ be a regular expression.

Then there exists a finite state machine $F$ such that its accepted language $L \left({F}\right)$ is exactly $L \left({R}\right)$, the language defined by $R$.

Proof
This proof proceeds by structural induction.

Case $1$
Let $R$ be the empty-set regular expression, $\varnothing$.

Then:
 * $L \left({R}\right) = \varnothing$

Consider the finite state machine $F_\varnothing$ defined as:


 * $F_\varnothing = \left({S_\varnothing, A_\varnothing, I_\varnothing, \Sigma, T_\varnothing}\right)$

where:


 * $S_\varnothing = \left\{ { \mathsf{Rej} }\right\}$
 * $A_\varnothing = \varnothing$
 * $I_\varnothing = \mathsf{Rej}$
 * $T_\varnothing \left({ s, \sigma }\right) = \mathsf{Rej}$ for all $s \in S_\varnothing, \sigma \in \Sigma$

This machine is always in a rejecting state and never leaves it.

So no word is in $L \left({ F_\varnothing }\right)$.

Therefore:
 * $L \left({ F_\varnothing }\right) = \varnothing = L \left({R}\right)$

Case $2$
Let $R$ be the empty-word regular expression, $\epsilon$.

Then:
 * $L \left({R}\right) = \left\{ { \left[\right] }\right\}$

Consider the finite state machine $F_\epsilon$ defined as:


 * $F_\epsilon = \left({ S_\epsilon, A_\epsilon, I_\epsilon, \Sigma, T_\epsilon }\right)$

where:


 * $S_\epsilon = \left\{ { \mathsf{Acc}, \mathsf{Rej} }\right\}$
 * $A_\epsilon = \left\{ { \mathsf{Acc} }\right\}$
 * $I_\epsilon = \mathsf{Acc}$
 * $T_\epsilon \left({ s, \sigma }\right) = \mathsf{Rej}$ for all $s \in S_\epsilon, \sigma \in \Sigma$

This machine starts out in an accepting state.

So $\left[\right]$ (the empty word) is in $L \left({ F_\epsilon }\right)$.

Furthermore, every symbol moves the machine to the rejecting state $\mathsf{Rej}$, which it then never leaves.

So no other word is in $L \left({ F_\epsilon }\right)$.

Therefore:
 * $L \left({ F_\epsilon }\right) = \left\{ { \left[\right] }\right\} = L \left({R}\right)$

Case $3$
Let $R$ be a literal $\sigma$.

Then:
 * $L \left({R}\right) = \left\{ { \left[{\sigma}\right] }\right\}$

Consider the finite state machine $F_\sigma$ defined as:


 * $F_\sigma = \left({ S_\sigma, A_\sigma, I_\sigma, \Sigma, T_\sigma }\right)$

where:


 * $S_\sigma = \left\{ { \mathsf{Start}, \mathsf{Acc}, \mathsf{Rej} }\right\}$
 * $A_\sigma = \left\{ { \mathsf{Acc} }\right\}$
 * $I_\sigma = \mathsf{Start}$
 * $T_\sigma \left({ \mathsf{Start}, \sigma }\right) = \mathsf{Acc}$
 * $T_\sigma \left({ s', \sigma' }\right) = \mathsf{Rej}$ for all other $s' \in S_\sigma, \sigma' \in \Sigma$

This machine starts out in a rejecting state.

So $\left[{}\right]$ (the empty word) is not in $L \left({ F_\sigma }\right)$.

After receiving the symbol $\sigma$ at the start, this machine moves to an accepting state.

So $\left[{\sigma}\right]$ is in $L \left({ F_\sigma }\right)$.

Any other initial symbol, and any symbol after the first, moves the machine to the rejecting state $\mathsf{Rej}$, which it then never leaves.

So no other word is in $L \left({ F_\sigma }\right)$.

Therefore:
 * $L \left({ F_\sigma }\right) = \left\{ { \left[{\sigma}\right] }\right\} = L \left({R}\right)$
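
The three base machines can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `FSM`, `accepts`, `F_EMPTY_SET`, `F_EMPTY_WORD` and `literal` are hypothetical, words are modelled as Python strings, and the state set $S$ and alphabet $\Sigma$ are left implicit, so that only $A$, $I$ and $T$ are represented.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical encoding of (S, A, I, Sigma, T); S and Sigma stay implicit."""
    is_accepting: Callable[[Hashable], bool]    # membership test for A
    initial: Hashable                           # I
    step: Callable[[Hashable, str], Hashable]   # T


def accepts(machine: FSM, word: str) -> bool:
    """Run the machine over the word and report whether it ends in A."""
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


# Case 1: a single rejecting state that is never left.
F_EMPTY_SET = FSM(lambda s: False, "Rej", lambda s, sym: "Rej")

# Case 2: start in Acc; every symbol leads to Rej.
F_EMPTY_WORD = FSM(lambda s: s == "Acc", "Acc", lambda s, sym: "Rej")


def literal(sigma: str) -> FSM:
    """Case 3: only (Start, sigma) leads to Acc; everything else leads to Rej."""
    def step(state, symbol):
        return "Acc" if (state, symbol) == ("Start", sigma) else "Rej"
    return FSM(lambda s: s == "Acc", "Start", step)
```

For instance, `accepts(literal("a"), "a")` holds, while `accepts(literal("a"), "aa")` does not, since the second symbol moves the machine from $\mathsf{Acc}$ to $\mathsf{Rej}$.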

Case $4$
Let $R$ be a concatenation:
 * $R = R_1 R_2$

By the induction hypothesis, there exist finite state machines:


 * $F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right): L \left({F_1}\right) = L \left({R_1}\right)$
 * $F_2 = \left({ S_2, A_2, I_2, \Sigma, T_2 }\right): L \left({F_2}\right) = L \left({R_2}\right)$

Define a new finite state machine $F_c$ as:


 * $F_c = \left({ S_c, A_c, I_c, \Sigma, T_c }\right) $

where:
 * $S_c = S_1 \times \mathcal P \left({ S_2 }\right)$


 * $A_c = \left\{ { \left({ s_1, s_2 }\right) : s_1 \in S_1 \land s_2 \cap A_2 \neq \varnothing }\right\}$


 * $I_c = \begin{cases} \left({ I_1, \varnothing }\right) & : I_1 \notin A_1 \\ \left({ I_1, \left\{ {I_2} \right\} }\right) & : I_1 \in A_1 \end{cases}$


 * $\displaystyle T_c \left({ \left({ s_1, s_2 }\right), \sigma }\right) = \begin{cases} \left({ T_1 \left({ s_1, \sigma }\right), \bigcup_{s \in s_2} \left\{ { T_2 \left({ s, \sigma }\right) }\right\} }\right) & : T_1 \left({ s_1, \sigma }\right) \notin A_1 \\ \left({ T_1 \left({ s_1, \sigma }\right), \bigcup_{s \in s_2} \left\{ { T_2 \left({ s, \sigma }\right) }\right\} \cup \left\{ {I_2} \right\} }\right) & : T_1 \left({ s_1, \sigma }\right) \in A_1 \end{cases}$

where:
 * $\times$ denotes the Cartesian Product
 * $\mathcal P$ denotes the Power Set.

This machine $F_c$ effectively simulates one copy of $F_1$ and any number of copies of $F_2$.

Every time the simulated $F_1$ encounters an accepting state, a new copy of $F_2$ is run.

The combined $F_c$ is in an accepting state if any one of the simulated copies of $F_2$ is.

Therefore, the language accepted by $F_c$ is the concatenation of the accepted languages of $F_1$ and $F_2$, that is, $L \left({F_c}\right) = L \left({R}\right)$.
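
The construction of $F_c$ can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the names `FSM`, `accepts`, `literal` and `concat` are hypothetical, words are modelled as strings, and states are computed lazily instead of enumerating $S_1 \times \mathcal P \left({ S_2 }\right)$ up front. The second component of a state is a `frozenset` of states of $F_2$.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical encoding of (S, A, I, Sigma, T); S and Sigma stay implicit."""
    is_accepting: Callable[[Hashable], bool]    # membership test for A
    initial: Hashable                           # I
    step: Callable[[Hashable, str], Hashable]   # T


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal, used here only to build examples."""
    def step(state, symbol):
        return "Acc" if (state, symbol) == ("Start", sigma) else "Rej"
    return FSM(lambda s: s == "Acc", "Start", step)


def concat(f1: FSM, f2: FSM) -> FSM:
    """Sketch of F_c: one copy of F_1 plus a set of running copies of F_2."""
    def spawn(s1, copies):
        # Add I_2 exactly when the F_1 component is in an accepting state.
        if f1.is_accepting(s1):
            return (s1, copies | {f2.initial})
        return (s1, copies)

    def step(state, symbol):
        s1, copies = state
        moved = frozenset(f2.step(s, symbol) for s in copies)
        return spawn(f1.step(s1, symbol), moved)

    def is_accepting(state):
        # A_c: some simulated copy of F_2 is in an accepting state.
        return any(f2.is_accepting(s) for s in state[1])

    return FSM(is_accepting, spawn(f1.initial, frozenset()), step)
```

With `ab = concat(literal("a"), literal("b"))`, the word `"ab"` is accepted, while `"a"` and `"abb"` are rejected.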

Case $5$
Let $R$ be an alternation:
 * $R = R_1 \mid R_2$

By the induction hypothesis, there exist finite state machines:


 * $F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right): L \left({F_1}\right) = L \left({R_1}\right)$
 * $F_2 = \left({ S_2, A_2, I_2, \Sigma, T_2 }\right): L \left({F_2}\right) = L \left({R_2}\right)$

Define a new finite state machine $F_a$ as:


 * $F_a = \left({ S_a, A_a, I_a, \Sigma, T_a }\right)$

where:


 * $S_a = S_1 \times S_2$


 * $A_a = \left\{ { \left({ s_1, s_2 }\right) : s_1 \in A_1 \lor s_2 \in A_2 }\right\}$


 * $I_a = \left({ I_1, I_2 }\right)$


 * $T_a \left({ \left({ s_1, s_2 }\right), \sigma }\right) = \left({ T_1 \left({ s_1, \sigma }\right), T_2 \left({ s_2, \sigma }\right) }\right)$

where $\times$ denotes the Cartesian Product.

This machine $F_a$ effectively simulates $F_1$ and $F_2$ in parallel.

$F_a$ is in an accepting state if either of the simulated machines is.

Therefore, the language accepted by $F_a$ is the union of the accepted languages of $F_1$ and $F_2$, that is, $L \left({F_a}\right) = L \left({R_1}\right) \cup L \left({R_2}\right) = L \left({R}\right)$.
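
The product construction for $F_a$ can be sketched in Python as follows. The names `FSM`, `accepts`, `literal` and `alternate` are hypothetical and chosen only for this illustration; words are modelled as strings, and a state of $F_a$ is a pair of states.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical encoding of (S, A, I, Sigma, T); S and Sigma stay implicit."""
    is_accepting: Callable[[Hashable], bool]    # membership test for A
    initial: Hashable                           # I
    step: Callable[[Hashable, str], Hashable]   # T


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal, used here only to build examples."""
    def step(state, symbol):
        return "Acc" if (state, symbol) == ("Start", sigma) else "Rej"
    return FSM(lambda s: s == "Acc", "Start", step)


def alternate(f1: FSM, f2: FSM) -> FSM:
    """Sketch of F_a: run F_1 and F_2 in lockstep on pairs of states."""
    return FSM(
        # A_a: at least one component is accepting.
        lambda st: f1.is_accepting(st[0]) or f2.is_accepting(st[1]),
        # I_a: the pair of initial states.
        (f1.initial, f2.initial),
        # T_a: advance both components on the same symbol.
        lambda st, sym: (f1.step(st[0], sym), f2.step(st[1], sym)),
    )
```

With `a_or_b = alternate(literal("a"), literal("b"))`, both `"a"` and `"b"` are accepted, while the empty word and `"ab"` are rejected.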

Case $6$
Let $R$ be a Kleene star:
 * $R = R_1^*$

By the induction hypothesis, there exists a finite state machine:


 * $F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right): L \left({F_1}\right) = L \left({R_1}\right)$

Define a new finite state machine $F_k$ as:


 * $F_k = \left({ S_k, A_k, I_k, \Sigma, T_k }\right)$

where:


 * $S_k = \mathcal P \left({ S_1 }\right)$


 * $A_k = \left\{ { S \in S_k : I_1 \in S }\right\}$


 * $I_k = \left\{ {I_1} \right\} $


 * $\displaystyle T_k \left({ S, \sigma }\right) = \begin{cases} U_k \left({ S, \sigma }\right) & : U_k \left({ S, \sigma }\right) \cap A_1 = \varnothing \\ U_k \left({ S, \sigma }\right) \cup \left\{ {I_1} \right\} & : U_k \left({ S, \sigma }\right) \cap A_1 \neq \varnothing \end{cases}$


 * $U_k \left({ S, \sigma }\right) = \bigcup_{s \in S} \left\{ { T_1 \left({ s, \sigma }\right) }\right\}$

where $\mathcal P$ denotes the Power Set.

This machine $F_k$ effectively simulates any number of copies of $F_1$ simultaneously.

Every time any of the simulated machines reaches an accepting state, a new copy is run.

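$F_k$ is in an accepting state whenever $I_1$ is an element of its current state.

Provided that no transition of $F_1$ leads back to $I_1$, this occurs in exactly two situations:
 * at the beginning

and:
 * when any of the simulated machines reaches an accepting state.

This proviso loses no generality: if necessary, $F_1$ can be given a fresh initial state which copies the outgoing transitions and the accepting status of $I_1$, which leaves $L \left({F_1}\right)$ unchanged.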

Therefore, the language accepted by $F_k$ consists of all concatenations of finitely many (possibly zero) words accepted by $F_1$, that is, $L \left({F_k}\right) = L \left({R_1}\right)^* = L \left({R}\right)$.
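
The construction of $F_k$ can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `FSM`, `accepts`, `literal` and `star` are hypothetical, words are modelled as strings, subsets of $S_1$ are represented as `frozenset` values computed on demand, and the example machine is assumed to have an initial state that no transition re-enters.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical encoding of (S, A, I, Sigma, T); S and Sigma stay implicit."""
    is_accepting: Callable[[Hashable], bool]    # membership test for A
    initial: Hashable                           # I
    step: Callable[[Hashable, str], Hashable]   # T


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal; no transition leads back to Start."""
    def step(state, symbol):
        return "Acc" if (state, symbol) == ("Start", sigma) else "Rej"
    return FSM(lambda s: s == "Acc", "Start", step)


def star(f1: FSM) -> FSM:
    """Sketch of F_k; assumes no transition of f1 leads back to f1.initial."""
    def step(states, symbol):
        # U_k: advance every simulated copy of F_1 by one symbol.
        moved = frozenset(f1.step(s, symbol) for s in states)
        # T_k: start a fresh copy whenever some copy has just accepted.
        if any(f1.is_accepting(s) for s in moved):
            moved = moved | {f1.initial}
        return moved

    # A_k: the sets containing I_1; I_k: the singleton {I_1}.
    return FSM(lambda states: f1.initial in states,
               frozenset({f1.initial}),
               step)
```

With `a_star = star(literal("a"))`, the words `""`, `"a"` and `"aa"` are accepted, while `"b"` and `"ab"` are rejected.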

By structural induction, the result follows.