Regular Expression is Accepted by Finite State Machine

Theorem
Let $R$ be a regular expression.

Then there exists a finite state machine $F$ such that its accepted language $L\left({F}\right)$ is exactly $L\left({R}\right)$, the language defined by $R$.

Proof
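The proof proceeds by structural induction on $R$.

Throughout, $\Sigma$ denotes the alphabet of $R$, and a finite state machine is written as a tuple $\left({ S, A, I, \Sigma, T }\right)$ consisting of its set of states, set of accepting states, initial state, alphabet and transition function.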

Case 1. Assume $R$ is the empty-set regular expression, $\varnothing$.

Then $L\left({R}\right) = \varnothing$.

Consider the finite state machine $F_\varnothing$ defined as:


 * $ \displaystyle F_\varnothing = \left({ S_\varnothing, A_\varnothing, I_\varnothing, \Sigma, T_\varnothing }\right) $

where:


 * $ S_\varnothing = \left\{ { \mathsf{Rej} }\right\} $
 * $ A_\varnothing = \varnothing $
 * $ I_\varnothing = \mathsf{Rej} $
 * $ T_\varnothing \left({ s, \sigma }\right) = \mathsf{Rej} $ for all $ s \in S_\varnothing, \sigma \in \Sigma $

This machine is always in a rejecting state and never leaves it, so no word is in $L\left({ F_\varnothing }\right)$.

Therefore, $L\left({ F_\varnothing }\right) = \varnothing = L\left({R}\right)$.

Case 2. Assume $R$ is the empty-word regular expression, $\epsilon$.

Then $L\left({R}\right) = \left\{ { \left[\right] }\right\}$.

Consider the finite state machine $F_\epsilon$ defined as:


 * $ \displaystyle F_\epsilon = \left({ S_\epsilon, A_\epsilon, I_\epsilon, \Sigma, T_\epsilon }\right) $

where:


 * $ S_\epsilon = \left\{ { \mathsf{Acc}, \mathsf{Rej} }\right\} $
 * $ A_\epsilon = \left\{ { \mathsf{Acc} }\right\} $
 * $ I_\epsilon = \mathsf{Acc} $
 * $ T_\epsilon \left({ s, \sigma }\right) = \mathsf{Rej} $ for all $ s \in S_\epsilon, \sigma \in \Sigma $

This machine starts out in an accepting state, so $\left[\right]$ (the empty word) is in $L\left({ F_\epsilon }\right)$.

Furthermore, any symbol moves the machine to a rejecting state and never back, so no other word is in $L\left({ F_\epsilon }\right)$.

Therefore, $L\left({ F_\epsilon }\right) = \left\{ { \left[\right] }\right\} = L\left({R}\right)$.

Case 3. Assume $R$ is a literal $\sigma$.

Then $L\left({R}\right) = \left\{ { \left[{\sigma}\right] }\right\}$.

Consider the finite state machine $F_\sigma$ defined as:


 * $ \displaystyle F_\sigma = \left({ S_\sigma, A_\sigma, I_\sigma, \Sigma, T_\sigma }\right) $

where:


 * $ S_\sigma = \left\{ { \mathsf{Start}, \mathsf{Acc}, \mathsf{Rej} }\right\} $
 * $ A_\sigma = \left\{ { \mathsf{Acc} }\right\} $
 * $ I_\sigma = \mathsf{Start} $
 * $ T_\sigma \left({ \mathsf{Start}, \sigma }\right) = \mathsf{Acc} $
 * $ T_\sigma \left({ s', \sigma' }\right) = \mathsf{Rej} $ for all other pairs $ \left({ s', \sigma' }\right) \in S_\sigma \times \Sigma $

This machine starts out in the non-accepting state $\mathsf{Start}$, so $\left[\right]$ (the empty word) is not in $L\left({ F_\sigma }\right)$.

After receiving the symbol $\sigma$ at the start, this machine moves to an accepting state, so $\left[{\sigma}\right]$ is in $L\left({ F_\sigma }\right)$.

Any other initial symbol, and any symbol after the initial, moves the machine to a rejecting state and never back, so no other word is in $L\left({ F_\sigma }\right)$.

Therefore, $L\left({ F_\sigma }\right) = \left\{ { \left[{\sigma}\right] }\right\} = L\left({R}\right)$.
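The three base machines above can be sketched in Python as follows. This is only an illustrative sketch: the names `FSM`, `accepts`, `F_EMPTY_SET`, `F_EMPTY_WORD` and `literal` are assumptions introduced here and do not appear in the proof. A machine is represented lazily by its initial state, an acceptance test and a transition function, and words are ordinary strings over $\Sigma$.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical lazy encoding of (S, A, I, Sigma, T)."""
    initial: Hashable                            # I
    is_accepting: Callable[[Hashable], bool]     # membership test for A
    step: Callable[[Hashable, str], Hashable]    # T


def accepts(machine: FSM, word: str) -> bool:
    """Run the machine over the word and report whether it ends in A."""
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


# Case 1: a single state Rej, no accepting states.
F_EMPTY_SET = FSM("Rej", lambda s: False, lambda s, a: "Rej")

# Case 2: starts in Acc, every symbol leads to Rej.
F_EMPTY_WORD = FSM("Acc", lambda s: s == "Acc", lambda s, a: "Rej")


def literal(sigma: str) -> FSM:
    """Case 3: only (Start, sigma) leads to Acc; every other pair leads to Rej."""
    def step(s, a):
        return "Acc" if (s == "Start" and a == sigma) else "Rej"
    return FSM("Start", lambda s: s == "Acc", step)


print(accepts(F_EMPTY_WORD, ""), accepts(literal("a"), "a"), accepts(literal("a"), "aa"))
# True True False
```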

Case 4. Assume $R$ is a concatenation, $R_1 R_2$.

By the induction hypothesis, there exist finite state machines


 * $ \displaystyle F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right) $ s.t. $ \displaystyle L\left({F_1}\right) = L\left({R_1}\right) $
 * $ \displaystyle F_2 = \left({ S_2, A_2, I_2, \Sigma, T_2 }\right) $ s.t. $ \displaystyle L\left({F_2}\right) = L\left({R_2}\right) $

Define a new finite state machine $F_c$ as:


 * $ \displaystyle F_c = \left({ S_c, A_c, I_c, \Sigma, T_c }\right) $

where:


 * $ S_c = S_1 \times \mathcal{P} \left({ S_2 }\right) $ where $\times$ denotes the Cartesian Product and $\mathcal{P}$ the Power Set
 * $ A_c = \left\{ { \left({ s_1, s_2 }\right) : s_1 \in S_1 \land s_2 \cap A_2 \neq \varnothing }\right\} $
 * $ I_c = \begin{cases} \left({ I_1, \varnothing }\right) & \mbox{if } I_1 \notin A_1 \\ \left({ I_1, \left\{ {I_2} \right\} }\right) & \mbox{if } I_1 \in A_1 \end{cases} $
 * $ \displaystyle T_c \left({ \left({ s_1, s_2 }\right), \sigma }\right) = \begin{cases} \left({ T_1 \left({ s_1, \sigma }\right), \bigcup_{s \in s_2} \left\{ { T_2 \left({ s, \sigma }\right) }\right\} }\right) & \mbox{if } T_1 \left({ s_1, \sigma }\right) \notin A_1 \\ \left({ T_1 \left({ s_1, \sigma }\right), \bigcup_{s \in s_2} \left\{ { T_2 \left({ s, \sigma }\right) }\right\} \cup \left\{ {I_2} \right\} }\right) & \mbox{if } T_1 \left({ s_1, \sigma }\right) \in A_1 \end{cases} $

This machine $F_c$ effectively simulates one copy of $F_1$ and any number of copies of $F_2$. Every time the simulated $F_1$ enters an accepting state, a new copy of $F_2$ is started. The combined $F_c$ is in an accepting state if and only if at least one of the simulated copies of $F_2$ is.

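Therefore, $F_c$ accepts a word if and only if that word is a word accepted by $F_1$ followed by a word accepted by $F_2$. That is, the language accepted by $F_c$ is the concatenation of the accepted languages of $F_1$ and $F_2$, which is $L\left({R}\right)$.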
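The construction of $F_c$ can be sketched in Python as follows. The encoding is an assumption made for illustration only: `FSM`, `accepts`, `literal` and `concat` are hypothetical names, and machines are given lazily so that the state space $S_1 \times \mathcal{P} \left({ S_2 }\right)$ never has to be enumerated. The second component of a state is a `frozenset` of states of $F_2$.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical lazy encoding of (S, A, I, Sigma, T)."""
    initial: Hashable
    is_accepting: Callable[[Hashable], bool]
    step: Callable[[Hashable, str], Hashable]


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal, used here only to build examples."""
    def step(s, a):
        return "Acc" if (s == "Start" and a == sigma) else "Rej"
    return FSM("Start", lambda s: s == "Acc", step)


def concat(f1: FSM, f2: FSM) -> FSM:
    """Sketch of F_c: one copy of F_1 plus a set of running copies of F_2."""
    def spawn(s1, copies):
        # Add a fresh copy of F_2 exactly when F_1 is in an accepting state.
        return (s1, copies | {f2.initial}) if f1.is_accepting(s1) else (s1, copies)

    def is_accepting(state):
        _, copies = state
        return any(f2.is_accepting(s) for s in copies)

    def step(state, a):
        s1, copies = state
        moved = frozenset(f2.step(s, a) for s in copies)  # advance every copy of F_2
        return spawn(f1.step(s1, a), moved)

    return FSM(spawn(f1.initial, frozenset()), is_accepting, step)


ab = concat(literal("a"), literal("b"))
print(accepts(ab, "ab"), accepts(ab, "abb"))  # True False
```

The helper `spawn` covers both the definition of $I_c$ and the case split in $T_c$, since each adds $I_2$ under the same condition on the first component.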

Case 5. Assume $R$ is an alternation, $R_1 \mid R_2$.

By the induction hypothesis, there exist finite state machines


 * $ \displaystyle F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right) $ s.t. $ \displaystyle L\left({F_1}\right) = L\left({R_1}\right) $
 * $ \displaystyle F_2 = \left({ S_2, A_2, I_2, \Sigma, T_2 }\right) $ s.t. $ \displaystyle L\left({F_2}\right) = L\left({R_2}\right) $

Define a new finite state machine $F_a$ as:


 * $ \displaystyle F_a = \left({ S_a, A_a, I_a, \Sigma, T_a }\right) $

where:


 * $ S_a = S_1 \times S_2 $ where $\times$ denotes the Cartesian Product
 * $ A_a = \left\{ { \left({ s_1, s_2 }\right) : s_1 \in A_1 \lor s_2 \in A_2 }\right\} $
 * $ I_a = \left({ I_1, I_2 }\right) $
 * $ T_a \left({ \left({ s_1, s_2 }\right), \sigma }\right) = \left({ T_1 \left({ s_1, \sigma }\right), T_2 \left({ s_2, \sigma }\right) }\right) $

This machine $F_a$ effectively simulates $F_1$ and $F_2$ in parallel. $F_a$ is in an accepting state if and only if at least one of the simulated machines is.

Therefore, the language accepted by $F_a$ is the union of the accepted languages of $F_1$ and $F_2$, which is $L\left({R}\right)$.
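The product construction for $F_a$ can be sketched in Python as follows. As before, this is an illustration under assumed names (`FSM`, `accepts`, `literal`, `alternate`), none of which come from the proof itself. A state of $F_a$ is simply a pair of states.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical lazy encoding of (S, A, I, Sigma, T)."""
    initial: Hashable
    is_accepting: Callable[[Hashable], bool]
    step: Callable[[Hashable, str], Hashable]


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal, used here only to build examples."""
    def step(s, a):
        return "Acc" if (s == "Start" and a == sigma) else "Rej"
    return FSM("Start", lambda s: s == "Acc", step)


def alternate(f1: FSM, f2: FSM) -> FSM:
    """Sketch of F_a: run F_1 and F_2 in lockstep on the same input."""
    return FSM(
        (f1.initial, f2.initial),                                  # I_a
        lambda s: f1.is_accepting(s[0]) or f2.is_accepting(s[1]),  # A_a
        lambda s, a: (f1.step(s[0], a), f2.step(s[1], a)),         # T_a
    )


a_or_b = alternate(literal("a"), literal("b"))
print(accepts(a_or_b, "a"), accepts(a_or_b, "b"), accepts(a_or_b, "ab"))
# True True False
```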

Case 6. Assume $R$ is a Kleene star, $ R_1^* $.

By the induction hypothesis, there exists a finite state machine


 * $ \displaystyle F_1 = \left({ S_1, A_1, I_1, \Sigma, T_1 }\right) $ s.t. $ \displaystyle L\left({F_1}\right) = L\left({R_1}\right) $

Define a new finite state machine $F_k$ as:


 * $ \displaystyle F_k = \left({ S_k, A_k, I_k, \Sigma, T_k }\right) $

where:


 * $ S_k = \mathcal{P} \left({ S_1 }\right) $ where $\mathcal{P}$ denotes the Power Set
 * $ A_k = \left\{ { S \in S_k : I_1 \in S }\right\} $
 * $ I_k = \left\{ {I_1} \right\} $
 * $ \displaystyle T_k \left({ S, \sigma }\right) = \begin{cases} U_k \left({ S, \sigma }\right) & \mbox{if } U_k \left({ S, \sigma }\right) \cap A_1 = \varnothing \\ U_k \left({ S, \sigma }\right) \cup \left\{ {I_1} \right\}  & \mbox{if } U_k \left({ S, \sigma }\right) \cap A_1 \neq \varnothing \end{cases} $
 * $ \displaystyle U_k \left({ S, \sigma }\right) = \bigcup_{s \in S} \left\{ { T_1 \left({ s, \sigma }\right) }\right\} $

This machine $F_k$ effectively simulates any number of copies of $F_1$ simultaneously. Every time any of the simulated machines reaches an accepting state, a new copy is started.

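$F_k$ is in an accepting state exactly when $I_1$ is a member of its current state. Provided that no transition of $F_1$ leads back into $I_1$ (which can always be arranged, without changing $L\left({F_1}\right)$, by giving $F_1$ a fresh initial state), this occurs in exactly two situations:
 * at the beginning; and
 * when any of the simulated machines reaches an accepting state.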

Therefore, the language accepted by $F_k$ consists of all concatenations of finitely many (possibly zero) words accepted by $F_1$, that is, $L\left({F_k}\right) = L\left({F_1}\right)^* = L\left({R}\right)$.
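The construction of $F_k$ can be sketched in Python as follows. The names `FSM`, `accepts`, `literal` and `star` are hypothetical and chosen for illustration. The sketch also assumes that no transition of $F_1$ re-enters $I_1$, as is the case for the literal machine used in the example; without that assumption $I_1$ could appear in the current state without any simulated copy having reached an accepting state.

```python
from typing import Callable, Hashable, NamedTuple


class FSM(NamedTuple):
    """Hypothetical lazy encoding of (S, A, I, Sigma, T)."""
    initial: Hashable
    is_accepting: Callable[[Hashable], bool]
    step: Callable[[Hashable, str], Hashable]


def accepts(machine: FSM, word: str) -> bool:
    state = machine.initial
    for symbol in word:
        state = machine.step(state, symbol)
    return machine.is_accepting(state)


def literal(sigma: str) -> FSM:
    """Machine for a single literal; Start is never re-entered."""
    def step(s, a):
        return "Acc" if (s == "Start" and a == sigma) else "Rej"
    return FSM("Start", lambda s: s == "Acc", step)


def star(f1: FSM) -> FSM:
    """Sketch of F_k; assumes no transition of f1 leads back to f1.initial."""
    def step(current, a):
        # U_k: advance every simulated copy of F_1 by one symbol.
        moved = frozenset(f1.step(s, a) for s in current)
        # T_k: start a new copy whenever some copy has just accepted.
        if any(f1.is_accepting(s) for s in moved):
            return moved | {f1.initial}
        return moved

    return FSM(
        frozenset({f1.initial}),           # I_k
        lambda current: f1.initial in current,  # A_k
        step,
    )


a_star = star(literal("a"))
print(accepts(a_star, ""), accepts(a_star, "aaa"), accepts(a_star, "ab"))
# True True False
```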

By structural induction, the result follows.