Regular Expression is Accepted by Finite State Machine

Theorem
Let $R$ be a regular expression.

Then there exists a finite state machine $F$ such that its accepted language $\map L F$ is exactly $\map L F$, the language defined by $R$.

Proof
This proof proceeds by structural induction.

Case $1$
Let $R$ be the empty-set regular expression, $\O$.

Then:
 * $\map L R = \O$

Consider the finite state machine $F_\O$ defined as:


 * $F_\O = \struct {S_\O, A_\O, I_\O, \Sigma, T_\O}$

where:


 * $S_\O = \set {\mathsf {Rej} }$
 * $A_\O = \O$
 * $I_\O = \mathsf {Rej}$
 * $T_\O \tuple {s, \sigma} = \mathsf {Rej}$ for all $s \in S_\O, \sigma \in \Sigma$

This machine is always in a rejecting state and never leaves it.

So no word is in $\map L {F_\O}$.

Therefore:
 * $\map L {F_\O} = \O = \map L R$

Case $2$
Let $R$ be the empty-word regular expression, $\epsilon$.

Then:
 * $\map L R = \set {\sqbrk {} }$

Consider the finite state machine $F_\epsilon$ defined as:


 * $F_\epsilon = \struct {S_\epsilon, A_\epsilon, I_\epsilon, \Sigma, T_\epsilon}$

where:


 * $S_\epsilon = \set {\mathsf {Acc}, \mathsf {Rej} }$
 * $A_\epsilon = \set {\mathsf{Acc} }$
 * $I_\epsilon = \mathsf {Acc}$
 * $\map {T_\epsilon} {s, \sigma} = \mathsf {Rej}$ for all $s \in S_\epsilon, \sigma \in \Sigma$

This machine starts out in an accepting state.

So $\sqbrk {}$ (the empty word is in $\map L {F_\epsilon}$.

Furthermore, any symbol moves the machine to a rejecting state and never back.

So no other word is in $\map L {F_\epsilon}$.

Therefore:
 * $\map L {F_\epsilon} = \set {\sqbrk {} } = \map L R$

Case $3$
Let $R$ be a literal $\sigma$.

Then:
 * $\map L R = \set {\sqbrk \sigma}$

Consider the finite state machine $F_\sigma$ defined as:


 * $F_\sigma = \struct {S_\sigma, A_\sigma, I_\sigma, \Sigma, T_\sigma}$

where:


 * $S_\sigma = \set {\mathsf {Start}, \mathsf {Acc}, \mathsf {Rej} }$
 * $A_\sigma = \set {\mathsf {Acc} }$
 * $I_\sigma = \mathsf {Start}$
 * $\map {T_\sigma} {\mathsf {Start}, \sigma} = \mathsf {Acc}$
 * $\map {T_\sigma} {s', \sigma'} = \mathsf {Rej}$ for all other $s' \in S_\sigma, \sigma' \in \Sigma$

This machine starts out in a rejecting state.

So $\sqbrk {}$ (the empty word) is not in $\map L {F_\sigma}$.

After receiving the symbol $\sigma$ at the start, this machine moves to an accepting state.

So $\sqbrk \sigma$ is in $\map L {F_\sigma}$.

Any other initial symbol, and any symbol after the initial, moves the machine to a rejecting state and never back.

So no other word is in $\map L {F_\sigma}$.

Therefore:
 * $\map L {F_\sigma} = \set {\sqbrk \sigma} = \map L R$

Case $4$
Let $R$ be a concatenation:
 * $R = R_1 R_2$

By the induction hypothesis, there exist finite state machines:


 * $F_1 = \struct {S_1, A_1, I_1, \Sigma, T_1}: \map L {F_1} = \map L {R_1}$
 * $F_2 = \struct {S_2, A_2, I_2, \Sigma, T_2}: \map L {F_2} = \map L {R_2}$

Define a new finite state machine $F_c$ as:


 * $F_c = \struct {S_c, A_c, I_c, \Sigma, T_c}$

where:
 * $S_c = S_1 \times \powerset {S_2}$


 * $A_c = \set {\tuple {s_1, s_2}: s_1 \in S_1 \land s_2 \cap A_2 \ne \O}$


 * $I_c = \begin {cases} \tuple {I_1, \O} & : I_1 \notin A_1 \\ \tuple {I_1, \set {I_2} } & : I_1 \in A_1 \end {cases}$


 * $\map {T_c} {\tuple {s_1, s_2}, \sigma} = \begin {cases}

\ds \tuple {\map {T_1} {s_1, \sigma}, \bigcup_{s \mathop \in s_2} \set {\map {T_2} {s, \sigma} } } & : \map {T_1} {s_1, \sigma} \notin A_1 \\ \ds \tuple {\map {T_1} {s_1, \sigma}, \bigcup_{s \mathop \in s_2} \set {\map {T_2} {s, \sigma} } \cup \set {I_2} } & : \map {T_1} {s_1, \sigma} \in A_1 \end {cases}$

where:
 * $\times$ denotes the Cartesian Product
 * $\PP$ the Power Set.

This machine $F_c$ effectively simulates one copy of $F_1$ and any number of copies of $F_2$.

Every time the simulated $F_1$ encounters an accepting state, a new copy of $F_2$ is run.

The combined $F_c$ reaches an accepting state if any one of the simulated $F_2$s do.

Therefore, the language accepted by this state machine is the concatenation of the accepted languages of $F_1$ and $F_2$.

Case $5$
Let $R$ be an alternation:
 * $R = R_1 \mid R_2$

By the induction hypothesis, there exist finite state machines:


 * $F_1 = \struct {S_1, A_1, I_1, \Sigma, T_1}: \map L {F_1} = \map L {R_1}$
 * $F_2 = \struct {S_2, A_2, I_2, \Sigma, T_2}: \map L {F_2} = \map L {R_2}$

Define a new finite state machine $F_a$ as:


 * $F_a = \struct {S_a, A_a, I_a, \Sigma, T_a}$

where:


 * $S_a = S_1 \times S_2$


 * $A_a = \set {\tuple {s_1, s_2}: s_1 \in A_1 \lor s_2 \in A_2 }$


 * $I_a = \tuple {I_1, I_2}$


 * $\map {T_a} {\tuple {s_1, s_2}, \sigma} = \tuple {\map {T_1} {s_1, \sigma}, \map {T_2} {s_2, \sigma} }$

where $\times$ denotes the Cartesian Product.

This machine $F_a$ effectively simulates $F_1$ and $F_2$ in parallel.

$F_a$ reaches an accepting state if any one of the simulated machines do.

Therefore, the language accepted by this state machine is the union of the accepted languages of $F_1$ and $F_2$.

Case $6$
Let $R$ be a Kleene star:
 * $R = R_1^*$

By the induction hypothesis, there exists a finite state machine:


 * $F_1 = \struct {S_1, A_1, I_1, \Sigma, T_1}: \map L {F_1} = \map L {R_1}$

Define a new finite state machine $F_k$ as:


 * $F_k = \struct {S_k, A_k, I_k, \Sigma, T_k}$

where:


 * $S_k = \powerset {S_1}$


 * $A_k = \set {S \subseteq S_k : I_1 \in S}$


 * $I_k = \set {I_1}$


 * $\map {T_k} {s, \sigma} = \begin {cases}

\map {U_k} {S, \sigma} & : \map {U_k} {S, \sigma} \cap A_1 = \O \\ \map {U_k} {S, \sigma} \cup \set {I_1} & : \map {U_k} {S, \sigma} \cap A_1 \ne \O \end {cases}$


 * $\ds \map {U_k} {S, \sigma} = \bigcup_{s \mathop \in S} \set {\map {T_1} {s, \sigma} }$

where $\PP$ denotes the Power Set.

This machine $F_k$ effectively simulates any number of copies of $F_1$ simultaneously.

Every time any of the simulated machines reaches an accepting state, a new copy is run.

$F_k$ reaches an accepting state whenever $I_1$ is in its state.

This occurs in two situations:
 * at the beginning

and:
 * when any of the simulated machines reaches an accepting state.

Therefore, the language accepted by $F_k$ consists of arbitrary numbers of concatenations of strings accepted by $F_1$.

By structural induction, the result follows.