# Book:Bruce W. Watson/Taxonomies and Toolkits of Regular Language Algorithms

## Bruce W. Watson: Taxonomies and Toolkits of Regular Language Algorithms

Published $\text {1995}$, Eindhoven University of Technology

ISBN 90-386-0396-7

### Contents

Acknowledgements
I Prologue
1 Introduction
1.1 Problem statement
1.2 Structure of this dissertation
1.3 Intended audience
2 Mathematical preliminaries
2.1 Notations and conventions
2.2 Basic definitions
2.3 Strings and languages
2.4 Regular expressions
2.5 Trees
2.6 Finite automata and Moore machines
2.6.1 Definitions and properties involving $FA$s and $MM$s
2.6.2 Transformations on finite automata
2.6.2.1 Imperative implementations of some transformations
II The taxonomies
3 Constructing taxonomies
4 Keyword pattern matching algorithms
4.1 Introduction and related work
4.2 The problem and some naive solutions
4.2.1 The $(P_+)$ algorithms
4.2.1.1 The $(P_+, S_+)$ algorithm and its improvement
4.2.1.2 The $(P_+, S_-)$ algorithm
4.2.2 The $(S_-)$ algorithms
4.2.2.1 The $(S_-, P_+)$ algorithms
4.2.2.2 The $(S_-, P_-)$ algorithm
4.3 The Aho-Corasick algorithms
4.3.1 Algorithm detail $(AC)$
4.3.2 Method $(AC-OPT)$
4.3.3 A Moore machine approach to the $(AC-OPT)$ algorithm
4.3.4 Linear search
4.3.5 The Aho-Corasick failure function algorithm
4.3.6 The Knuth-Morris-Pratt algorithm
4.3.7 An alternative derivation of Moore machine $M_0$
4.4 The Commentz-Walter algorithms
4.4.1 Safe shift distances and predicate weakening
4.4.1.1 General weakening strategies
4.4.1.2 The $l = \varepsilon$ and the no-lookahead cases
4.4.2 A shift function without further weakening
4.4.3 Towards the CW and BM algorithms
4.4.4 A more easily precomputed shift function
4.4.5 The standard Commentz-Walter algorithm
4.4.6 A derivation of the Boyer-Moore algorithm
4.4.7 A weakened Boyer-Moore algorithm
4.4.8 Using the right lookahead symbol
4.5 The Boyer-Moore family of algorithms
4.5.1 Larger shifts without using match information
4.5.2 Making use of match information
4.6 Conclusions
5 A new RE pattern matching algorithm
5.1 Introduction
5.2 Problem specification and a simple first algorithm
5.2.1 A more practical algorithm using a finite automaton
5.3 Greater shift distances
5.3.1 A more efficient algorithm by computing a greater shift
5.3.2 Deriving a practical range predicate
5.4 Precomputation
5.4.1 Characterising the domains of functions $d_1$ and $d_2$
5.4.2 Precomputing function $t$
5.4.3 Precomputing functions $d_1$ and $d_2$
5.4.4 Precomputing function $f_r$
5.4.5 Precomputing sets $L_q$
5.4.6 Precomputing function $emm$
5.4.7 Precomputing function $st$ and languages $L'$ and $suff(L')$
5.4.8 Precomputing relation $Reach(M)$
5.4.9 Combining the precomputation algorithms
5.5 Specializing the pattern matching algorithm
5.6 The performance of the algorithm
5.7 Improving the algorithm
5.8 Conclusions
6 $FA$ construction algorithms
6.1 Introduction and related work
6.2 Items and an alternative definition of $RE$
6.3 A canonical construction
6.4 $\varepsilon$-free constructions
6.4.1 Filters
6.5 Encoding Construction $(REM-\varepsilon)$
6.5.1 Using begin-markers
6.5.2 Composing the subset construction
6.6 An encoding using derivatives
6.7 The dual constructions
6.8 Precomputing the auxiliary sets and Null
6.9 Constructions as imperative programs
6.9.1 The item set constructions
6.9.2 The Symnodes constructions
6.9.3 A dual construction
6.10 Conclusions
7 $DFA$ minimisation algorithms
71 Introduction
7.2 An algorithm due to Brzozowski
7.3 Minimization by equivalence of states
7.3.1 The equivalence relation on states
7.3.2 Distinguishability
7.3.3 An upperbound on the number of approximation steps
7.3.4 Characterizing the equivalence classes of $E$
7.4 Algorithms computing $E$, $D$, or $[Q]_e$
7.4.1 Computing $D$ and $E$ by layerwise approximations
7.4.2 Computing $D$, $E$, and $[Q]_e$ by unordered approximation
7.4.3 More efficiently computing $D$ and $E$ by unordered approximation
7.4.4 An algorithm due to Hopcroft and Ullman
7.4.5 Hopcroft's algorithm to compute $[Q]_e$ efficiently
7.4.6 Computing $(p, q) \in E$
7.4.7 Computing $E$ by approximation from below
7.5 Conclusions
III The implementations
8 Designing and implementing class libraries
8.1 Motivations for writing class libraries
8.2 Code sharing
8.2.1 Base classes versus templates
8.2.2 Composition versus protected inheritance
8.3 Coding conventions and performance issues
8.3.1 Performance tuning
8.4 Presentation conventions
9 SPARE Parts: String PAttern REcognition in C++
9.1 Introduction and related work
9.2 Using the toolkit
9.2.2 Alternative alphabets
9.3 Abstract pattern matchers
9.4 Concrete pattern matchers
9.4.1 The brute-force pattern matchers
9.4.2 The KMP pattern matcher
9.4.3 The AC pattern matchers
9.4.3.1 AC transition machines and auxiliary classes
9.4.4 The CW pattern matchers
9.4.4.1 Safe shifters and auxiliary functions
9.4.5 The BM pattern matchers
9.4.5.1 Safe shifters and auxiliary functions
9.4.6 Summary of user classes
9.5 Foundation classes
9.5.1 Miscellaneous
9.5.2 Arrays, sets, and maps
9.5.3 Tries and failure functions
9.6 Experiences and conclusions
9.7 Obtaining and compiling the toolkit
10 FIRE Lite: $FA$s and $RE$s in C++
10.1 Introduction and related work
10.1.1 Related toolkits
10.1.2 Advantages and characteristics of FIRE Lite
10.1.3 Future directions for the toolkit
10.2 Using the toolkit
10.3 The stricture of FIRE Lite
10.4 $RE$s and abstract $FA$s
10.5 Concrete $FA$s
10.5.1 Finite automata
10.5.2 $\varepsilon$-free finite automata
10.5.3 Deterministic finite automata
10.5.4 Abstract-states classes
10.5.5 Summary of concrete automata classes
10.6 Foundation classes
10.6.1 Character ranges
10.6.2 States, positions, nodes, and maps
10.6.3 Bit vectors and sets
10.6.4 Transitions
10.6.5 Relations
10.7 Experiences and conclusions
10.8 Obtaining and compiling FIRE Lite
11 DFA minimisation algorithms in FIRE Lite
11.1 Introduction
11.2 The algorithms
11.2.1 Brzozowski's algorithm
11.2.2 Equivalence relation algorithms
11.3 Foundation classes
11.4 Conclusions
IV The performance of the algorithms
12 Measuring the performance of algorithms
13 The performance of pattern matchers
13.1 Introduction and related work
13.2 The algorithms
13.3 Testing methodology
13.3.1 Test environment
13.3.2 Natural language test data
13.3.3 DNA sequence test data
13.4 results
13.4.1 Performance versus keyword set size
13.4.2 Performance versus minimum keyword length
13.4.3 Single-keywords
13.5 Conclusions and recommendations
14 The performance of FA construction algorithms
14.1 Introduction
14.2 The algorithms
14.3 Testing methodology
14.3.1 Test environment
14.3.2 Generating regular expressions
14.3.3 Generating input strings
14.1 Results
14.4.1 Construction times
14.4.2 Constructed automaton sizes
14.4.3 Single transition performance
14.5 Conclusions and recommendations
15 The performance of DFA minimization algorithms
15.1 Introduction
15.2 The algorithms
15.3 Testing methodology
15.4 Results
15.5 Conclusions and recommendations
V Epilogue
16 Conclusions
16.1 General conclusions
16.2 Chapter-specific conclusions
16.3 A personal perspective:
17 Challenges and open problems
References
Index
Summary
Samenvatting (Dutch summary)
Curriculum Vitae