Regular Expressions Introduction
About 933 wordsAbout 12 min
2025-08-07
Basic Definitions
Alphabet and Regular Expressions
- Alphabet (Σ): A finite set of symbols
- Regular Expression: Built from alphabet symbols using:
- Union (+): r1+r2 matches either r1 or r2
- Concatenation (⋅): r1⋅r2 matches r1 followed by r2
- Kleene Star (∗): r∗ matches zero or more repetitions of r
Formal Definition
A regular expression over alphabet Σ is defined recursively:
Base Cases:
- ∅ is a regular expression denoting the empty language
- ϵ is a regular expression denoting {ϵ}
- For each a∈Σ, a is a regular expression denoting {a}
Inductive Cases:
- If r1 and r2 are regular expressions:
- (r1+r2) is a regular expression
- (r1⋅r2) is a regular expression
- (r1∗) is a regular expression
- If r1 and r2 are regular expressions:
Precedence Rules
When parentheses are omitted, the following precedence applies (highest to lowest):
- Kleene Star (∗)
- Concatenation (⋅)
- Union (+)
Tips
Always use parentheses to clarify precedence when combining operators
Examples
Simple Regular Expressions
| Regular Expression | Language Description | Example Strings |
|---|---|---|
| a | Single 'a' | "a" |
| a+b | 'a' or 'b' | "a", "b" |
| a⋅b | 'a' followed by 'b' | "ab" |
| a∗ | Zero or more 'a's | "", "a", "aa", "aaa", ... |
| (a+b)∗ | Any combination of 'a' and 'b' | "", "a", "b", "aa", "ab", "ba", "bb", ... |
More Complex Examples
Regular Expression: (a+b)∗⋅a⋅(a+b)∗
This describes all strings over {a,b} that contain at least one 'a'.
- (a+b)∗ matches any prefix (including empty)
- a matches exactly one 'a'
- (a+b)∗ matches any suffix (including empty)
Fundamental Properties
Union Properties
- Commutative: r1+r2=r2+r1
- Associative: (r1+r2)+r3=r1+(r2+r3)
- Identity: r+∅=r
Concatenation Properties
- Associative: (r1⋅r2)⋅r3=r1⋅(r2⋅r3)
- Identity: r⋅ϵ=ϵ⋅r=r
- Annihilator: r⋅∅=∅⋅r=∅
Kleene Star Properties
- Basic: ∅∗=ϵ∗=ϵ
- Iteration: (r∗)∗=r∗
- Closure: ϵ+r⋅r∗=r∗
Language Operations
Important Language Classes
Finite Languages: Can be expressed using union
- L={w1,w2,...,wn}=w1+w2+...+wn
Infinite Languages: Require Kleene star
- L={an∣n≥0}=a∗
- L={an∣n≥1}=a⋅a∗
Concatenation of Languages
- If L1={ab,a} and L2={b,bc}
- Then L1⋅L2={abb,abbc,ab,abc}
Practical Considerations
Shorthand Notation
In practice, we often use:
- ab instead of a⋅b
- r+ for one or more repetitions: r+=r⋅r∗
- r? for optional elements: r?=ϵ+r
Common Patterns
| Pattern | Meaning | Example |
|---|---|---|
| a∗ | Zero or more 'a's | "", "a", "aa", ... |
| a+ | One or more 'a's | "a", "aa", "aaa", ... |
| a? | Zero or one 'a' | "", "a" |
| an | Exactly n 'a's | "a" (when n=1), "aa" (when n=2) |
Summary
Regular expressions provide a powerful and concise way to describe patterns in strings. By combining the three fundamental operations (union, concatenation, and Kleene star), we can express both finite and infinite languages with remarkable precision.
The key insight is that regular expressions correspond to the class of regular languages, which can be recognized by finite automata, forming an important equivalence in automata theory.
Kleene’s Theorem (overview)
Kleene’s Theorem states that a language is regular iff it is recognized by some finite automaton (DFA/NFA) iff it can be described by a regular expression. We use this equivalence throughout (regex→NFA via Thompson, NFA→DFA via subset construction, DFA→regex via GNFA state elimination).
Problem-Solving Playbook
Clarify the alphabet and constraints
- List what is allowed/forbidden.
- Identify must-have substrings or positional constraints.
Choose building blocks
- Fixed pieces (literals), choices (union), repetition (Kleene star/plus), and optional parts.
Compose with precedence in mind
- Use parentheses to group; remember: Star > Concatenation > Union.
Sanity-check with examples and non-examples
- Generate a few strings that should match and should not match.
Simplify algebraically
- Apply idempotence, identities, and distributive laws to get a cleaner expression.
Worked Examples
Goal: Strings over {a,b} that end with ab or ba.
- Build endings:
ab + ba - Allow any prefix:
(a + b)^*
Final: (a + b)^* (ab + ba)
Examples: ab, ba, aab, baba, abba
Non-examples: a, b, aa, bbb
No further algebraic simplification changes the language meaningfully here. Keep as (a + b)^* (ab + ba).
Parse tree clarifies precedence; postfix (reverse Polish) simplifies Thompson’s construction.
Example: (a + b) a^* → postfix ab+ a* ·.
Idea: any prefix, then an a, then any suffix.
Regex: (a + b)^* a (a + b)^*
Pattern of pairs of a possibly separated by b’s: (b^* a b^* a)^* b^*
Allow any number of a’s between the three b’s: a^* b a^* b a^* b a^*
Mini Exercises (with hints)
Contains substringaba and does not end with bb
Hint: Use (a + b)^* aba (a + b)^* and restrict the tail with union excluding bb.
All strings over {0,1} with no11 as a substring
Hint: Think of runs of 0s separated by single 1s: 0^* (10^+ )^* 1?
Strings over {a,b} where everyb is immediately followed by a
Hint: Treat ba as an atomic unit. Also allow standalone a.
Tips
When designing a regex, write 3–5 positive and 3–5 negative examples first. This quickly reveals missing edge cases.
Changelog
87c17-web-deploy(Auto): Update base URL for web-pages branchon
Copyright
Copyright Ownership:WARREN Y.F. LONG
License under:Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND-4.0)