5 Regular Expressions
Concatenating two strings
Extracting substrings
Searching for substrings and replacing them
Names like Steve, Stephen, Steven, Stefan, or Esteve
Words containing an even number of vowels
For these kinds of tasks, we need regular expressions.
5.1 What Are Regular Expressions?
Searching and matching text
Formally describing strings
Defining the symbols (lexemes or tokens) that make up a language
Regular expressions are widely supported in modern programming languages and tools. They were popularized—and optimized for performance—as a language feature in Perl, and they are grounded in elegant theoretical foundations, which we will explore in a future lecture.
RE: A pure OCaml regular expression library that supports multiple formats (glob, POSIX, Str, etc.). In this lecture, we will use the POSIX format of the RE library.
Str: The standard OCaml ‘Str‘ module. This module is generally **not recommended** because it is relatively slow and does **not support Unicode**.
OCaml REPL
#require "re";; (* only needed in utop *) let str2re t = Re.Posix.compile (Re.Posix.re t);; let r = str2re "[a-z][0-9]+";; (* The pattern r matches strings that starts with a letter followed by one or more digits *) Re.matches r "a12#b22abcd";; [Output] val str2re : string -> Re__.Core.re = <fun> val r : Re__.Core.re = <abstr> - : string list = ["a12"; "b22"]
5.2 Basic Concepts
Regular expressions (regex) provide a formal way of describing patterns in strings. They form the basis for pattern matching, which we will implement in our next project.
5.2.1 Components of Regular Expressions
- Basic Building Blocks
Empty set (∅): Matches nothing
Empty string (ε): Matches an empty string
Symbols (character literals) from the alphabet Σ: Matches any character from the alphabet
- Operations on Building Blocks
Union (|): Provides a choice between patterns
Concatenation (. or implicit): Joins patterns together
Kleene Star (*): Repeats a pattern zero or more times
Building Complex Patterns
By applying these operations recursively, we can construct arbitrarily complex patterns from the basic building blocks. This compositional structure makes regular expressions both powerful and mathematically elegant.
Examples:
"OCaml": Strings are matched exactly. It matches only "OCaml".
"a|b": A vertical bar separates alternatives (Boolean OR). It mathces either "a" or "b".
"ab*": A quantifier (?, *, +, {n}) after an element specifies how many times the element can repeat. ab* matches strings that start with "a", and followed by any number of "b" characters, for example "a","ab","abbbb"
5.2.2 Repetition in Regular Expressions
Pattern |
| Meaning |
| Example |
| Matches |
e* |
| Zero or more occurrences of e |
| a* |
| "",a,aa,aaa, ... |
e+ |
| One or more occurrences of e (same as ee*) |
| a+ |
| a,aa,aaa, ... |
e? |
| Zero or one occurrence of e |
| a? |
| "",a |
e{x} |
| Exactly x occurrences of e |
| a{3} |
| aaa |
e{x,} |
| At least x occurrences of e |
| a{2,} |
| aa,aaa, ... |
e{x,y} |
| At least x and at most y occurrences of e |
| a{2,4} |
| aa, aaa, aaaa |
bc* matches "b", "bc", "bcc", ...
a+b* matches "a", "ab", "aa", "aab", "aabb", "aabbb", "aaa", ...
Precedence:
(OCaml)* matches "", "OCaml", "OCamlOCaml", ...
OCaml* matches "OCam", "OCaml", "OCamlllll", ...
Best practice:Use parentheses to avoid ambiguity. Note that parentheses also have another purpose: they can be used to capture matches (as we will see with capturing groups).
5.2.3 Character Classes
[abcd]: Matches any of ‘{"a", "b", "c", "d"}‘ (can also be written as ‘(a|b|c|d)‘)
[a-zA-Z0-9]: Matches any uppercase letter, lowercase letter or digit
[^0-9]: Matches any character except 0-9 (the ^ means not and must come first)
[\t\n ]: Matches tab, newline, or space
[a-zA-Z_\$][a-zA-Z_\$0-9]*: Matches Java identifiers ($ escaped)
Special Characters
^: Beginning of line
$: End of line
\$: Literal dollar sign
Note: Languages like Ruby and Python provide additional special characters.
- ^
Inside a character class: means "not" (e.g., ‘[^0-9]‘)
Outside a character class: means beginning of line
- ( )
Inside character classes: literal characters - Note: ‘(0..2)‘ does NOT mean 012
Outside character classes: used for grouping
- -
Inside character classes: range (e.g., ‘[a-z]‘ means a to z)
Outside regular expressions: subtraction
5.2.4 Summary of Regular Expression Syntax
re – matches regexp re
(re1|re2) – match either re1 or re2
(re)* – match 0 or more occurrences of re
(re)+ – match 1 or more occurrences of re
(re)? – match 0 or 1 occurrences of re
(re){2} – match exactly two occurrences of re
[a-z] – same as (a|b|c|...|z)
[^0-9] – match any character that is not 0, 1, etc.
^, $ – match start or end of string
You can experiment with regular expressions at rubular.com.1
5.2.5 Regular Expression Examples
Now, we will learm regular expressions through practical examples.
Any string containing two consecutive ab
(ab){2}
Any string containing a or two consecutive b
a|bb
Contains sss or ccc
s{3}|c{3}Contains exactly 2 "b"s, not necessarily consecutive
^[^b]*b[^b]*b[^b]*$Starts with c, followed by one lowercase vowel, and ends with any number of lowercase letters
^c[aeiou][a-z]*$Starts with a and has exactly 0 or 1 letter after that
^a[A-Za-z]?$Only lowercase letters, in any amount, in alphabetic order
^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$Contains one or more ab or ba
(ab|ba)+Precisely steve, steven, or stephen
^ste(ve|ven|phen)$Even length string
^(..)*$Even number of lowercase vowels
^([^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]* )*$Starts with anything but "b", followed by one or more "a"’s and then no other characters
^[^b]a+$Quiz 1: How many different strings could the regex ^Hello, Anyone awake?$ match?
Quiz 1: How many different strings could the regex ^Hello, Anyone awake?$ match?
Answer: 2. The ? makes the preceding e optional, so the regex matches "Hello, Anyone awak" and "Hello, Anyone awake".
Quiz 2: Which regex is not equivalent to the others?
Quiz 2: Which regex is not equivalent to the others?
Answer: ^c?m?s?c?$. This regex can match the empty string and strings of up to 4 characters, while the others each match exactly one character from {c, m, s}.
Quiz 3: Which string does not match the regex [a-z]{4}[0-9]{3}?
Quiz 3: Which string does not match the regex [a-z]{4}[0-9]{3}?
Answer: "cmsc\d\d\d". The backslash and d are literal characters, not digits, so the [0-9]{3} portion is not satisfied.
5.3 Basic Functionsin the Regular Expression Module Re
Module Re provides functions for creating and working with regular expressions, abstracting away the details of regular expression syntax. Below is the list core functions available:
matches: Returns all substrings in a string that match the regular expression as a list
compile: Converts a regular expression into a compiled object that can be used for matching
exec: Tests a string against a compiled regular expression and returns the captured groups, if any
split: Divides a string into segments using the regular expression as the separator
5.4 Extracting Substrings with Capturing Groups
In regular expressions, capturing groups are portions of the pattern enclosed in parentheses (). They let you extract specific parts of a matched string rather than just knowing that it matched. The Re library remembers which strings matched the parenthesized parts of a regular expression. These parts can be referred to as groups.
OCaml REPL
#require "re";; (* only needed in utop *) let str2re t = Re.Posix.compile (Re.Posix.re t);; let pattern = str2re "([0-9]{4})-([0-9]{2})-([0-9]{2})";; let text = "The date is 2026-02-23.";; let t = Re.exec pattern text;; Re.Group.get t 1;; (* year *) Re.Group.get t 2;; (* month *) Re.Group.get t 3;; (* day *) [Output] val str2re : string -> Re__.Core.re = <fun> val pattern : Re__.Core.re = <abstr> val text : string = "The date is 2026-02-23." val t : Re.Group.t = <abstr> - : string = "2026" - : string = "02" - : string = "23"
Example: Capturing Groups
let r = str2re "^Min:([0-9]+) Max:([0-9]+)$";;
let t = Re.exec r "Min:50 Max:99";;Ā Ā
let min = Re.Group.get t 1;; Ā (* 50 *)
let max = Re.Group.get t 2;; Ā (* 99 *)
Input:
Min:1 Max:27
Min:10 Max:30
Min: 11 Max: 30
Min: a Max: 24
Desired Output:
min=1 max=27
min=10 max=30By using capturing groups with parentheses, you can extract the numeric values and reformat the output accordingly.
Quiz 4: What is the output of the following code?
let r = str2re "([A-Z]+)"
let t = Re.exec r "HELP! Iām stuck"
Re.Group.get t 1
Quiz 4: What is the output of the following code?
let r = str2re "([A-Z]+)"
let t = Re.exec r "HELP! Iām stuck"
Re.Group.get t 1Answer: "HELP". The capturing group ([A-Z]+) greedily matches the longest run of uppercase letters at the start of "HELP! I’m stuck".
Quiz 5: What is the output of the following code?
let r = str2re "[0-9] ([A-Za-z]+).*([0-9])";;
let t = Re.exec r "Why was 6 afraid of 7?";;
Re.Group.get t 2
Quiz 5: What is the output of the following code?
let r = str2re "[0-9] ([A-Za-z]+).*([0-9])";;
let t = Re.exec r "Why was 6 afraid of 7?";;
Re.Group.get t 2Answer: "7". Group 2 captures the last ([0-9]), which matches "7" at the end of the substring "6 afraid of 7".
Quiz 6: What is the output of the following code?
let r = str2re "[A-Za-z]{2}";;
Re.matches r "Hello World";;
Quiz 6: What is the output of the following code?
let r = str2re "[A-Za-z]{2}";;
Re.matches r "Hello World";;Answer: ["He"; "ll"; "Wo"; "rl"]. The regex matches exactly 2 letters at a time non-overlappingly; single remaining letters like "o" and "d" are skipped.
Quiz 7: What is the output of the following code?
let r = str2re "[A-Za-z]+";;
Re.matches r "To be, or not to be!";;
Quiz 7: What is the output of the following code?
let r = str2re "[A-Za-z]+";;
Re.matches r "To be, or not to be!";;Answer: ["To"; "be"; "or"; "not"; "to"; "be"]. The regex matches runs of letters, stopping at non-letter characters like commas and !.
5.5 In Class Example
dune build
dune exec bin/main.exeThe default pattern ^[^b]*b[^b]*b[^b]*$ matches all words containing exactly two bs.
let () = let pattern = "^[^b]*b[^b]*b[^b]*$" in Hashtbl.iter (fun word _ -> Printf.printf "%s\n" word) (Regex.match_words pattern)
(* Lists all words from "war and peace" that match the pattern More documentation about the Re library: https://ocaml.org/p/re/latest/doc/Re/index.html *) (* Read a file line by line*) let read_lines file = let cwd = Sys.getcwd () in let file_name = cwd ^ "/" ^ file in let contents = In_channel.with_open_bin file_name In_channel.input_all in String.split_on_char '\n' contents let str2re t = Re.Posix.compile (Re.Posix.re t) (* Return all matched words in a Hash Table*) let match_words p = let lines = read_lines "war_peace.txt" in let delimiter = str2re "[<>/,?!. '-()\":;*]" in let pattern = str2re p in let h = Hashtbl.create 1331 in let _ = List.iter (fun line -> let words = Re.split delimiter line in (* split each line into a words array *) List.fold_left (fun _acc word -> let word = String.lowercase_ascii word in if List.length (Re.matches pattern word) > 0 then Hashtbl.replace h word true else ()) () words) lines in h