On this page:
5.1 What Are Regular Expressions?
5.2 Basic Concepts
5.2.1 Components of Regular Expressions
5.2.2 Repetition in Regular Expressions
5.2.3 Character Classes
5.2.4 Summary of Regular Expression Syntax
5.2.5 Regular Expression Examples
5.3 Basic Functionsin the Regular Expression Module Re
5.4 Extracting Substrings with Capturing Groups
5.5 In Class Example
9.0

5 Regular ExpressionsšŸ”—

    5.1 What Are Regular Expressions?

    5.2 Basic Concepts

    5.3 Basic Functionsin the Regular Expression Module Re

    5.4 Extracting Substrings with Capturing Groups

    5.5 In Class Example

The String module in OCaml offers many useful functions for working with strings, including:
  • Concatenating two strings

  • Extracting substrings

  • Searching for substrings and replacing them

However, what if we want to match more complex patterns? For example:
  • Names like Steve, Stephen, Steven, Stefan, or Esteve

  • Words containing an even number of vowels

For these kinds of tasks, we need regular expressions.

5.1 What Are Regular Expressions?šŸ”—

A regular expression is a pattern that describes a set of strings. Regular expressions are useful for:
  • Searching and matching text

  • Formally describing strings

  • Defining the symbols (lexemes or tokens) that make up a language

Regular expressions are widely supported in modern programming languages and tools. They were popularized—and optimized for performance—as a language feature in Perl, and they are grounded in elegant theoretical foundations, which we will explore in a future lecture.

There are several regular expression libraries available in OCaml, with the most common being RE and Str:
  • RE: A pure OCaml regular expression library that supports multiple formats (glob, POSIX, Str, etc.). In this lecture, we will use the POSIX format of the RE library.

  • Str: The standard OCaml ‘Str‘ module. This module is generally **not recommended** because it is relatively slow and does **not support Unicode**.

Here is a basic example that uses the RE library.

OCaml REPL

#require "re";; (* only needed in utop *) 
 let str2re t = Re.Posix.compile (Re.Posix.re t);; 
 
 let r = str2re "[a-z][0-9]+";; 
 (* The pattern r matches strings that starts with a letter 
 followed by one or more digits *) 
 Re.matches r "a12#b22abcd";;

[Output]
val str2re : string -> Re__.Core.re = <fun>
val r : Re__.Core.re = <abstr>
- : string list = ["a12"; "b22"]

5.2 Basic ConceptsšŸ”—

Regular expressions (regex) provide a formal way of describing patterns in strings. They form the basis for pattern matching, which we will implement in our next project.

5.2.1 Components of Regular ExpressionsšŸ”—

A regular expression is constructed from two types of components:
  • Basic Building Blocks
    • Empty set (∅): Matches nothing

    • Empty string (ε): Matches an empty string

    • Symbols (character literals) from the alphabet Σ: Matches any character from the alphabet

  • Operations on Building Blocks
    • Union (|): Provides a choice between patterns

    • Concatenation (. or implicit): Joins patterns together

    • Kleene Star (*): Repeats a pattern zero or more times

  • Building Complex Patterns

    By applying these operations recursively, we can construct arbitrarily complex patterns from the basic building blocks. This compositional structure makes regular expressions both powerful and mathematically elegant.

  • Examples:

    • "OCaml": Strings are matched exactly. It matches only "OCaml".

    • "a|b": A vertical bar separates alternatives (Boolean OR). It mathces either "a" or "b".

    • "ab*": A quantifier (?, *, +, {n}) after an element specifies how many times the element can repeat. ab* matches strings that start with "a", and followed by any number of "b" characters, for example "a","ab","abbbb"

5.2.2 Repetition in Regular ExpressionsšŸ”—

A quantifier (?, *, +, {n}) following an element specifies how many times that element can repeat. The table below shows the repetiton patterns for a regular expression e:

Pattern

 

Meaning

 

Example

 

Matches

e*

 

Zero or more occurrences of e

 

a*

 

"",a,aa,aaa, ...

e+

 

One or more occurrences of e (same as ee*)

 

a+

 

a,aa,aaa, ...

e?

 

Zero or one occurrence of e

 

a?

 

"",a

e{x}

 

Exactly x occurrences of e

 

a{3}

 

aaa

e{x,}

 

At least x occurrences of e

 

a{2,}

 

aa,aaa, ...

e{x,y}

 

At least x and at most y occurrences of e

 

a{2,4}

 

aa, aaa, aaaa

For exmaple:
  • bc* matches "b", "bc", "bcc", ...

  • a+b* matches "a", "ab", "aa", "aab", "aabb", "aabbb", "aaa", ...

Precedence:

  • (OCaml)* matches "", "OCaml", "OCamlOCaml", ...

  • OCaml* matches "OCam", "OCaml", "OCamlllll", ...

Best practice:Use parentheses to avoid ambiguity. Note that parentheses also have another purpose: they can be used to capture matches (as we will see with capturing groups).

5.2.3 Character ClassesšŸ”—

Character classes allow you to match one character from a set:
  • [abcd]: Matches any of ‘{"a", "b", "c", "d"}‘ (can also be written as ‘(a|b|c|d)‘)

  • [a-zA-Z0-9]: Matches any uppercase letter, lowercase letter or digit

  • [^0-9]: Matches any character except 0-9 (the ^ means not and must come first)

  • [\t\n ]: Matches tab, newline, or space

  • [a-zA-Z_\$][a-zA-Z_\$0-9]*: Matches Java identifiers ($ escaped)

Special Characters

  • ^: Beginning of line

  • $: End of line

  • \$: Literal dollar sign

Note: Languages like Ruby and Python provide additional special characters.

Be careful with characters that have different meanings in different contexts:
  • ^
    • Inside a character class: means "not" (e.g., ‘[^0-9]‘)

    • Outside a character class: means beginning of line

  • ( )
    • Inside character classes: literal characters - Note: ‘(0..2)‘ does NOT mean 012

    • Outside character classes: used for grouping

  • -
    • Inside character classes: range (e.g., ‘[a-z]‘ means a to z)

    • Outside regular expressions: subtraction

5.2.4 Summary of Regular Expression SyntaxšŸ”—

Let re represent an arbitrary pattern:
  • re – matches regexp re

  • (re1|re2) – match either re1 or re2

  • (re)* – match 0 or more occurrences of re

  • (re)+ – match 1 or more occurrences of re

  • (re)? – match 0 or 1 occurrences of re

  • (re){2} – match exactly two occurrences of re

  • [a-z] – same as (a|b|c|...|z)

  • [^0-9] – match any character that is not 0, 1, etc.

  • ^, $ – match start or end of string

You can experiment with regular expressions at rubular.com.1

5.2.5 Regular Expression ExamplesšŸ”—

Now, we will learm regular expressions through practical examples.

Any string containing two consecutive ab

(ab){2}

Any string containing a or two consecutive b

a|bb

Contains sss or ccc

s{3}|c{3}

Contains exactly 2 "b"s, not necessarily consecutive

^[^b]*b[^b]*b[^b]*$

Starts with c, followed by one lowercase vowel, and ends with any number of lowercase letters

^c[aeiou][a-z]*$

Starts with a and has exactly 0 or 1 letter after that

^a[A-Za-z]?$

Only lowercase letters, in any amount, in alphabetic order

^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$

Contains one or more ab or ba

(ab|ba)+

Precisely steve, steven, or stephen

^ste(ve|ven|phen)$

Even length string

^(..)*$

Even number of lowercase vowels

^([^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]* )*$

Starts with anything but "b", followed by one or more "a"’s and then no other characters

^[^b]a+$

Quiz 1: How many different strings could the regex ^Hello, Anyone awake?$ match?

Answer: 2. The ? makes the preceding e optional, so the regex matches "Hello, Anyone awak" and "Hello, Anyone awake".

Quiz 2: Which regex is not equivalent to the others?

Answer: ^c?m?s?c?$. This regex can match the empty string and strings of up to 4 characters, while the others each match exactly one character from {c, m, s}.

Quiz 3: Which string does not match the regex [a-z]{4}[0-9]{3}?

Answer: "cmsc\d\d\d". The backslash and d are literal characters, not digits, so the [0-9]{3} portion is not satisfied.

5.3 Basic Functionsin the Regular Expression Module RešŸ”—

Module Re provides functions for creating and working with regular expressions, abstracting away the details of regular expression syntax. Below is the list core functions available:

  • matches: Returns all substrings in a string that match the regular expression as a list

  • compile: Converts a regular expression into a compiled object that can be used for matching

  • exec: Tests a string against a compiled regular expression and returns the captured groups, if any

  • split: Divides a string into segments using the regular expression as the separator

5.4 Extracting Substrings with Capturing GroupsšŸ”—

In regular expressions, capturing groups are portions of the pattern enclosed in parentheses (). They let you extract specific parts of a matched string rather than just knowing that it matched. The Re library remembers which strings matched the parenthesized parts of a regular expression. These parts can be referred to as groups.

Example:

OCaml REPL

#require "re";; (* only needed in utop *) 
 let str2re t = Re.Posix.compile (Re.Posix.re t);; 
 let pattern = str2re "([0-9]{4})-([0-9]{2})-([0-9]{2})";; 
 let text = "The date is 2026-02-23.";; 
 let t = Re.exec pattern text;; 
 Re.Group.get t 1;; (* year *) 
 Re.Group.get t 2;; (* month *) 
 Re.Group.get t 3;; (* day *)

[Output]
val str2re : string -> Re__.Core.re = <fun>
val pattern : Re__.Core.re = <abstr>
val text : string = "The date is 2026-02-23."
val t : Re.Group.t = <abstr>
- : string = "2026"
- : string = "02"
- : string = "23"

Example: Capturing Groups


let r = str2re "^Min:([0-9]+) Max:([0-9]+)$";;
let t = Re.exec r "Min:50 Max:99";;Ā  Ā 
let min = Re.Group.get t 1;; Ā   (* 50 *)
let max = Re.Group.get t 2;; Ā   (* 99 *)

Input:

Min:1 Max:27
Min:10 Max:30
Min:  11 Max: 30
Min: a Max: 24


Desired Output:

min=1 max=27
min=10 max=30

By using capturing groups with parentheses, you can extract the numeric values and reformat the output accordingly.

Quiz 4: What is the output of the following code?

let r = str2re "([A-Z]+)"
let t = Re.exec r "HELP! I’m stuck"
Re.Group.get t 1

Answer: "HELP". The capturing group ([A-Z]+) greedily matches the longest run of uppercase letters at the start of "HELP! I’m stuck".

Quiz 5: What is the output of the following code?

let r = str2re "[0-9] ([A-Za-z]+).*([0-9])";;
let t = Re.exec r "Why was 6 afraid of 7?";;
Re.Group.get t 2

Answer: "7". Group 2 captures the last ([0-9]), which matches "7" at the end of the substring "6 afraid of 7".

Quiz 6: What is the output of the following code?

let r = str2re "[A-Za-z]{2}";;
Re.matches r "Hello World";;

Answer: ["He"; "ll"; "Wo"; "rl"]. The regex matches exactly 2 letters at a time non-overlappingly; single remaining letters like "o" and "d" are skipped.

Quiz 7: What is the output of the following code?

let r = str2re "[A-Za-z]+";;
Re.matches r "To be, or not to be!";;

Answer: ["To"; "be"; "or"; "not"; "to"; "be"]. The regex matches runs of letters, stopping at non-letter characters like commas and !.

5.5 In Class ExamplešŸ”—

This example searches War and Peace for words that match a regex pattern using OCaml’s Re.Posix library. Words are lowercased and deduplicated before matching, and delimiters (spaces, punctuation, etc.) are used to split lines into words.
To try a different pattern, edit the pattern string in bin/main.ml, then build and run with:

dune build
dune exec bin/main.exe

The default pattern ^[^b]*b[^b]*b[^b]*$ matches all words containing exactly two bs.

regex/bin/main.ml

let () =
  let pattern = "^[^b]*b[^b]*b[^b]*$" in
  Hashtbl.iter
    (fun word _ -> Printf.printf "%s\n" word)
    (Regex.match_words pattern)

regex/lib/regex.ml

(*
  Lists all words from "war and peace" that match the pattern

  More documentation about the Re library: 
  https://ocaml.org/p/re/latest/doc/Re/index.html
*)

(* Read a file line by line*)
let read_lines file =
  let cwd = Sys.getcwd () in
  let file_name = cwd ^ "/" ^ file in
  let contents = In_channel.with_open_bin file_name In_channel.input_all in
  String.split_on_char '\n' contents

let str2re t = Re.Posix.compile (Re.Posix.re t)

(* Return all matched words in a Hash Table*)
let match_words p =
  let lines = read_lines "war_peace.txt" in
  let delimiter = str2re "[<>/,?!. '-()\":;*]" in
  let pattern = str2re p in
  let h = Hashtbl.create 1331 in
  let _ =
    List.iter
      (fun line ->
        let words = Re.split delimiter line in
        (* split each line into a words array *)
        List.fold_left
          (fun _acc word ->
            let word = String.lowercase_ascii word in
            if List.length (Re.matches pattern word) > 0 then
              Hashtbl.replace h word true
            else ())
          () words)
      lines
  in
  h