On this page:
4.1 What Are Regular Expressions?
4.2 Basic Concepts
4.2.1 Repetition in Regular Expressions
4.2.2 Character Classes
4.2.3 Summary of Regular Expression Syntax
4.2.4 Regular Expression Examples
4.2.5 Basic Functions
4.3 Extracting Substrings with Capturing Groups
8.17

4 OCaml Regular Expressions🔗

    4.1 What Are Regular Expressions?

    4.2 Basic Concepts

    4.3 Extracting Substrings with Capturing Groups

The String module in OCaml provides many useful functions for manipulating strings, including:
  • Concatenating two strings

  • Extracting substrings

  • Searching for substrings and replacing them with something else

4.1 What Are Regular Expressions?🔗

A regular expression is a pattern that describes a set of strings. Regular expressions are useful for:
  • Searching and matching text

  • Formally describing strings

  • Defining the symbols (lexemes or tokens) that make up a language

Regular expressions are common across many languages and tools, including sed, grep, awk, Perl, Python, and Ruby. They were popularized (and made fast) as a language feature in Perl and are based on elegant theoretical foundations that we will cover in a future lecture.

There exist multiple Regexp libraries. Most common one are RE and Str.
  • RE: a pure OCaml regular expressions library that supports several formats (glob, posix, str…) In this lecture, we will use the posix format of the RE library

  • Str: OCaml comes with the Str module. This module is not recommended because it is not particularly fast It does not support Unicode

Here is a basic example that uses the RE library.

OCaml REPL

#require "re";; (* only needed in utop *) 
 let str2re t = Re.Posix.compile (Re.Posix.re t);; 
 
 let r = str2re "[a-z][0-9]+";; 
 (* The pattern r matches strings that starts with a letter 
 followed by one or more digits *) 
 Re.matches r "a12#b22abcd";; 
 ;;

# # val str2re : string -> Re__.Core.re = <fun>
# val r : Re__.Core.re = <abstr>
# - : string list = ["a12"; "b22"]
# 

4.2 Basic Concepts🔗

Regular expressions (regex) provide a formal way of describing patterns in strings. They form the foundation for pattern matching, and we will implement them in our next project. Components of Regular Expressions A regular expression is constructed from two types of components:
  • Basic Building Blocks
    • Empty set (∅): Matches nothing

    • Empty string (ε): Matches an empty string

    • Symbols (character literals) from the alphabet: Any character from Σ (the alphabet)

  • Operations on Building Blocks
    • Union (|): Provides a choice between patterns

    • Concatenation (. or implicit): Joins patterns together

    • Kleene Star (*): Repeats a pattern zero or more times

  • Building Complex Patterns

    By applying these operations recursively, we can construct arbitrarily complex patterns from the basic building blocks. This compositional structure makes regular expressions both powerful and mathematically elegant.

4.2.1 Repetition in Regular Expressions🔗

The following are suffixes on a regular expression ‘e‘:

Pattern

 

Meaning

 

Example

 

Matches

e*

 

Zero or more occurrences of e

 

a*

 

"",a,aa,aaa, ...

e+

 

One or more occurrences of e (same as ee*)

 

a+

 

a,aa,aaa, ...

e?

 

Exactly zero or one e

 

a?

 

"",a

e{x}

 

Exactly x occurrences of e

 

a{3}

 

aaa

e{x,}

 

At least x occurrences of e

 

a{2,}

 

aa,aaa, ...

e{x,y}

 

At least x and at most y occurrences of e

 

a{2,4}

 

aa, aaa, aaaa

For exmaple:
  • ‘bc*‘ matches ‘"b"‘, ‘"bc"‘, ‘"bcc"‘, ...

  • ‘a+b*‘ matches ‘"a"‘, ‘"ab"‘, ‘"aa"‘, ‘"aab"‘, ‘"aabb"‘, ‘"aabbb"‘, ‘"aaa"‘, ...

Precedence

Best practice:Use parentheses to disambiguate. Note that parentheses also have another use: to extract matches (as we’ll see with capturing groups).

4.2.2 Character Classes🔗

Character classes allow you to match one character from a set:
  • [abcd]: Matches any of ‘{"a", "b", "c", "d"}‘ (can also be written as ‘(a|b|c|d)‘)

  • [a-zA-Z0-9]: Matches any upper- or lower-case letter or digit

  • [^0-9]: Matches any character except 0-9 (the ‘^‘ means "not" and must come first)

  • [\t\n ]: Matches tab, newline, or space

  • [a-zA-Z_\$][a-zA-Z_\$0-9]*: Matches Java identifiers ($ escaped)

Special Characters

Note: Languages like Ruby and Python provide additional special characters.

Be careful with characters that have different meanings in different contexts:
  • ^
    • Inside a character class: means "not" (e.g., ‘[^0-9]‘)

    • Outside a character class: means beginning of line

  • ( )
    • Inside character classes: literal characters - Note: ‘/(0..2)/‘ does NOT mean 012

    • Outside character classes: used for grouping

  • -
    • Inside character classes: range (e.g., ‘[a-z]‘ means a to z)

    • Outside regular expressions: subtraction

4.2.3 Summary of Regular Expression Syntax🔗

Let ‘re‘ represent an arbitrary pattern:
  • re – matches regexp ‘re‘

  • (re1|re2) – match either ‘re1‘ or ‘re2‘

  • (re)* – match 0 or more occurrences of ‘re‘

  • (re)+ – match 1 or more occurrences of ‘re‘

  • (re)? – match 0 or 1 occurrences of ‘re‘

  • (re){2} – match exactly two occurrences of ‘re‘

  • [a-z] – same as ‘(a|b|c|...|z)‘

  • [^0-9] – match any character that is not 0, 1, etc.

  • ^, $ – match start or end of string

Try out regular expressions at rubular.com

4.2.4 Regular Expression Examples🔗
4.2.5 Basic Functions🔗

4.3 Extracting Substrings with Capturing Groups🔗

Capturing groups allow you to extract specific parts of a matched string. The Re library remembers which strings matched the parenthesized parts of a regular expression. These parts can be referred to as groups.

Example: Capturing Groups


let r = str2re "^Min:([0-9]+) Max:([0-9]+)$";;
let t = Re.exec r "Min:50 Max:99";;   
let min = Re.Group.get t 1;;    (* 50 *)
let max = Re.Group.get t 2;;    (* 99 *)

Input:

Min:1 Max:27
Min:10 Max:30
Min:  11 Max: 30
Min: a Max: 24


Desired Output:

min=1 max=27
min=10 max=30

By using capturing groups with parentheses, you can extract the numeric values and reformat the output accordingly.