4 OCaml Regular Expressions

8.17

contents ← prev up next →

4 OCaml Regular Expressions🔗

4.1 What Are Regular Expressions?

4.2 Basic Concepts

4.3 Extracting Substrings with Capturing Groups

In-class Example Code

The String module in OCaml provides many useful functions for manipulating strings, including:

Concatenating two strings
Extracting substrings
Searching for substrings and replacing them with something else

4.1 What Are Regular Expressions?🔗

A regular expression is a pattern that describes a set of strings. Regular expressions are useful for:

Searching and matching text
Formally describing strings
Defining the symbols (lexemes or tokens) that make up a language

Regular expressions are common across many languages and tools, including sed, grep, awk, Perl, Python, and Ruby. They were popularized (and made fast) as a language feature in Perl and are based on elegant theoretical foundations that we will cover in a future lecture.

There exist multiple Regexp libraries. Most common one are RE and Str.

RE: a pure OCaml regular expressions library that supports several formats (glob, posix, str…) In this lecture, we will use the posix format of the RE library
Str: OCaml comes with the Str module. This module is not recommended because it is not particularly fast It does not support Unicode

Here is a basic example that uses the RE library.

OCaml REPL

#require "re";; (* only needed in utop *) 
 let str2re t = Re.Posix.compile (Re.Posix.re t);; 
 
 let r = str2re "[a-z][0-9]+";; 
 (* The pattern r matches strings that starts with a letter 
 followed by one or more digits *) 
 Re.matches r "a12#b22abcd";; 
 ;;

# # val str2re : string -> Re__.Core.re = <fun>
# val r : Re__.Core.re = <abstr>
# - : string list = ["a12"; "b22"]
#

4.2 Basic Concepts🔗

Regular expressions (regex) provide a formal way of describing patterns in strings. They form the foundation for pattern matching, and we will implement them in our next project. Components of Regular Expressions A regular expression is constructed from two types of components:

Basic Building Blocks
- Empty set (∅): Matches nothing
- Empty string (ε): Matches an empty string
- Symbols (character literals) from the alphabet: Any character from Σ (the alphabet)
Operations on Building Blocks
- Union (|): Provides a choice between patterns
- Concatenation (. or implicit): Joins patterns together
- Kleene Star (*): Repeats a pattern zero or more times
Building Complex Patterns
By applying these operations recursively, we can construct arbitrarily complex patterns from the basic building blocks. This compositional structure makes regular expressions both powerful and mathematically elegant.

"OCaml": Strings are matched exactly
"a|b": A vertical bar separates alternatives (Boolean OR)
"ab*": A quantifier (‘?‘, ‘*‘, ‘+‘, ‘{n}‘) after an element specifies how many times the element is allowed to repeat

4.2.1 Repetition in Regular Expressions🔗

The following are suffixes on a regular expression ‘e‘:

Pattern	Meaning	Example	Matches
e*	Zero or more occurrences of e	a*	"",a,aa,aaa, ...
e+	One or more occurrences of e (same as ee*)	a+	a,aa,aaa, ...
e?	Exactly zero or one e	a?	"",a
e{x}	Exactly x occurrences of e	a{3}	aaa
e{x,}	At least x occurrences of e	a{2,}	aa,aaa, ...
e{x,y}	At least x and at most y occurrences of e	a{2,4}	aa, aaa, aaaa

For exmaple:

‘bc*‘ matches ‘"b"‘, ‘"bc"‘, ‘"bcc"‘, ...
‘a+b*‘ matches ‘"a"‘, ‘"ab"‘, ‘"aa"‘, ‘"aab"‘, ‘"aabb"‘, ‘"aabbb"‘, ‘"aaa"‘, ...

Precedence

(OCaml)* means "", "OCaml", "OCamlOCaml", ...
OCaml* means "OCam", "OCaml", "OCamlllll", ...

Best practice:Use parentheses to disambiguate. Note that parentheses also have another use: to extract matches (as we’ll see with capturing groups).

4.2.2 Character Classes🔗

Character classes allow you to match one character from a set:

[abcd]: Matches any of ‘{"a", "b", "c", "d"}‘ (can also be written as ‘(a|b|c|d)‘)
[a-zA-Z0-9]: Matches any upper- or lower-case letter or digit
[^0-9]: Matches any character except 0-9 (the ‘^‘ means "not" and must come first)
[\t\n ]: Matches tab, newline, or space
[a-zA-Z_\$][a-zA-Z_\$0-9]*: Matches Java identifiers ($ escaped)

Special Characters

^: Beginning of line
$: End of line
\$: Literal dollar sign

Note: Languages like Ruby and Python provide additional special characters.

Be careful with characters that have different meanings in different contexts:

^
- Inside a character class: means "not" (e.g., ‘[^0-9]‘)
- Outside a character class: means beginning of line
( )
- Inside character classes: literal characters - Note: ‘/(0..2)/‘ does NOT mean 012
- Outside character classes: used for grouping
-
- Inside character classes: range (e.g., ‘[a-z]‘ means a to z)
- Outside regular expressions: subtraction

4.2.3 Summary of Regular Expression Syntax🔗

Let ‘re‘ represent an arbitrary pattern:

re – matches regexp ‘re‘
(re1|re2) – match either ‘re1‘ or ‘re2‘
(re)* – match 0 or more occurrences of ‘re‘
(re)+ – match 1 or more occurrences of ‘re‘
(re)? – match 0 or 1 occurrences of ‘re‘
(re){2} – match exactly two occurrences of ‘re‘
[a-z] – same as ‘(a|b|c|...|z)‘
[^0-9] – match any character that is not 0, 1, etc.
^, $ – match start or end of string

Try out regular expressions at rubular.com

4.2.4 Regular Expression Examples🔗

Any string containing two consecutive ab
(ab){2}
Any string containing a or two consecutive b
a|bb
Contains sss or ccc
s{3}|c{3}
Contains exactly 2 ‘b‘’s, not necessarily consecutive
^[^b]*b[^b]*b[^b]*$
Starts with c, followed by one lowercase vowel, and ends with any number of lowercase letters
^c[aeiou][a-z]*$
Starts with a and has exactly 0 or 1 letter after that
^a[A-Za-z]?$
Only lowercase letters, in any amount, in alphabetic order
^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
Contains one or more ab or ba
(ab|ba)+
Precisely steve, steven, or stephen
^ste(ve|ven|phen)$
Even length string
^(..)*$
Even number of lowercase vowels
^([^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*)*$
Starts with anything but ‘b‘, followed by one or more ‘a‘’s and then no other characters
^[^b]a+$

4.2.5 Basic Functions🔗

matches: Extracts all matched substrings as a list
compile: Compiles a regular expression into an executable version that can be used to match strings
exec: Matches a string against the compiled expression and returns the matched groups if any
split: Splits a string into chunks separated by the regular expression

4.3 Extracting Substrings with Capturing Groups🔗

Capturing groups allow you to extract specific parts of a matched string. The Re library remembers which strings matched the parenthesized parts of a regular expression. These parts can be referred to as groups.

Example: Capturing Groups


let r = str2re "^Min:([0-9]+) Max:([0-9]+)$";;
let t = Re.exec r "Min:50 Max:99";;   
let min = Re.Group.get t 1;;    (* 50 *)
let max = Re.Group.get t 2;;    (* 99 *)

Input:

Min:1 Max:27
Min:10 Max:30
Min:  11 Max: 30
Min: a Max: 24


Desired Output:

min=1 max=27
min=10 max=30

By using capturing groups with parentheses, you can extract the numeric values and reformat the output accordingly.

contents ← prev up next →

1	Functional Programming with OCaml
2	Imperative Programming with OCaml
3	Property-Based Randomized Testing
4	OCaml Regular Expressions
5	Finite Automata
6	Context Free Grammars
7	Parsing
8	Operational Semantics
9	Type Checking

4.1	What Are Regular Expressions?
4.2	Basic Concepts
4.3	Extracting Substrings with Capturing Groups