4 OCaml Regular Expressions
Concatenating two strings
Extracting substrings
Searching for substrings and replacing them with something else
4.1 What Are Regular Expressions?
Searching and matching text
Formally describing strings
Defining the symbols (lexemes or tokens) that make up a language
Regular expressions are common across many languages and tools, including sed, grep, awk, Perl, Python, and Ruby. They were popularized (and made fast) as a language feature in Perl and are based on elegant theoretical foundations that we will cover in a future lecture.
RE: a pure OCaml regular expressions library that supports several formats (glob, posix, str…) In this lecture, we will use the posix format of the RE library
Str: OCaml comes with the Str module. This module is not recommended because it is not particularly fast It does not support Unicode
OCaml REPL
#require "re";; (* only needed in utop *) let str2re t = Re.Posix.compile (Re.Posix.re t);; let r = str2re "[a-z][0-9]+";; (* The pattern r matches strings that starts with a letter followed by one or more digits *) Re.matches r "a12#b22abcd";; ;; # # val str2re : string -> Re__.Core.re = <fun> # val r : Re__.Core.re = <abstr> # - : string list = ["a12"; "b22"] #
4.2 Basic Concepts
- Basic Building Blocks
Empty set (∅): Matches nothing
Empty string (ε): Matches an empty string
Symbols (character literals) from the alphabet: Any character from Σ (the alphabet)
- Operations on Building Blocks
Union (|): Provides a choice between patterns
Concatenation (. or implicit): Joins patterns together
Kleene Star (*): Repeats a pattern zero or more times
Building Complex Patterns
By applying these operations recursively, we can construct arbitrarily complex patterns from the basic building blocks. This compositional structure makes regular expressions both powerful and mathematically elegant.
"OCaml": Strings are matched exactly
"a|b": A vertical bar separates alternatives (Boolean OR)
"ab*": A quantifier (‘?‘, ‘*‘, ‘+‘, ‘{n}‘) after an element specifies how many times the element is allowed to repeat
4.2.1 Repetition in Regular Expressions
Pattern |
| Meaning |
| Example |
| Matches |
e* |
| Zero or more occurrences of e |
| a* |
| "",a,aa,aaa, ... |
e+ |
| One or more occurrences of e (same as ee*) |
| a+ |
| a,aa,aaa, ... |
e? |
| Exactly zero or one e |
| a? |
| "",a |
e{x} |
| Exactly x occurrences of e |
| a{3} |
| aaa |
e{x,} |
| At least x occurrences of e |
| a{2,} |
| aa,aaa, ... |
e{x,y} |
| At least x and at most y occurrences of e |
| a{2,4} |
| aa, aaa, aaaa |
‘bc*‘ matches ‘"b"‘, ‘"bc"‘, ‘"bcc"‘, ...
‘a+b*‘ matches ‘"a"‘, ‘"ab"‘, ‘"aa"‘, ‘"aab"‘, ‘"aabb"‘, ‘"aabbb"‘, ‘"aaa"‘, ...
Precedence
(OCaml)* means "", "OCaml", "OCamlOCaml", ...
OCaml* means "OCam", "OCaml", "OCamlllll", ...
Best practice:Use parentheses to disambiguate. Note that parentheses also have another use: to extract matches (as we’ll see with capturing groups).
4.2.2 Character Classes
[abcd]: Matches any of ‘{"a", "b", "c", "d"}‘ (can also be written as ‘(a|b|c|d)‘)
[a-zA-Z0-9]: Matches any upper- or lower-case letter or digit
[^0-9]: Matches any character except 0-9 (the ‘^‘ means "not" and must come first)
[\t\n ]: Matches tab, newline, or space
[a-zA-Z_\$][a-zA-Z_\$0-9]*: Matches Java identifiers ($ escaped)
Special Characters
^: Beginning of line
$: End of line
\$: Literal dollar sign
Note: Languages like Ruby and Python provide additional special characters.
- ^
Inside a character class: means "not" (e.g., ‘[^0-9]‘)
Outside a character class: means beginning of line
- ( )
Inside character classes: literal characters - Note: ‘/(0..2)/‘ does NOT mean 012
Outside character classes: used for grouping
- -
Inside character classes: range (e.g., ‘[a-z]‘ means a to z)
Outside regular expressions: subtraction
4.2.3 Summary of Regular Expression Syntax
re – matches regexp ‘re‘
(re1|re2) – match either ‘re1‘ or ‘re2‘
(re)* – match 0 or more occurrences of ‘re‘
(re)+ – match 1 or more occurrences of ‘re‘
(re)? – match 0 or 1 occurrences of ‘re‘
(re){2} – match exactly two occurrences of ‘re‘
[a-z] – same as ‘(a|b|c|...|z)‘
[^0-9] – match any character that is not 0, 1, etc.
^, $ – match start or end of string
Try out regular expressions at rubular.com
4.2.4 Regular Expression Examples
Any string containing two consecutive ab
(ab){2}
Any string containing a or two consecutive b
a|bb
Contains sss or ccc
s{3}|c{3}
Contains exactly 2 ‘b‘’s, not necessarily consecutive
^[^b]*b[^b]*b[^b]*$
Starts with c, followed by one lowercase vowel, and ends with any number of lowercase letters
^c[aeiou][a-z]*$
Starts with a and has exactly 0 or 1 letter after that
^a[A-Za-z]?$
Only lowercase letters, in any amount, in alphabetic order
^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
Contains one or more ab or ba
(ab|ba)+
Precisely steve, steven, or stephen
^ste(ve|ven|phen)$
Even length string
^(..)*$
Even number of lowercase vowels
^([^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*)*$
Starts with anything but ‘b‘, followed by one or more ‘a‘’s and then no other characters
^[^b]a+$
4.2.5 Basic Functions
matches: Extracts all matched substrings as a list
compile: Compiles a regular expression into an executable version that can be used to match strings
exec: Matches a string against the compiled expression and returns the matched groups if any
split: Splits a string into chunks separated by the regular expression
4.3 Extracting Substrings with Capturing Groups
Capturing groups allow you to extract specific parts of a matched string. The Re library remembers which strings matched the parenthesized parts of a regular expression. These parts can be referred to as groups.
Example: Capturing Groups
let r = str2re "^Min:([0-9]+) Max:([0-9]+)$";;
let t = Re.exec r "Min:50 Max:99";;Â Â
let min = Re.Group.get t 1;; Â (* 50 *)
let max = Re.Group.get t 2;; Â (* 99 *)
Input:
Min:1 Max:27
Min:10 Max:30
Min: 11 Max: 30
Min: a Max: 24
Desired Output:
min=1 max=27
min=10 max=30By using capturing groups with parentheses, you can extract the numeric values and reformat the output accordingly.