In Chapter 9 we met the facilities in the string
library for matching and catching patterns. These, while
often convenient, are sometimes rather limited when it comes
to searching for alternatives. Furthermore, patterns are there
treated simply as strings; an approach that necessitates
the use of an escape character, %
, which leads
to expressions that are usually hard to read. The lpeg
library, by contrast, provides a more powerful
and expressive notion of pattern. It treats patterns as
objects in their own right, which can be combined and manipulated
with an algebra that uses
the standard symbols for arithmetic.
The whole purpose of a pattern
is that it should
be matched against a string. An assignment
x, y, ... = p:match (s, i)
is a pattern, s
a string and i
is an optional position
in the string (default value 1), returns captures
from the matching operation or nil
if the match has failed
. It is possible,
from version 8 onwards, to give further arguments which can be
used to pass contextual information in to a pattern.
We will see below how to specify what a pattern captures.
The default capture, if no specification is made, is the position after
the last character of the matched substring; in other words
the next position in the string, if there is one, or one more
than the length of the string otherwise. The matching process
maintains a pointer, whose initial value is set at i
, which is updated as the process evolves.
A pattern that fails does not update the pointer.
From here on we suppose that we are in the lexical environment
. Here is a simple example.
p = P "beau"
as the pattern that matches the literal
p:match "beauty" --> 5
The letter in the 5-th position follows the matched prefix.
In the following A
A + B
When A + B
is matched, first of all the pattern A
is tested for a match at the pointer. If it succeeds
then, without any testing of B
, the pattern A + B
succeeds and the pointer is updated past
the matching substring. If it fails then B
for a match at the pointer, and the success or failure of A + B
depends on the result. Note that addition of
patterns is not commutative. The order of the summands determines
the order in which they are tested.
A - B
When A - B
is matched, first of all the pattern B
is tested for a match at the pointer. If it succeeds
then A - B
fails and the pointer is not updated.
If it fails, then A
is tested for a match at the pointer,
and the outcome for A - B
is determined by the result.
This pattern succeeds when A
fails and fails when A
succeeds. In neither case is the pointer updated.
The pattern -(-A)
(which may be written as #A
) behaves just like A
except that the
pointer remains fixed.
A * B
This pattern is matched by first matching A
the pointer. If that fails, then A*B
no testing of B
. If it succeeds then B
is tested at the updated pointer. If that fails then the pointer
is not updated any further, and A*B
has had partial
success. If that succeeds then the pointer is again updated.
is a nonnegative integer then this matches
if the string starts with at least n
substrings matched by A
and the pointer is updated
past them. If n
is negative then this matches the string
if there are at most -n
matched by A
and the pointer is updated past them.
Note that A^(-1)
matches if there is one or no matches
matches if there are as many
matches of A
as you please, including none.
We have seen above how the function P
a string to the pattern that matches that string. In fact it can take
many other kinds of argument. So
P (true) -- always matches
P (false) -- always fails
P (n) --[[
if n is a non-negative integer matches
strings of at least n characters. If n is
negative, matches strings that do not have -n
P (s) -- matches string s
P (p) -- is pattern p
P (f) --[[
f is a function of two variables.
When this pattern is matched against a
string the function f is called with first
argument the string and second the pointer
position. The match succeeds if f returns
a number that is a further position in the
string, and the pointer is updated to it;
otherwise the match fails.]]
P (t) --[[
when t is a table it is interpreted
as a grammar. The keys of the table denote
nonterminal symbols. The leading rule is
given by t if that is a pattern and
by t[ t ] otherwise.]]
The pattern P(0)
matches an empty string,
only matches at the end of a
Here are some other pattern-creating functions.
R (r1, ... ) --[[
the arguments are 2-character strings
interpreted as ranges of characters.]]
So, for example, R "ad"
is equivalent to
(P "a") + (P "b") +
(P "c") + (P "d")
The function S
converts a string to a pattern
that matches any single character occurring in the string.
vowel = S "aeiouy"
The notation can be greatly simplified by using the fact that
the arithmetic operations will coerce all their other arguments
to patterns, using P
once they know that
one of them is a pattern. So if p
is a pattern, then (p + 1)^1
(p + P (1))^1
We can read this as look for p and if that fails skip one character
and try again, repeating indefinitely
. In other words, we can write:
somewhere = \ (p) => (p + 1)^1 end -- function
so that somewhere (P "a rose blooms")
matches any string in which "a rose blooms"
occurs as a substring. It is important to understand that
patterns in the lpeg
library, unlike those in the string
library, are anchored to a fixed starting position.
Inside a table defining a grammar, the expression V (k)
denotes the pattern defining the nonterminal
. This is not evaluated until the pattern
given by the grammar is matched. Grammars provide a way
of defining patterns recursively.
When a pattern is matched against a string, the pointer
sweeps through the string, and one can demand of the pattern that
various pieces of information be picked up, processed and returned
as a capture
. The two basic sorts of information
one might want correspond to the questions where?
. For the former the pattern
Cp ( )
matches an empty string and returns the pointer value as a capture.
local P, Cp in lpeg
s = "To lave the feet of the Zatcoon"
somewhere = \ (x) => (x+1)^1 end
p = somewhere (Cp ( ) * "Zatcoon")
print (p:match (s)) --> 25
For the latter the function
decorates a pattern by making it capture the matching substring.
p = 16 * C (7) * "?"
print (p:match "The feet of the Zatcoon?") --> Zatcoon
There is actually a third basic kind of datum - something ad hoc
that does not necessarily have any connection with the string. The
matches an empty string and returns the value x
Then we have functions which simply process the captures of a pattern.
is a pattern and f
is a function,
the expression p / f
is the pattern modified so that its
captures are obtained as the values returned when the function is
applied to the original captures.
p = (16 * C (7) * "?") / string.upper
print (p:match "The feet of the Zatcoon?") --> ZATCOON
can be a table, so long as there
is just one capture. We saw in chapter 5 that tables can be thought
of as functions of a single variable. Or, f
can be a
string with marks %1
, .... .
The function Ct
replaces all the captures of a
pattern by a table containing those captures.
The function Cs
replaces a pattern by substituting its nested
captures into the matching substrings of the string it is applied to.
The function Ca
creates an accumulator
. Its argument should produce at least one capture,
which becomes the initial value of the accumulator. Any subsequent function
captures are called with the accumulator as first argument, followed
by any other arguments provided, and the returned value becomes the
new value of the accumulator. The final value of the accumulator is the
single value returned as the capture when the resulting pattern matches.
local R, Ca in lpeg
local number = R "09"^1 / tonumber
local add = \ (acc, x) => acc + x end
local sum = Ca (number * ("," * number / add)^0)
print (sum:match "1,4,9,16,25") --> 55
For the full details see the Lpeg manual