Chapter 12
In Chapter 9 we met the facilities in the
string library for matching and catching patterns. These, while
often convenient, are sometimes rather limited when it comes
to searching for alternatives. Furthermore, patterns are there
treated simply as strings; an approach that necessitates
the use of an escape character,
% , which leads
to expressions that are usually hard to read. The
lpeg library, by contrast, provides a more powerful
and expressive notion of pattern. It treats patterns as
objects in their own right, which can be combined and manipulated
with an algebra that uses
the standard symbols for arithmetic.
The whole purpose of a
pattern is that it should
be matched against a string. An assignment
x,y,... = p:match(s,i)
where
p is a pattern,
s is
a string and
i is an optional position
in the string (default value 1), returns
captures from the matching operation or
nil if the match has
failed.
We will see below how to specify what a pattern captures.
The default capture, if no specification is made, is the position after
the last character of the matched substring; in other words
the next position in the string, if there is one, or one more
than the length of the string otherwise. The matching process
maintains a pointer, whose initial value is set at
i , which is updated as the process evolves.
A pattern that fails does not update the pointer.
Here is a simple example.
p = lpeg.P "beau"
defines
p as the pattern that matches the literal
string
"beau" .
p:match "beauty" --> 5
The letter in the 5-th position follows the matched prefix.
In the following
A and
B denote
lpeg patterns.
A + B
When
A + B is matched, first of all the pattern
A is tested for a match at the pointer. If it succeeds
then, without any testing of
B , the pattern
A + B succeeds and the pointer is updated past
the matching substring. If it fails then
B is tested
for a match at the pointer, and the success or failure of
A + B depends on the result. Note that addition of
patterns is not commutative. The order of the summands determines
the order in which they are tested.
A - B
When
A - B is matched, first of all the pattern
B is tested for a match at the pointer. If it succeeds
then
A - B fails and the pointer is not updated.
If it fails, then
A is tested for a match at the pointer,
and the outcome for
A - B is determined by the result.
-A
This pattern succeeds when
A fails and fails when
A succeeds. In neither case is the pointer updated.
The pattern
-(-A) (which may be written as
#A ) behaves just like
A except that the
pointer remains fixed.
A * B
This pattern is matched by first matching
A at
the pointer. If that fails, then
A*B fails, with
no testing of
B . If it succeeds then
B is tested at the updated pointer. If that fails then the pointer
is not updated any further, and
A*B has had partial
success. If that succeeds then the pointer is again updated.
A^n
If
n is a nonnegative integer then this matches
if the string starts with at least
n consecutive
substrings matched by
A and the pointer is updated
past them. If
n is negative then this matches the string
if there are at most
-n consecutive substrings
matched by
A and the pointer is updated past them.
Note that
A^(-1) matches if there is one or no matches
of
A and
A^0 matches if there are as many
matches of
A as you please, including none.
We have seen above how the function
lpeg.P converts
a string to the pattern that matches that string. In fact it can take
many other kinds of argument. So
lpeg.P(true) -- always matches
lpeg.P(false) -- always fails
lpeg.P(n) --[[
if n is a non-negative integer matches
strings of at least n characters. If n is
negative, matches strings that do not have -n
characters.]]
lpeg.P(s) -- matches string s
lpeg.P(p) -- is pattern p
lpeg.P(f) --[[
f is a function of two variables.
When this pattern is matched against a
string the function f is called with first
argument the string and second the pointer
position. The match succeeds if f returns
a number that is a further position in the
string, and the pointer is updated to it;
otherwise the match fails.]]
lpeg.P(t) --[[
when t is a table it is interpreted
as a grammar. The keys of the table denote
nonterminal symbols. The leading rule is
given by t[1] if that is a pattern and
by t[ t[1] ] otherwise.]]
The pattern
lpeg.P(0) matches an empty string,
while
lpeg.P(-1) only matches at the end of a
string.
Here are some other pattern-creating functions.
lpeg.R(r1, ... ) --[[
the arguments are 2-character strings
interpreted as ranges of characters.]]
So, for example,
lpeg.R "ad" is equivalent to
(lpeg.P "a") + (lpeg.P "b") +
(lpeg.P "c") + (lpeg.P "d")
The function
lpeg.S converts a string to a pattern
that matches any single character occurring in the string.
vowel = lpeg.S "aeiouy"
The notation can be greatly simplified by using the fact that
the arithmetic operations will coerce all their other arguments
to patterns, using
lpeg.P once they know that
one of them is a pattern. So if
p is a pattern, then
(p+1)^1 is equivalent
to
(p + lpeg.P(1))^1
We can read this as
look for p and if that fails skip one character
and try again, repeating indefinitely. In other words, we can write:
somewhere = \(p) => (p+1)^1 end -- function
so that
somewhere(lpeg.P "a rose blooms") matches any string in which
"a rose blooms" occurs as a substring. It is important to understand that
patterns in the
lpeg library, unlike those in the
string library, are anchored to a fixed starting position.
Inside a table defining a grammar, the expression
lpeg.V(k) denotes the pattern defining the nonterminal
symbol
k . This is not evaluated until the pattern
given by the grammar is matched. Grammars provide a way
of defining patterns recursively.
When a pattern is matched against a string, the pointer
sweeps through the string, and one can demand of the pattern that
various pieces of information be picked up, processed and returned
as a
capture. The two basic sorts of information
one might want correspond to the questions
where? and
what?. For the former the pattern
lpeg.Cp()
matches an empty string and returns the pointer value as a capture.
For example
do
local P,Cp in lpeg
s = "To lave the feet of the Zatcoon"
somewhere = \(x) => (x+1)^1 end
p = somewhere(Cp()*"Zatcoon")
print(p:match(s)) --> 25
end
For the latter the function
lpeg.C
decorates a pattern by making it capture the matching substring.
p = 16*lpeg.C(7)*"?"
print(p:match "The feet of the Zatcoon?") --> Zatcoon
There is actually a third basic kind of datum - something ad hoc
that does not necessarily have any connection with the string. The
pattern
lpeg.Cc(x)
matches an empty string and returns the value
x as
capture.
Then we have functions which simply process the captures of a pattern.
If
p is a pattern and
f is a function,
the expression
p/f is the pattern modified so that its
captures are obtained as the values returned when the function is
applied to the original captures.
p = (16*lpeg.C(7)*"?")/string.upper
print(p:match "The feet of the Zatcoon?") --> ZATCOON
Alternatively
f can be a table, so long as there
is just one capture. We saw in chapter 5 that tables can be thought
of as functions of a single variable. Or,
f can be a
string with marks
%1 ,
%2 , .... .
The function
lpeg.Ct replaces all the captures of a
pattern by a table containing those captures.
The function
lpeg.Cs replaces a pattern by substituting its nested
captures into the matching substrings of the string it is applied to.
The function
lpeg.Ca creates an
accumulator
capture. Its argument should produce at least one capture,
which becomes the initial value of the accumulator. Any subsequent function
captures are called with the accumulator as first argument, followed
by any other arguments provided, and the returned value becomes the
new value of the accumulator. The final value of the accumulator is the
single value returned as the capture when the resulting pattern matches.
do
local R,Ca in lpeg
local number = R "09"^1/tonumber
local add = \(acc,x) => acc+x end
local sum = Ca(number*(","*number/add)^0)
print(sum:match "1,4,9,16,25") --> 55
end