title logo

back contents

Chapter 12

lpeg library

In Chapter 9 we met the facilities in the  string  library for matching and catching patterns. These, while often convenient, are sometimes rather limited when it comes to searching for alternatives. Furthermore, patterns are there treated simply as strings; an approach that necessitates the use of an escape character,  % , which leads to expressions that are usually hard to read. The  lpeg  library, by contrast, provides a more powerful and expressive notion of pattern. It treats patterns as objects in their own right, which can be combined and manipulated with an algebra that uses the standard symbols for arithmetic.

The whole purpose of a pattern is that it should be matched against a string. An assignment

       x, y, ... = p:match (s, i)
where  p  is a pattern,  s  is a string and  i  is an optional position in the string (default value 1), returns captures from the matching operation or  nil  if the match has failed . It is possible, from version 8 onwards, to give further arguments which can be used to pass contextual information in to a pattern. We will see below how to specify what a pattern captures. The default capture, if no specification is made, is the position after the last character of the matched substring; in other words the next position in the string, if there is one, or one more than the length of the string otherwise. The matching process maintains a pointer, whose initial value is set at  i , which is updated as the process evolves. A pattern that fails does not update the pointer.

From here on we suppose that we are in the lexical environment of  lpeg . Here is a simple example.

       p = P "beau"
defines  p  as the pattern that matches the literal string  "beau" .
      p:match "beauty" --> 5
The letter in the 5-th position follows the matched prefix.
combining lpeg patterns

In the following  A  and  B  denote lpeg patterns.

A + B
When  A + B  is matched, first of all the pattern  A  is tested for a match at the pointer. If it succeeds then, without any testing of  B , the pattern  A + B  succeeds and the pointer is updated past the matching substring. If it fails then  B  is tested for a match at the pointer, and the success or failure of  A + B  depends on the result. Note that addition of patterns is not commutative. The order of the summands determines the order in which they are tested.
A - B
When  A - B  is matched, first of all the pattern  B  is tested for a match at the pointer. If it succeeds then  A - B  fails and the pointer is not updated. If it fails, then  A  is tested for a match at the pointer, and the outcome for  A - B  is determined by the result.
-A
This pattern succeeds when  A  fails and fails when  A  succeeds. In neither case is the pointer updated. The pattern  -(-A)  (which may be written as  #A ) behaves just like  A  except that the pointer remains fixed.
A * B
This pattern is matched by first matching  A  at the pointer. If that fails, then  A*B  fails, with no testing of  B . If it succeeds then  B  is tested at the updated pointer. If that fails then the pointer is not updated any further, and  A*B  has had partial success. If that succeeds then the pointer is again updated.
A^n
If  n  is a nonnegative integer then this matches if the string starts with at least  n  consecutive substrings matched by  A  and the pointer is updated past them. If  n  is negative then this matches the string if there are at most  -n  consecutive substrings matched by  A  and the pointer is updated past them. Note that  A^(-1)  matches if there is one or no matches of  A  and  A^0  matches if there are as many matches of  A  as you please, including none.
creating lpeg patterns

We have seen above how the function  P  converts a string to the pattern that matches that string. In fact it can take many other kinds of argument. So

P (true)      -- always matches
P (false)     -- always fails
P (n)         --[[
 if n is a non-negative integer matches
 strings of at least n characters. If n is
 negative, matches strings that do not have -n
 characters.]]
P (s)      -- matches string s
P (p)      -- is pattern p
P (f)      --[[
 f is a function of two variables.
 When this pattern is matched against a
 string the function f is called with first
 argument the string and second the pointer
 position. The match succeeds if f returns
 a number that is a further position in the
 string, and the pointer is updated to it;
 otherwise the match fails.]]
P (t)         --[[
 when t is a table it is interpreted
 as a grammar. The keys of the table denote
 nonterminal symbols. The leading rule is
 given by t[1] if that is a pattern and
 by t[ t[1] ] otherwise.]] 
The pattern  P(0)  matches an empty string, while  P(-1)  only matches at the end of a string.

Here are some other pattern-creating functions.

R (r1, ... ) --[[
   the arguments are 2-character strings
   interpreted as ranges of characters.]]

So, for example,  R "ad"  is equivalent to
(P "a") + (P "b") +
    (P "c") + (P "d")
The function  S  converts a string to a pattern that matches any single character occurring in the string.
     vowel = S "aeiouy"
The notation can be greatly simplified by using the fact that the arithmetic operations will coerce all their other arguments to patterns, using  P  once they know that one of them is a pattern. So if  p  is a pattern, then  (p + 1)^1  is equivalent to
(p + P (1))^1
We can read this as look for  p  and if that fails skip one character and try again, repeating indefinitely. In other words, we can write:
     somewhere = \ (p) => (p + 1)^1 end -- function
so that  somewhere (P "a rose blooms")  matches any string in which  "a rose blooms"  occurs as a substring. It is important to understand that patterns in the  lpeg  library, unlike those in the  string  library, are anchored to a fixed starting position.

Inside a table defining a grammar, the expression  V (k)  denotes the pattern defining the nonterminal symbol  k . This is not evaluated until the pattern given by the grammar is matched. Grammars provide a way of defining patterns recursively.

captures

When a pattern is matched against a string, the pointer sweeps through the string, and one can demand of the pattern that various pieces of information be picked up, processed and returned as a capture. The two basic sorts of information one might want correspond to the questions where? and what?. For the former the pattern

                Cp ( )
matches an empty string and returns the pointer value as a capture. For example
      do
       local P, Cp in lpeg
       s = "To lave the feet of the Zatcoon"
       somewhere = \ (x) => (x+1)^1 end
       p = somewhere (Cp ( ) * "Zatcoon")
       print (p:match (s)) --> 25
      end

For the latter the function
                 C
decorates a pattern by making it capture the matching substring.
p = 16 * C (7) * "?"
print (p:match "The feet of the Zatcoon?") --> Zatcoon

There is actually a third basic kind of datum - something ad hoc that does not necessarily have any connection with the string. The pattern
                 Cc (x)
matches an empty string and returns the value  x  as capture.

Then we have functions which simply process the captures of a pattern. If  p  is a pattern and  f  is a function, the expression  p / f  is the pattern modified so that its captures are obtained as the values returned when the function is applied to the original captures.

p = (16 * C (7) * "?") / string.upper
print (p:match "The feet of the Zatcoon?") --> ZATCOON

Alternatively  f  can be a table, so long as there is just one capture. We saw in chapter 5 that tables can be thought of as functions of a single variable. Or,  f  can be a string with marks  %1 ,  %2 , .... .

The function  Ct  replaces all the captures of a pattern by a table containing those captures.

The function  Cs  replaces a pattern by substituting its nested captures into the matching substrings of the string it is applied to.

The function  Ca  creates an accumulator capture. Its argument should produce at least one capture, which becomes the initial value of the accumulator. Any subsequent function captures are called with the accumulator as first argument, followed by any other arguments provided, and the returned value becomes the new value of the accumulator. The final value of the accumulator is the single value returned as the capture when the resulting pattern matches.


   local R, Ca in lpeg
   local number = R "09"^1 / tonumber
   local add = \ (acc, x) => acc + x end
   local sum = Ca (number * ("," * number / add)^0)
   print (sum:match "1,4,9,16,25") --> 55

For the full details see the Lpeg manual.


back contents