Around PEGs with
Gavin Wraith
 
 

Finding text

Everybody will have had experience of trying to find things: in text or in html documents, in a file or a directory of files, on hard disc or on the world wide web. Usually one is looking for some explicit piece of text, a word or a phrase. If you are not quite sure of the exact spelling, or if you are seeking many items, it is sometimes convenient to be able to use wildcards: either a symbol that may stand for any single character or a symbol that may stand for any sequence of characters. Which symbols are used for these purposes depends on the software one is using. For example, RISC OS uses # and * respectively in filenames, whereas StrongED's search syntax uses . and *, and Edit uses . and .* for the same purposes. Unfortunately, once you start using some characters in this way (as magic characters, so the jargon has it) you have to devise special notations when you want to refer to these characters in their plain unmagical sense, and that leads to yet further complexity. The result is that search patterns are usually eye-wateringly unreadable, even when they stick to some widely adopted notation such as Perl Regular Expression syntax.

The standard facilities for searching let you use two sorts of wildcard, perhaps with a choice of whether the case of letters matters, and that is often the end of it. But sometimes you might want more. You may want not only to find things but also to do something with what you find. For example, what about the following?

  • (1) Match occurrences of prat that either follow dandi or are not followed by fall.
  • (2) Remove all the Javascript from an html page.
  • (3) Separate embedded program code from surrounding text and run it, so that the program manipulates the text containing it.

I think you will agree that these sorts of searching involve quite a lot more than just wildcards. This brings us to an important question of theory: we can describe in English all kinds of fanciful things we might want to search for, but which kinds can a computer be programmed to tackle? What can be programmed? What can be programmed efficiently? In an article, Matching and Catching, in Foundation RISC User CD number 15 (July 2004), I mentioned some of the history and theory behind pattern matching algorithms. In particular I described the seminal work of Stephen Kleene in the 1950s, which led to the notion of Regular Expressions (RegExps), which are used by most software today.

What I want to present in this article is a quite different approach, which is still experimental and not yet widely available (except for RISC OS users, with RiscLua version 4.12 or higher). It has two novel aspects: 1) the syntax is easier to understand, does not use magic characters and is more flexible; 2) it is based on Parsing Expression Grammars (PEGs) rather than on RegExps. It has recently been pointed out (by Bryan Ford) that although context-free grammars suit natural languages, where ambiguity is an essential feature, PEGs are much better suited to programming languages. PEGs can match a wider class of patterns than RegExps and they can be implemented by more efficient code. They are a formalization of Recursive Descent Parsers. No more about the theoretical background; if you need more, use Wikipedia. Instead, I will illustrate how PEGs can be used with some practical examples, using StrongED and RiscLua scripts. From version 4.12 RiscLua incorporates the lpeg library of Roberto Ierusalimschy. First we introduce lpeg's syntax for patterns and consider how patterns can be combined.

Arithmetic of Patterns

The lpeg library provides functions whose names start with the prefix lpeg. such as lpeg.P, lpeg.R, lpeg.S. We can avoid having to write these prefixes by using local variables. In Lua it is better to use local variables wherever possible. RiscLua has the extra facility that we can make statements like

local P,R,S in lpeg

so that P,R,S become local variables standing for lpeg.P,lpeg.R,lpeg.S. Such statements help to make code less verbose and more efficient. All our examples of code will presume that these and similar abbreviations have occurred previously.

The whole tangle of magic characters and unreadable syntax, which bedevils RegExp notation, arises from the elementary error of not distinguishing between different datatypes. For lpeg, patterns are objects in their own right, and are not strings. One of the recurring themes in the development of programming has been that trouble usually comes from conflating things that are not the same. Booleans are not really integers. Floating point numbers and whole numbers are different beasts. We are used to distinguishing between the number 123 and the string "123". So now we have to smarten up and recognize the difference between these and the patterns P(123) and P("123"). The former is a pattern that matches any string that is at least 123 characters long, and the latter is a pattern that matches any string that starts with the prefix "123". The function lpeg.P is an all-purpose converter of things into patterns. For example, P(true) is a pattern that matches anything, whereas P(false) matches nothing. P can convert functions and tables, too, if they are of an appropriate kind. It converts patterns to themselves. But what are patterns? That they are an abstract datatype is an answer that will probably not satisfy your curiosity. The best way to understand them is to see examples, understand how they can be combined, and see them in action in StrongED scripts.

Given a string s and a pattern p we say that p succeeds on s if s begins with a substring that matches p. Otherwise we say that p fails on s. The expression p:match(s) returns a non-nil value if p succeeds on s and nil if it fails. The non-nil value depends on what sort of pattern is being used, in particular on what the pattern captures. The default situation is that the non-nil value is one plus the index of the last character matched. In the case of literal string patterns like P "Poly", the value returned will be the index of the character following the matching prefix.

(P"Poly"):match "Polymath" --> 5
(P(2)):match "Polymath" --> 3
(P"Poly"):match "!Polymath" --> nil
(P""):match "!Polymath" --> 1

When an expression like p:match(s) is evaluated a pointer is set to the beginning of the string s and, as the matching process proceeds, this pointer may be moved forward. When a subprocess fails the pointer is not advanced.

There are two basic ways of combining patterns - alternation and concatenation. These are denoted by addition and multiplication, respectively. So, if x and y are patterns, the pattern x+y is one which tests x first, and if that fails then tests y. The pattern x*y tests x first and if that succeeds tests y at the new pointer value. We can also subtract patterns. The pattern x-y first tests y, and only if that fails does it test for x. The pattern -x only succeeds if x fails. The pattern -(-x), which can also be written as #x, behaves just like x, except that the pointer is not advanced.
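
To get a feel for these operations, here are a few quick checks in the same style as the earlier examples (the names a and b are chosen just for this illustration, and the abbreviations described earlier are assumed):

a, b = P "ab", P "abc"
(a + b):match "abcd" --> 3   a succeeds, so b is never tried
(b - a):match "abcd" --> nil a succeeds, so b - a fails
(#a*P(1)):match "abcd" --> 2 #a tests for "ab" without consuming it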

Example (1) in the previous section is answered by the pattern p given by

x, y, z = P "prat", P "dandi", P "fall"
p = (y*x) + (x*(-z))
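
With these definitions in place we would expect, for instance:

p:match "dandiprat" --> 10
p:match "prattle" --> 5
p:match "pratfall" --> nil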

Note that neither addition nor multiplication of patterns is commutative. We can also exponentiate patterns.

The pattern x^n will match at least n consecutive matches of x, while x^(-n) will match at most n consecutive matches of x, if n is not negative. For example, if we have X=P "X" and I = P "I" then X*I^0*X will match XX, XIX, XIIX, . . , whereas X*I^(-1)*X will only match XX and XIX.
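
A quick check, with X and I defined as in the text:

X, I = P "X", P "I"
(X*I^0*X):match "XIIX" --> 5
(X*I^(-1)*X):match "XIIX" --> nil  more than one I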

If one of the terms in an arithmetical expression is a pattern, then the other terms will be coerced to patterns by having the function lpeg.P applied to them. So if patt is a pattern that matches a class of single characters then 1-patt will match any character not in that class.

For example, the pattern csv_field, defined below, matches a field in a comma-separated value (CSV) record:

dq, newline, comma = P "\"", P "\n", P ","
quoted = dq*((1 - dq) + (dq*dq))^0*dq
unquoted = (1 - dq - newline - comma)^0
csv_field = quoted + unquoted

This is probably easier to read than the official CSV specification! A field consists either of quoted data or unquoted data. Quoted data is enclosed in double-quotes and cannot itself contain double-quotes except as consecutive pairs. Unquoted data may not contain double-quotes, newlines or commas.
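
For example (the first string uses single quotes as the Lua string delimiters, so that double-quotes can appear inside it; the doubled "" is the CSV convention for a literal double-quote):

csv_field:match '"he said ""hi"", twice"' --> 24
csv_field:match "plain text, more" --> 11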

For convenience the following functions also provide a way of creating a single-character pattern: lpeg.S takes a string as an argument and returns a pattern that matches any of its characters; lpeg.R takes strings denoting ranges as arguments and returns a pattern that matches any character in the ranges. For example, R("az","AZ") matches letters of the alphabet.
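
For example, a pattern for simple identifiers might be put together like this (letter, digit and ident are names invented just for the illustration):

letter, digit = R("az","AZ"), R "09"
ident = (letter + P "_")*(letter + digit + P "_")^0
ident:match "foo_42 bar" --> 7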

Note that patterns must match at the beginning of a string. If p is a pattern then (p+1)^1 will match p somewhere in the string, not necessarily at the beginning. We can read this as: try p, and if it fails skip one character and try again. Remember that p + 1 is coerced to p + P(1); so if p fails P(1) is tried, and this always succeeds (and bumps the pointer up by one character) provided there are more characters in the string. If n is a nonnegative integer then P(n) succeeds on strings with at least n characters, and P(-n) succeeds on strings with fewer than n characters.
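
To see the difference (patt is just an illustrative name):

patt = P "prat"
patt:match "a dandiprat" --> nil
((patt+1)^1):match "a dandiprat" --> 12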

Grammars

A grammar, from our point of view, is an alternative way of writing a set of equations for patterns. It is a table. Here is the CSV example above re-expressed as a grammar:

csv_field = P {
 "field";
 field = V "quoted" + V "unquoted";
 quoted = V "dq" * ((1 - V "dq") + (V "dq" * V "dq"))^0 * V "dq";
 unquoted = (1 - V "dq" - V "newline" - V "comma")^0;
 dq = P "\"";
 newline = P "\n";
 comma = P ",";
}

In the jargon of grammars the keys are nonterminal symbols. The nonterminal symbol which represents the whole grammar must come first, as a string, followed by the definitions of the nonterminal symbols. The function lpeg.V takes a string representing a nonterminal symbol as argument and (lazily) returns the corresponding pattern. Although this formulation may look clumsier, it has an important advantage: it permits recursive definitions of patterns. There is one restriction, however. A nonterminal must not be called recursively, when matching a string, until the pointer has moved to the right of the position it had when the nonterminal was last called. In other words, recursive occurrences of nonterminals must be guarded on the left by expressions which do not match an empty string. The possibility of recursive grammars means that PEGs can reach the parts that RegExps cannot. Because a finite state machine cannot record arbitrary depth of nesting, it is impossible to match correctly nested parentheses using RegExps alone; this is why most pattern-matching libraries calling themselves RegExp libraries in fact contain ad hoc additions to compensate for such limitations. By contrast, PEGs can cope with nesting without any fuss.

nested = \(text)
 local P,S,V in lpeg
 local nested_parens = P {
  "nested";
  nested = "("*((1 - S "()") + V "nested")^0*")";
 }
 => nested_parens:match(text)
 end -- function

The pattern nested_parens will match text in properly nested parentheses.
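
For example, we would expect:

print(nested "(a(b)c)d") --> 8
print(nested "(a(b) c d") --> nil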

Captures

A pattern can be used for more than simply matching. It can also capture information from the string to which it is applied, and that information can be returned when the match succeeds. This information could be, for example, the position in the string where the match succeeded, or the matched substring itself, or other data concocted from the captures already made by constituent subpatterns. You have to think of a pattern as something that can sweep along a string gathering up captures, and transforming them as it does so.

[Illustration: a pattern gathering captures as it sweeps along a string]

The function match returns the captures from a successful match (remember that in Lua a function can return multiple values). The lpeg library contains functions that invest a pattern with capturing capabilities.

The function lpeg.Cp takes no arguments. It returns a pattern which matches empty strings and which captures the current position.

do
 local P,Cp in lpeg
 local Arthur = P "Arthur"
 local patt = (Cp()*Arthur + 1)^1
 print(patt:match "King Arthur was a goodly king")
end    --> 6

The function lpeg.C takes a pattern as an argument. It returns the same pattern augmented with the ability to capture the matched substring. If its argument already has captures, they will be returned after it.

do
 local S,C in lpeg
 local roman = S "IVXLCDM"
 local patt = (C(roman^1)+1)^0
 print (patt:match "A XX-th century fox")
end --> XX

The function lpeg.Ct takes a pattern as an argument. It returns the same pattern but with modified capture behaviour; all the captures of its argument are put into a table with integer indices, starting at 1, and this table becomes the single capture.
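
For example, to collect every number in a string into a table (number, patt and t are names invented for this sketch):

do
 local R,C,Ct in lpeg
 local number = C(R("09")^1)
 local patt = Ct((number + 1)^0)
 local t = patt:match "10 green bottles, 9 green bottles"
 print(t[1], t[2])
end --> 10  9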

More generally, if patt is a pattern and f is a function then patt/f is the pattern obtained by replacing the captures of patt by the result of applying the function f to them. We call this a function capture. Alternatively f can be a table, in which case the first capture of patt is used as a key and the value of f at that key becomes the single capture. Or f can be a format string containing symbols %n with n between 1 and 9, standing for the match of the n-th capture in patt. The symbol %0 stands for the whole match.
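
For example, a format-string capture can be used to swap two words (word and swap are just illustrative names):

do
 local P,C in lpeg
 local word = C((1 - P " ")^1)
 local swap = (word*P " "*word)/"%2 %1"
 print(swap:match "King Arthur")
end --> Arthur King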

The function lpeg.Cc takes a value of some kind as an argument. It returns a pattern which matches empty strings and which captures that value. This is useful in conjunction with accumulator captures furnished by the function lpeg.Ca. The argument pattern should produce at least one captured value, which is taken as the initial value of an accumulator. The argument pattern may produce function captures; in that case the function is called with the accumulator as first argument (any other arguments being given by existing captures) and its return value becomes the new value of the accumulator. The final accumulator value is taken to be the single captured value.
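
As a sketch of how an accumulator capture might be used, following the description above (the details of lpeg.Ca have varied between versions of lpeg, so treat this as illustrative rather than definitive; number, add and sum are invented names), here is a pattern that adds up a comma-separated list of numbers:

do
 local R,Ca in lpeg
 local number = R("09")^1/tonumber   -- a function capture giving the numeral's value
 local add = \(acc,n) => acc + n end -- folds each new number into the accumulator
 local sum = Ca(number*(","*number/add)^0)
 print(sum:match "10,30,43")
end --> 83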

The function lpeg.Cs takes a pattern as an argument and invests it with a substitution capture. This captures the substring matched by the argument pattern, but with substitutions. For any capture inside the argument pattern, the substring matched by it is replaced by the capture value, which must be a string.

do
 local gsub = \(s, pat, replace) -- global substitution
  local P,Cs in lpeg
  pat = P(pat)
  pat = Cs((pat/replace + 1)^0)
  => pat:match(s)
 end -- function
 print(gsub("a seperate issue","seperat","separat")) --> a separate issue
end

The third argument to gsub can be a string, function or table.
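
For instance, a function used as the replacement works like this (double and digit are illustrative names; the function receives each matched digit and returns the string that replaces it):

do
 local R,Cs in lpeg
 local double = \(d) => d..d end -- returns its argument doubled
 local digit = R "09"/double
 print((Cs((digit + 1)^0)):match "a1b2")
end --> a11b22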

StrongED scripts

StrongED windows (version 4.67 and higher) have an apply icon. If a RiscLua program begins with the line

#! lua

then SHIFT-dragging the icon of the file containing it onto the apply icon of a StrongED window will cause the program to transform the contents of the window. To be more specific, the initial contents of the window are read in by the program as if they were in a file whose name is the value of the expression arg[1]. What the program outputs to stdout becomes the new content of the window.

CTRL-dragging to the apply icon will leave the window unchanged and open a new window for output. This is safer for debugging purposes.

A typical script that uses pattern matching will have the following stages:

  • read in the text to be processed;
  • construct the patterns, together with whatever captures are needed;
  • apply the top-level pattern to the text with match;
  • write out the results.

We can see these in scripts that deal with examples (2) and (3):

#! lua
-- noscript
-- remove scripting from html
-- GCW 29/04/07
do
 io.input(arg[1]);local text = io.read "*all";io.input()
 local write = \(x) io.write(x); => true end
 local P,C in lpeg
 local dq,sq,ket = P '"',P "'",P ">"
 local string = dq*(1-dq)^0*dq + sq*(1-sq)^0*sq
 local stop = P "</script>" + P "</SCRIPT>"
 local start = P "<script " + P "<SCRIPT "
 local js = start*(string+1-ket)^0*ket*(string+1-stop)^0*stop
 local pattern = (js + (C((1 - start)^1))/write)^0
 if not pattern:match(text) then print "FAIL" end -- if
end -- do

#! lua
-- untangle
-- separate the text and the code embedded in it
-- and run the code between the tags <lua> and </lua>
do
 io.input(arg[1]);local s = io.read "*all";io.input()
 local P,C,Ct in lpeg
 local concat in table
 local start,stop = P "<lua>", P "</lua>"
 local script,stk = start*C((1-stop)^0)*stop,{}
 local push = \(x) stk[1+#stk] = x end
 local nonscript = P(1)/push
 local codepat = Ct((script+nonscript)^0)
 local code = codepat:match(s)
 text = concat(stk)
 if code then
   assert(loadstring(concat(code)),"Bad code")()
 end -- if
end -- do

The global variable text will refer to the surrounding text. So embedding

<lua>print(text)</lua>

inside some text will mean that the untangle script will just strip out the program. But the embedded program can be a lot more complicated.

Here for example is text which can be printed
out<lua>print</lua> with double spacing.
<lua>((text:gsub("\n",</lua>Note that<lua>"\n\n")))</lua>
the program text can be split into as many pieces as you like.

Links

For a pictorial description of how to combine patterns see
http://www.wraith.plus.com/lua/lpeg.html .

For the reference manual for the lpeg library see
http://www.inf.puc-rio.br/~roberto/lpeg/ .