Chapter 9

patterns

Lua's string library has powerful functions for matching patterns and capturing the parts of a string that match them. To use them you must learn Lua's notation for patterns. The expression

            i, j = s:find (p, n)
will make i the start and j the end positions of the first instance of a pattern p in the string s, searching from the n-th position onwards, if there is such an instance. Otherwise i will have the value nil. If the second argument n is omitted it is taken by default to be 1. Negative integers can be used to count from the end of the string, so that -1 refers to the last position, -2 the next to last and so on.
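
As a small sketch of the behaviour just described, written in standard Lua syntax and using an illustrative string of our own choosing:

```lua
local s = "banana"

-- n omitted, so the search starts at position 1.
local i, j = s:find("an")          -- i = 2, j = 3

-- Searching from position 3 skips the first instance.
local i2, j2 = s:find("an", 3)     -- i2 = 4, j2 = 5

-- A negative n counts from the end: -2 is the next to last position.
local i3, j3 = s:find("na", -2)    -- i3 = 5, j3 = 6

-- No instance at all: the first result is nil.
local i4 = s:find("xyz")           -- i4 = nil

print(i, j, i2, j2, i3, j3, i4)
```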

A pattern is simply a string. The following characters play a special role for patterns and are known as magic characters.

  (  )  .  %  +  -  *  ?  [  ]  ^  $
Non-magic characters just stand for themselves. At the beginning of a pattern the character ^ stands for start of string, and at the end of a pattern $ stands for end of string. For example the pattern "man$" matches any string ending in man .


        i, j = ("woman"):find ("man$")
        print(i, " ", j) --> 3 5
The character . stands for any single character. So the pattern "^...$" matches all strings of length 3. The magic character % plays the role of an escape character; this means that single magic characters are matched by the patterns got by prefixing the magic character with % . For example a final dollar sign in a string is matched by "%$$" . The following combinations have special meanings as character classes:


    %a  any letter of the alphabet
    %A  any character not a letter of the alphabet
    %c  any control character
    %C  any character not a control character
    %d  any digit
    %D  any character not a digit
    %l  any lowercase letter
    %L  any character not a lowercase letter
    %p  any punctuation symbol
    %P  any character not a punctuation symbol
    %s  any white space character
    %S  any character not a white space character
    %u  any uppercase letter
    %U  any character not an uppercase letter
    %w  any alphanumeric character
    %W  any non-alphanumeric character
    %x  any hexadecimal digit
    %X  any character not a hexadecimal digit
    %z  the character with ASCII code 0
    %Z  any character except ASCII NUL

You can create your own character classes using square brackets. For example "[%w_]" matches any alphanumeric character or underscore. We could express the pattern "%d" as "[0123456789]" or as "[0-9]" . Similarly we could express "%a" as "[a-zA-Z]" . Expressions for patterns are far from unique. If the character ^ follows immediately after the opening square bracket then we get the complementary character class. For example "[^%.]" matches all characters except a full stop.
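
The following sketch tries out the anchors, the escape character and some character classes with find. The test strings are made up for illustration, and the code is standard Lua.

```lua
-- "^...$" matches exactly the strings of length 3.
local a1, a2 = ("abc"):find("^...$")       -- 1, 3
local b1 = ("abcd"):find("^...$")          -- nil

-- "%$$" : an escaped dollar sign, anchored to the end of the string.
local c1 = ("cost in $"):find("%$$")       -- 9

-- "%d" and "[0-9]" find the same first digit.
local d1 = ("room 42"):find("%d")          -- 6
local d2 = ("room 42"):find("[0-9]")       -- 6

-- "[%w_]" : first alphanumeric character or underscore.
local e1 = ("- _x"):find("[%w_]")          -- 3

-- "[^%.]" : first character that is not a full stop.
local f1 = ("...a"):find("[^%.]")          -- 4

print(a1, a2, b1, c1, d1, d2, e1, f1)
```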

The magic characters + - * and ? are used as modifiers after a character class, with the following meanings:

       +  longest sequence of at least one
       -  shortest sequence of zero or more
       *  longest sequence of zero or more
       ?  zero or one
So "^%d*" matches the longest initial sequence of digits in a string, including an empty sequence.
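
To see the difference between the modifiers, here is a sketch using the standard string method match, which returns the matching substring itself rather than its positions. The example strings are our own.

```lua
local s = "<b>bold</b>"

-- * takes the longest sequence, - the shortest.
local greedy = s:match("<.*>")              -- "<b>bold</b>"
local lazy   = s:match("<.->")              -- "<b>"

-- "^%d*" succeeds even when there are no initial digits.
local digits = ("2007abc"):match("^%d*")    -- "2007"
local empty  = ("abc"):match("^%d*")        -- ""

-- + needs at least one character of the class.
local none   = ("abc"):match("^%d+")        -- nil

-- ? makes the preceding character optional.
local _, us  = ("color"):find("colou?r")    -- us = 5
local _, uk  = ("colour"):find("colou?r")   -- uk = 6

print(greedy, lazy, digits, empty, none, us, uk)
```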

The expression %bxy matches balanced strings that start with the character x and end with y. Typical uses are %b(), %b<>, %b%% and so on.
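
A short sketch of balanced matching, on a made-up expression string, in standard Lua:

```lua
local s = "f(a, g(b, c)) + h(d)"

-- %b() starts at the first "(" and stops at the ")" that balances it,
-- so the nested parentheses of g(b, c) are included.
local first = s:match("%b()")        -- "(a, g(b, c))"

-- An ordinary lazy pattern stops at the first ")" it meets.
local naive = s:match("%(.-%)")      -- "(a, g(b, c)"

print(first, naive)
```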

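The expression %f followed by a character class specified by square brackets, as in %f[%a], matches the empty string at a frontier: a position where the previous character is not in the class and the next character is in the class. The start and end of the string count as the character with code 0 for this purpose.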
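
One common use of frontiers is to find a whole word. The sketch below, with an illustrative string of our own, looks for the word he and ignores the he inside the and she.

```lua
local s = "the he she"

-- A plain search finds the "he" inside "the".
local p1, p2 = s:find("he")                    -- 2, 3

-- %f[%a] demands a letter preceded by a non-letter;
-- %f[%A] demands a non-letter preceded by a letter.
local w1, w2 = s:find("%f[%a]he%f[%A]")        -- 5, 6

print(p1, p2, w1, w2)
```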

captures

Usually we want not only to find whether a pattern is matched in a string, but also, for each instance of a match, what the actual matching substrings are for some parts of the pattern. These substrings are called captures and we can indicate which parts of a pattern we wish to capture by enclosing them in parentheses, the only magic characters we have so far not explained. The captures are ordered by the position of the opening parenthesis. They can, of course, be nested. The find method returns the captures as extra results after the two indices, if it is successful.

    s = [[url="http://www.lua.org"]]
    i, j, adr = s:find [[url="(.*)"]]
    print(i, " ", j, " ", adr) --> 1 24 http://www.lua.org
You can use the captures in the pattern itself. The expression %n, where n is a single digit from 1 to 9, matches a substring equal to the n-th capture. So, for example


         quoted = [[(["'])(.-)%1]] 
matches quoted strings. The first capture is the quote character, the second is the content of the string.
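
Here is a sketch of this pattern at work on a string of our own choosing. Because %1 must equal the opening quote, the apostrophe inside the double quotes does not end the match.

```lua
local quoted = [[(["'])(.-)%1]]
local s = [[say "it's here" now]]

-- find returns the two indices followed by the captures.
local i, j, quote, content = s:find(quoted)

print(i, j, quote, content)   -- i = 5, j = 15, quote = '"', content = "it's here"
```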

How do we cope with lots of matches? The iterator method gmatch  is used as follows:

  for x, y, ... in s:gmatch (pat) do body end -- for
The list of variables should correspond to the captures specified in the pattern. If no captures are specified in pat then only one iteration variable is needed and its values are the substrings that match the whole of pat .
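
A sketch of both cases, with and without captures, using made-up strings and standard Lua syntax:

```lua
-- Two captures, so two iteration variables.
local settings = {}
for key, value in ("width=80, height=24"):gmatch("(%a+)=(%d+)") do
  settings[key] = tonumber(value)
end

-- No captures, so one variable that takes each whole match in turn.
local words = {}
for w in ("one two  three"):gmatch("%a+") do
  words[#words + 1] = w
end

print(settings.width, settings.height, table.concat(words, ","))
```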

global substitution

The string method gsub is very powerful. In its simplest form

      s_new = s_old:gsub (pattern, replacement)

it searches the string s_old for matches to pattern and returns a new string s_new got by substituting each match by the replacement string. This can contain captures %1, %2 and so on if pattern specifies them. As a second result gsub returns the number of substitutions made. A more powerful form is where replacement is a function. The arguments to it are the captures and the returned value is used as the replacement string for that particular match.
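
The three sketches below show a plain replacement string, a replacement string that uses captures, and a replacement function. The strings and the doubling function are our own illustrations.

```lua
-- Plain replacement; gsub also returns the number of substitutions.
local s1, count = ("hello world"):gsub("o", "0")   -- "hell0 w0rld", 2

-- Captures in the replacement string.
local s2 = ("key=value"):gsub("(%w+)=(%w+)", "%2=%1")   -- "value=key"

-- Replacement function: with no captures in the pattern it receives
-- the whole match, and its result replaces that match.
local function double(n)
  return tostring(2 * tonumber(n))
end
local s3 = ("3 apples, 10 pears"):gsub("%d+", double)   -- "6 apples, 20 pears"

print(s1, count, s2, s3)
```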

If file evaluates to the canonicalized name of a file that is not in the root directory of a filing system, we can obtain the directory it lies in as

           file:gsub ("%.[^%.]*$","")
because RISC OS uses the full stop as its directory separator, and we have simply removed from the file's name everything after and including the last full stop in it.
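
As a sketch, with a hypothetical RISC OS pathname invented for the purpose:

```lua
-- The pathname is purely illustrative.
local file = "ADFS::HardDisc4.$.Music.Albums.track01"

-- gsub returns the new string and the number of substitutions.
local dir, n = file:gsub("%.[^%.]*$", "")

-- Wrapping the call in parentheses keeps only the first result,
-- which is useful when passing it straight to another function.
local dir_only = (file:gsub("%.[^%.]*$", ""))

print(dir, n)   -- ADFS::HardDisc4.$.Music.Albums   1
```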

MP3 example

Here is a short example program that pulls together lots of different ideas.

#! lua
-- Gavin Wraith (12/10/2007)
-- add /mp3 suffix to files of type 0x1ad
-- that do not already end in that suffix
-- in the arg[1] directory and all its subdirectories

local sep, suffix, sufpat = ".", "/mp3", "/[mM][pP]3$"
local warn = "Need a directory or an mp3 file, please"
local cmd = "rename %s %s"
local ampeg, dirtype = 0x1ad, 0x1000
local dir, filetype in riscos
local execute in os

local cmdlist = {}
local scan, addsuffix, action

addsuffix = \ (file)
  if not file:match (sufpat) then
    cmdlist[1 + #cmdlist] = cmd:format (file, file..suffix)
  end -- if
end -- function

scan = \ (d)
  for leaf, kind in dir (d) do
    local func = action[kind]
    if func then func (d..sep..leaf) end -- if
  end -- for
end -- function

action = { [ampeg] = addsuffix; [dirtype] = scan; }

--- main program
local d = arg[1]
local func = action[filetype (d)]
if func then func (d) else error (warn) end -- if
for _, rename in ipairs (cmdlist) do
 execute (rename)
end -- for

The first line's comment is just there so that StrongED knows to use the Lua mode when it comes to editing. Then come comments which give the author and date and what the program is supposed to do. We start by making some definitions. The pattern sufpat will match names ending in any of the four possibilities /mp3, /Mp3, /mP3, /MP3 . As is usually the case with RiscLua programs, it is easier to get at the meaning by reading them backwards.
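
The behaviour of sufpat can be checked in isolation. The sketch below uses standard Lua and a few invented leaf names.

```lua
local sufpat = "/[mM][pP]3$"

-- Illustrative names only; the $ anchor means the suffix must come last.
local names = { "song/mp3", "song/MP3", "song/mp3/old", "song" }

local has_suffix = {}
for _, name in ipairs(names) do
  has_suffix[name] = name:match(sufpat) ~= nil
end

for _, name in ipairs(names) do
  print(name, has_suffix[name])
end
```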

The comment -- main program shows where to start. We define a local variable d to be the value of arg[1] . Why bother? Why not just use arg[1] ? Because we want to avoid evaluating this expression more than once. Note that d is used twice, once in the next line, and again in the line after that. The next line defines a value func which is the value of the action table at the filetype of the filer object given by arg[1] . We see that the action table has two keys, one for MP3 files and another for directories. The idea is that func will be a function for dealing with the requisite filetype: if an MP3 file is given, then we want to use the function addsuffix and if it is a directory we want to use scan . In other words, rather than using conditional tests with lots of if statements, we prefer to use a jump table. A single table lookup replaces a chain of comparisons, and the table is easier to extend if we need to cater for more cases. This technique should be used wherever you would use a switch statement if you were programming in C or a CASE statement if you were programming in BBC Basic.

But what happens if the value given by d is neither an MP3 file nor a directory? In that case func will not be defined, i.e. it will be nil . So, in the next line, we apply func if it is defined and raise an error with the warning message otherwise. Finally, we execute all the renaming commands which have been gathered into the list cmdlist by the calls to the addsuffix function.
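
The jump table idea does not depend on RISC OS. Here is a minimal sketch in standard Lua, in which the kinds "file" and "dir" and the two describe functions are hypothetical stand-ins for the filetypes and actions of the program above.

```lua
-- Hypothetical actions, one per kind of object.
local function describe_file(name) return "file " .. name end
local function describe_dir(name) return "directory " .. name end

-- The jump table: keys are kinds, values are the functions to apply.
local action = { file = describe_file, dir = describe_dir }

local function handle(kind, name)
  local func = action[kind]         -- nil if the kind has no entry
  if func then
    return func(name)
  end
  return nil, "nothing to do for kind " .. tostring(kind)
end

print(handle("file", "track01"))    -- file track01
print(handle("socket", "x"))        -- nil   nothing to do for kind socket
```

Catering for a new kind means adding one entry to the table; handle itself does not change.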

An important point is that the three variables scan, addsuffix, action are all declared local before they are defined. That is because their definitions are mutually recursive; scan depends on action and action depends on scan and addsuffix . This arrangement can be taken as a template for scanning directories recursively. The local variable dir has the value riscos.dir which gives iteration over directories (this is particular to RiscLua - it is not standard Lua).
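
The same template can be tried in standard Lua by letting a nested table stand in for a directory tree. In this sketch the tree, the names in it and the function addleaf are all hypothetical; a table plays the part of a directory and the value true plays the part of a file.

```lua
-- Illustrative "filing system": tables are directories, true marks a file.
local tree = {
  music  = { ["a/mp3"] = true, live = { ["b/mp3"] = true } },
  readme = true,
}

local found = {}
local scan, addleaf, action   -- declared before any of them is defined

addleaf = function (path)
  found[#found + 1] = path
end

scan = function (path, node)
  for name, child in pairs(node) do
    local func = action[type(child)]   -- action is already in scope here
    if func then func(path .. "." .. name, child) end
  end
end

action = { boolean = addleaf, table = scan }

scan("$", tree)
table.sort(found)   -- pairs visits keys in no particular order
print(table.concat(found, "\n"))
```

Had scan been written before the local declaration of action, the name action inside it would have referred to a global variable instead, and the lookup would have failed at run time.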

