Chapter 9
Lua's string library has powerful functions for matching and capturing patterns. To use them you must learn Lua's notation for patterns. The expression
i, j = s:find (p, n)
will make i the start and j the end positions of the first instance of a pattern p in the string s, searching from the n-th position onwards, if there is such an instance. Otherwise both i and j will have the value nil. If the second argument n is omitted it is taken by default to be 1. Negative integers can be used to count from the end of the string, so that -1 refers to the last position, -2 the next to last and so on.
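As a small illustration of the optional starting position, here is a sketch using an invented subject string ("banana" does not come from the text above). It assumes only the standard string methods.

```lua
-- Illustrative subject string; any string would do.
local s = "banana"

-- With n omitted the search starts at position 1.
local i1, j1 = s:find("an")       -- i1 = 2, j1 = 3

-- Starting at position 3 skips the first instance.
local i2, j2 = s:find("an", 3)    -- i2 = 4, j2 = 5

-- A negative n counts from the end: -2 is position 5 here,
-- and the remaining "na" contains no instance of "an".
local i3, j3 = s:find("an", -2)   -- both nil

print(i1, j1, i2, j2, i3, j3)
```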
A pattern is simply a string. The following characters play a special role in patterns and are known as magic characters.
( ) . % + - * ? [ ] ^ $
Non-magic characters just stand for themselves. At the beginning of a pattern the character ^ stands for start of string, and at the end of a pattern $ stands for end of string. For example the pattern "man$" matches any string ending in man.
i, j = ("woman"):find ("man$")
print(i, " ", j) --> 3 5
The character . stands for any single character. So the pattern "^...$" matches all strings of length 3. The magic character % plays the role of an escape character; this means that a single magic character is matched by the pattern obtained by prefixing it with %. For example a final dollar sign in a string is matched by "%$$". The following combinations have special meanings as character classes:
%a any letter of the alphabet
%A any character not a letter of the alphabet
%c any control character
%C any character not a control character
%d any digit
%D any character not a digit
%p any punctuation symbol
%P any character not a punctuation symbol
%s any white space character
%S any character not a white space character
%u any uppercase letter
%U any character not an uppercase letter
%w any alphanumeric character
%W any non-alphanumeric character
%x any hexadecimal digit
%X any character not a hexadecimal digit
%z the character with ASCII code 0
%Z any character except ASCII NUL
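The following sketch tries out a few of these classes, together with the anchors and the % escape described earlier. The subject strings are invented for illustration; only the first result of find (the start position) is kept in most cases.

```lua
-- Example strings are invented for illustration.
local s = "room 101"

local d1, d2 = s:find("%d")              -- 6 6: the first digit
local w1, w2 = s:find("%s")              -- 5 5: the space
local n1, n2 = s:find("%A")              -- 5 5: the first non-letter
local dollar = ("price: $5"):find("%$")  -- 8: an escaped magic character
local anychar = ("a.b"):find(".")        -- 1: unescaped . matches any character
local fullstop = ("a.b"):find("%.")      -- 2: escaped . matches only a full stop
local three = ("abc"):find("^...$")      -- 1: exactly three characters
local four = ("abcd"):find("^...$")      -- nil: one character too many

print(d1, w1, n1, dollar, anychar, fullstop, three, four)
```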
You can create your own character classes using square brackets. For example "[%w_]" matches any alphanumeric character or underscore. We could express the pattern "%d" as "[0123456789]" or as "[0-9]". Similarly we could express "%a" as "[a-zA-Z]". Expressions for patterns are far from unique. If the character ^ follows immediately after the opening square bracket then we get the complementary character class. For example "[^%.]" matches any character except a full stop.
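To see bracketed classes and their complements in action, here is a short sketch. The subject string is made up for the purpose; each call keeps only the start position returned by find.

```lua
-- Invented subject string: positions are x=1 _=2 1=3 space=4 ... .=9
local s = "x_1 = 42."

local word_char = s:find("[%w_]")    -- 1: alphanumeric or underscore
local other     = s:find("[^%w_]")   -- 4: the first space
local digit_a   = s:find("[0-9]")    -- 3
local digit_b   = s:find("%d")       -- 3: same class, written differently
local stop      = s:find("[%.]")     -- 9: the full stop
local not_stop  = s:find("[^%.]")    -- 1: anything but a full stop

print(word_char, other, digit_a, digit_b, stop, not_stop)
```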
The magic characters + - * and ? are used as modifiers after a character class, with the following meanings:
+ longest sequence of at least one
- shortest sequence of zero or more
* longest sequence of zero or more
? zero or one
So "^%d*" matches the longest initial sequence of digits in a string, including an empty sequence.
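The difference between the longest and shortest modifiers is easiest to see side by side. This sketch uses the standard method match, which returns the matching substring rather than its positions; the subject strings are invented.

```lua
-- Invented subject string with two pairs of angle brackets.
local s = "<b>bold</b>"

local longest  = s:match("<.*>")             -- "<b>bold</b>": * takes as much as it can
local shortest = s:match("<.->")             -- "<b>": - takes as little as it can
local digits   = ("2007-10-12"):match("^%d*")  -- "2007"
local empty    = ("abc"):match("^%d*")       -- "": zero digits still counts as a match
local none     = ("abc"):match("^%d+")       -- nil: + needs at least one digit
local signed   = ("-12"):match("^%-?%d+")    -- "-12": optional escaped minus sign
local unsigned = ("12"):match("^%-?%d+")     -- "12"

print(longest, shortest, digits, empty, none, signed, unsigned)
```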
The expression %bxy matches balanced strings that start with the character x and end with y. Typical uses are
%b(), %b<>, %b%% and so on.
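A brief sketch of balanced matching, using invented subject strings. Note how the match extends to the closing character that balances the first opening one, not merely to the first closing character found.

```lua
-- Invented examples of nested delimiters.
local call = "f(a, g(b)) + h(c)"
local args = call:match("%b()")             -- "(a, g(b))"

local tags = ("x <a<b>> y"):match("%b<>")   -- "<a<b>>"

-- An unbalanced opening parenthesis is skipped over.
local later = ("((a) (b)"):match("%b()")    -- "(a)"

print(args, tags, later)
```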
The expression %f followed by a character class specified by square brackets matches a frontier between characters not in the class and characters in the class.
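One common use of a frontier is to find whole words only. The sketch below is an illustration with an invented subject string; it assumes that the Lua in use supports %f (it is documented from Lua 5.2 onwards) and that the start and end of the string count as non-members of the class.

```lua
-- Invented subject string: "cat" occurs alone and inside "concat".
local s = "cat concat cat"

-- A plain search from position 2 finds the "cat" inside "concat".
local p1, p2 = s:find("cat", 2)               -- 8 10

-- With frontiers on both sides only a whole word will do:
-- %f[%a] is a change from non-letter to letter,
-- %f[%A] is a change from letter to non-letter.
local w1, w2 = s:find("%f[%a]cat%f[%A]", 2)   -- 12 14

print(p1, p2, w1, w2)
```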
Usually we want not only to find whether a pattern is matched in a string, but also, for each instance of a match, what the actual matching substrings are for some parts of the pattern. These substrings are called captures and we can indicate which parts of a pattern we wish to capture by enclosing them in parentheses, the only magic characters we have so far not explained. The captures are ordered by the position of the opening parenthesis. They can, of course, be nested. The find method returns the captures as extra results after the two indices, if it is successful.
s = [[url="http://www.lua.org"]]
i, j, adr = s:find [[url="(.*)"]]
print(i, " ", j, " ", adr) --> 1 24 http://www.lua.org
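Here is a further sketch of captures with find, including a nested capture. The date string is invented for illustration. The two indices always refer to the whole match, and the captures follow in the order of their opening parentheses.

```lua
-- Invented date string in day/month/year form.
local date = "12/10/2007"

-- Three captures, returned after the two indices of the whole match.
local i, j, d, m, y = date:find("(%d+)/(%d+)/(%d+)")
-- i = 1, j = 10, d = "12", m = "10", y = "2007"

-- Nested captures: the outer parenthesis opens first, so it comes first.
local _, _, whole, year = date:find("(%d+/%d+/(%d+))")
-- whole = "12/10/2007", year = "2007"

print(i, j, d, m, y, whole, year)
```

Captures are strings, so d here is "12" and not the number 12.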
You can use the captures in the pattern itself. The expression %n, where n is a single digit from 1 to 9, matches a copy of the n-th capture. So, for example
quoted = [[(["'])(.-)%1]]
matches quoted strings. The first capture is the quote character, the second is the content of the string.
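To try the quoted pattern out, here is a sketch with two invented subject strings. Because %1 must repeat whichever quote character opened the string, a quote of the other kind inside the content is left alone.

```lua
local quoted = [[(["'])(.-)%1]]

-- Double-quoted text containing an apostrophe.
local s1 = [[say "it's fine" now]]
local i, j, q1, body1 = s1:find(quoted)
-- i = 5, j = 15, q1 = '"', body1 = "it's fine"

-- Single-quoted text containing double quotes.
local s2 = [[x = 'a "b" c']]
local _, _, q2, body2 = s2:find(quoted)
-- q2 = "'", body2 = 'a "b" c'

print(i, j, q1, body1, q2, body2)
```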
How do we cope with lots of matches? The iterator method gmatch is used as follows:
for x, y, ... in s:gmatch (pat) do body end -- for
The list of variables should correspond to the captures specified in the pattern. If no captures are specified in pat then only one iteration variable is needed and its values are the substrings that match the whole of pat.
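Both forms of the loop can be sketched as follows. The subject strings and the table names settings and words are invented for illustration.

```lua
-- Two captures, so two iteration variables.
local settings = {}
for key, value in ("width=10, height=20"):gmatch("(%w+)=(%w+)") do
  settings[key] = value          -- values are strings, e.g. "10"
end

-- No captures, so one variable holding each whole match.
local words = {}
for w in ("one two  three"):gmatch("%a+") do
  words[#words + 1] = w
end

print(settings.width, settings.height, #words, words[3])
```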
The string method gsub is very powerful. In its simplest form
s_new = s_old:gsub (pattern, replacement)
it searches the string s_old for matches to pattern and returns a new string s_new obtained by substituting each match by the replacement string. It also returns, as a second result, the number of substitutions made. The replacement string can contain captures %1, %2 and so on if pattern specifies them. A more powerful form is where replacement is a function. The arguments to it are the captures (or the whole match if pattern specifies no captures) and the returned value is used as the replacement string for that particular match. If the function returns nil or false, the match is left unchanged.
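The sketch below runs through a string replacement, a replacement using captures, and a function replacement. All subject strings are invented. Where only one variable is assigned, the count that gsub also returns is simply dropped.

```lua
-- Plain string replacement; gsub also returns the number of substitutions.
local s1, n1 = ("hello world"):gsub("o", "0")
-- s1 = "hell0 w0rld", n1 = 2

-- Captures in the replacement string.
local swapped = ("key=value"):gsub("(%w+)=(%w+)", "%2=%1")
-- swapped = "value=key"

-- A function as replacement: with no captures it receives the whole match.
local shout = ("hello world"):gsub("%a+", function (w) return w:upper() end)
-- shout = "HELLO WORLD"

-- Returning nil from the function leaves that match as it was.
local partial = ("a1 b2"):gsub("%d", function (d)
  if d == "1" then return "one" end
end)
-- partial = "aone b2"

print(s1, n1, swapped, shout, partial)
```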
If file evaluates to the canonicalized name of a file that is not in the root directory of a filing system, we can obtain the directory it lies in as
file:gsub ("%.[^%.]*$","")
because we have simply removed from the file's name everything after and including the last full stop in it.
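As a sketch of this in use, take a hypothetical RISC OS pathname (the path below is invented for illustration). Since gsub returns the number of substitutions as a second result, enclosing the call in parentheses keeps just the string.

```lua
-- Hypothetical canonicalized filename; the $ in it is ordinary text,
-- since magic characters only matter in the pattern.
local file = "ADFS::4.$.Music.Song"

-- Two results: the new string and the number of substitutions.
local dir, count = file:gsub("%.[^%.]*$", "")
-- dir = "ADFS::4.$.Music", count = 1

-- Parentheses around the call discard everything but the first result.
local only_dir = (file:gsub("%.[^%.]*$", ""))

print(dir, count, only_dir)
```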
Here is a short example program that pulls together lots of different
ideas.
#! lua
-- Gavin Wraith (12/10/2007)
-- add /mp3 suffix to files of type 0x1ad
-- that do not already end in that suffix
-- in the arg[1] directory and all its subdirectories
local sep, suffix, sufpat = ".", "/mp3", "/[mM][pP]3$"
local warn = "Need a directory or an mp3 file, please"
local cmd = "rename %s %s"
local ampeg, dirtype = 0x1ad, 0x1000
local dir, filetype in riscos
local execute in os
local cmdlist = {}
local scan, addsuffix, action
addsuffix = \ (file)
  if not file:match (sufpat) then
    cmdlist[1 + #cmdlist] = cmd:format (file, file..suffix)
  end -- if
end -- function
scan = \ (d)
  for leaf, kind in dir (d) do
    local func = action[kind]
    if func then func (d..sep..leaf) end -- if
  end -- for
end -- function
action = { [ampeg] = addsuffix; [dirtype] = scan; }
--- main program
local d = arg[1]
local func = action[filetype (d)]
if func then func (d) else error (warn) end -- if
for _, rename in ipairs (cmdlist) do
  execute (rename)
end -- for
The first line's comment is just there so that StrongED knows to use the Lua mode when it comes to editing. Then come comments which give the author and date and what the program is supposed to do. We start by making some definitions. The pattern sufpat will match names ending in any of the four possibilities /mp3, /Mp3, /mP3, /MP3. As is usually the case with RiscLua programs, it is easier to get at the meaning by reading them backwards.
The comment -- main program shows where to start. We define a local variable d to be the value of arg[1]. Why bother? Why not just use arg[1]? Because we want to avoid evaluating this expression more than once. Note that d is used twice, once in the next line, and again in the line after that. The next line defines a value func which is the value of the action table at the filetype of the filer object given by arg[1]. We see that the action table has two keys, one for MP3 files and another for directories.
The idea is that func will be a function for dealing with the requisite filetype: if an MP3 file is given, then we want to use the function addsuffix and if it is a directory we want to use scan. In other words, rather than using conditional tests with lots of if statements, we prefer to use a jump table. This replaces a chain of comparisons with a single table lookup, and is easier to extend if we need to cater for more cases. This technique can be used wherever you would use a switch statement if you were programming in C or a CASE statement if you were programming in BBC Basic. But what happens if the value given by d is neither an MP3 file nor a directory? In that case func will not be defined, i.e. it will be nil. So, in the next line, we apply func if it is defined and otherwise raise an error carrying the warning message.
Finally, we execute all the renaming commands which have been gathered into the list cmdlist by the calls to the addsuffix function.
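The jump table idea does not depend on RISC OS. Here is a minimal sketch in standard Lua syntax, with the result of type standing in for the filetype. The names describe, describe_number and describe_string are hypothetical and are not part of the program above.

```lua
-- Hypothetical handlers, one per case.
local function describe_number(n) return "number " .. n end
local function describe_string(s) return "string of length " .. #s end

-- The jump table: keys are the cases, values are the handlers.
local action = { number = describe_number, string = describe_string }

-- Look the handler up; a missing key gives nil, which we treat as "no case".
local function describe(v)
  local func = action[type(v)]
  if func then return func(v) end
  return nil, "no handler for " .. type(v)
end

print(describe(42))      --> number 42
print(describe("abc"))   --> string of length 3
print(describe(true))    --> nil     no handler for boolean
```

Adding a new case means adding one entry to the table; the lookup code stays as it is.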
An important point is that the three variables scan, addsuffix, action are all declared local before they are defined. That is because their definitions are mutually recursive; scan depends on action and action depends on scan and addsuffix. This arrangement can be taken as a template for scanning directories recursively. The local variable dir has the value riscos.dir which gives iteration over directories (this is particular to RiscLua - it is not standard Lua).
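For readers without RiscLua to hand, the same template can be sketched in standard Lua by replacing the filing system with an in-memory table. Everything here is an assumption made for illustration: the table tree, the kinds "mp3", "text" and "dir", and the function addfile are invented, and pairs stands in for riscos.dir.

```lua
-- Hypothetical tree: a table is a directory, a string is a file's kind.
local tree = {
  Music  = { Song1 = "mp3", Song2 = "mp3", Notes = "text" },
  Readme = "text",
}

local found = {}
local scan, addfile, action    -- declared before they are defined

addfile = function (path)
  found[#found + 1] = path
end

scan = function (path, node)
  for leaf, child in pairs(node) do
    local kind = type(child) == "table" and "dir" or child
    local func = action[kind]  -- refers to the local declared above
    if func then func(path .. "." .. leaf, child) end
  end
end

action = { mp3 = addfile, dir = scan }

scan("$", tree)
table.sort(found)              -- pairs gives no particular order
print(table.concat(found, " "))
```

Had scan been declared with its own local statement before action existed, the name action inside it would have referred to a global rather than to the local table.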