Patterns

When we consider a character in a string we have two pieces of information: the character's ASCII code and the index of its position. In short, what and where. In our case that amounts to 256 times #text bits, so the number of statements we can make about the string is two to the power of that. This can be a very large number. However, we can look at some useful families of statements.
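To get a feel for the size of that number, here is a minimal Python sketch of the count described above. The function name statement_count and the parameter alphabet_size are hypothetical, introduced only for illustration.

```python
def statement_count(text, alphabet_size=256):
    """Number of distinct statements about `text`, following the count above.

    There is one bit per (character code, position) pair, so
    alphabet_size * len(text) bits in all, and two to that power statements.
    """
    bits = alphabet_size * len(text)
    return 2 ** bits


if __name__ == "__main__":
    # Even a five-character text gives a number with hundreds of digits.
    n = statement_count("hello")
    print(len(str(n)), "decimal digits")
```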

Properties of characters

An array of 256 Boolean values determines a character class: for example, the upper-case characters, the control characters, and so on. Given a character class and an index i we may ask whether the i-th character belongs to that class. So each class determines a pattern which succeeds on the longest prefix of the text whose characters belong to that class.
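As a rough illustration, a character class can be sketched in Python as a list of 256 Booleans indexed by character code. The names make_class, in_class and match_class are hypothetical. The sketch assumes that every character used to build a class has a code below 256, that the result is a pair (success, new position), and that an empty prefix still counts as success.

```python
import string


def make_class(chars):
    """Build a character class: a list of 256 Booleans indexed by character code."""
    table = [False] * 256
    for c in chars:
        table[ord(c)] = True  # assumes ord(c) < 256
    return table


def in_class(table, text, i):
    """Does the i-th character of text belong to the class?"""
    code = ord(text[i])
    return code < 256 and table[code]


def match_class(table, text, pos=0):
    """Succeed on the longest run of class members starting at pos.

    Returns (True, new_pos); the run may be empty (an assumption of this sketch).
    """
    end = pos
    while end < len(text) and in_class(table, text, end):
        end += 1
    return True, end


UPPER = make_class(string.ascii_uppercase)

if __name__ == "__main__":
    print(in_class(UPPER, "ABCdef", 0))
    print(match_class(UPPER, "ABCdef"))
```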

The largest character class, which contains every character, is often denoted by a dot.

The notation

      [ABC... ]
is often used for the union of the character classes A, B, C, ... and

      [^ABC... ]
for the negation of that union.
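With classes represented as tables of 256 Booleans, union and negation become element-wise operations. The following Python sketch is one way to write them; make_class, union, negate and ANY are hypothetical names, and ANY stands for the class usually written as a dot.

```python
def make_class(chars):
    """A character class as a list of 256 Booleans (assumes codes below 256)."""
    table = [False] * 256
    for c in chars:
        table[ord(c)] = True
    return table


def union(*classes):
    """[ABC...]: a code belongs to the union if it belongs to any of the classes."""
    return [any(cls[i] for cls in classes) for i in range(256)]


def negate(cls):
    """Complement of a class: membership is flipped for every code."""
    return [not member for member in cls]


ANY = [True] * 256  # the largest class, often denoted by a dot

UPPER = make_class("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGIT = make_class("0123456789")
ALNUM = union(UPPER, DIGIT)   # plays the role of [ABC...]
NOT_ALNUM = negate(ALNUM)     # plays the role of [^ABC...]

if __name__ == "__main__":
    print(ALNUM[ord("7")], ALNUM[ord("q")], NOT_ALNUM[ord("q")])
```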

String Patterns

If s is a string, we have the pattern that matches only those texts that have s as a prefix.

  match (P(s), text)  --> true, text + #s   if text starts with s
                      --> false             otherwise
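The rule above can be sketched in Python as follows. This is only an illustration: the text pointer is modelled as an integer index pos into the text, so text + #s becomes pos + len(s), and the choice to return the unchanged position on failure is an assumption of the sketch.

```python
def P(s):
    """Pattern that matches only texts having s as a prefix at the current position."""
    def match(text, pos=0):
        if text.startswith(s, pos):
            return True, pos + len(s)   # success: pointer moves past s
        return False, pos               # failure: pointer left where it was
    return match


if __name__ == "__main__":
    hello = P("hello")
    print(hello("hello world"))
    print(hello("goodbye"))
```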
A frequent abuse of notation is to call P(s) simply s. More important than individual patterns is how they can be combined.

Sequence

We can combine patterns in a sequence, each one starting where the previous one left off. This is often thought of as multiplication, so notations like A*B or A B are often used. Sequencing is associative (but not commutative). If A fails then A*B fails. If A succeeds and

     match (A, text) = true, text'
then

     match (A*B, text) = match (B, text')
So if B fails, A*B fails too, but the text pointer is still updated to the first character after the prefix on which A succeeds.
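A minimal Python sketch of sequencing, under the assumption that a pattern is a function from (text, pos) to a pair (success, new position). The names P and seq are hypothetical. Note how a failure of B is reported at the position where A left off, as described above.

```python
def P(s):
    """String pattern: succeeds if text has s at position pos."""
    def match(text, pos=0):
        if text.startswith(s, pos):
            return True, pos + len(s)
        return False, pos
    return match


def seq(a, b):
    """A*B: run A, then run B from where A stopped."""
    def match(text, pos=0):
        ok, mid = a(text, pos)
        if not ok:
            return ok, mid      # A fails, so A*B fails
        return b(text, mid)     # match(A*B, text) = match(B, text')
    return match


if __name__ == "__main__":
    ab_cd = seq(P("ab"), P("cd"))
    print(ab_cd("abcd"))   # both parts succeed
    print(ab_cd("abxx"))   # B fails, pointer sits after the "ab" matched by A
    print(ab_cd("xxcd"))   # A fails, pointer unchanged
```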

Repetition

If n is a non-negative integer then A^n matches when a sequence of at least n A's matches, and A^-n matches when a sequence of at most n A's matches. The standard notations are

     A*       <-->    A^0
     A+       <-->    A^1
     A?       <-->    A^-1
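The two forms of repetition can be sketched in Python as below. The names at_least, at_most, star, plus and opt are hypothetical. The sketch assumes greedy matching without backtracking, assumes that a pattern which succeeds without consuming anything can repeat as often as needed, and on failure leaves the pointer after the last successful A, by analogy with sequencing; none of these choices is fixed by the text.

```python
def P(s):
    """String pattern: succeeds if text has s at position pos."""
    def match(text, pos=0):
        if text.startswith(s, pos):
            return True, pos + len(s)
        return False, pos
    return match


def at_least(a, n):
    """A^n: take as many A's as possible, succeed if there were at least n."""
    def match(text, pos=0):
        count = 0
        while True:
            ok, nxt = a(text, pos)
            if not ok:
                break
            if nxt == pos:
                # A matched the empty string: it could repeat forever,
                # so any lower bound is met (assumption of this sketch).
                return True, pos
            count += 1
            pos = nxt
        return count >= n, pos
    return match


def at_most(a, n):
    """A^-n: take up to n A's; zero A's is acceptable, so this always succeeds."""
    def match(text, pos=0):
        for _ in range(n):
            ok, nxt = a(text, pos)
            if not ok:
                break
            pos = nxt
        return True, pos
    return match


def star(a):   # A*  <-->  A^0
    return at_least(a, 0)


def plus(a):   # A+  <-->  A^1
    return at_least(a, 1)


def opt(a):    # A?  <-->  A^-1
    return at_most(a, 1)


if __name__ == "__main__":
    ab = P("ab")
    print(star(ab)("ababx"), plus(ab)("x"), opt(ab)("abab"))
```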

Negation

!A succeeds exactly when A fails. The pointer is not updated. Note that !!!A has the same effect as !A.
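Negation can be sketched in Python in the same style, again assuming that a pattern is a function from (text, pos) to (success, new position). The names P and neg are hypothetical. The last two lines of the usage example check the remark about triple negation, and show that a double negation succeeds whenever A does but consumes nothing.

```python
def P(s):
    """String pattern: succeeds if text has s at position pos."""
    def match(text, pos=0):
        if text.startswith(s, pos):
            return True, pos + len(s)
        return False, pos
    return match


def neg(a):
    """!A: succeed exactly when A fails; the pointer is never moved."""
    def match(text, pos=0):
        ok, _ = a(text, pos)
        return (not ok), pos
    return match


if __name__ == "__main__":
    ab = P("ab")
    print(neg(ab)("abc"))    # A succeeds, so !A fails
    print(neg(ab)("xyz"))    # A fails, so !A succeeds without moving
    for text in ("abc", "xyz"):
        print(neg(neg(neg(ab)))(text) == neg(ab)(text))   # !!!A behaves like !A
    print(neg(neg(ab))("abc"))   # !!A succeeds like A but consumes nothing
```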