Backtrack

If A and B are patterns the pattern A | B (also written A + B in Lpeg) first tries to match A and if that fails it tries B. If we suppose that the code for our patterns are subroutines then the code for A | B can be obtained by following the code for A with that of B, replacing the failure-exit of A with an instruction to reset the text-pointer back to its starting position and then dropping through to the code for B. A similar set of modifications apply to multiple alternatives

       A | B | C ......

with each one but the last dropping through, on failure, to the next, with resetting of the text-pointer.

In the same way, the code for !A may be got by swapping the succes and failure exits, and by adding an instruction to reset the text-pointer before the success exit.

In matching a pattern to a text one often wants to capture information. What sort of information will clearly depend strongly on the language in which one's pattern matching code is embedded. But in the simplest case one might consider patterns that always succeed, but which as a side effect push the current text pointer to a stack. From these pointer values we can then extract substrings from the text. For simplicity we would only use these capture patterns at the top level and outside repetitions.

If non-terminals correspond to subroutines, then a production in a grammar, of the form

    X = U  V  W ....

might be coded as

    X_out    LDR PC, [stack], #4
    X_enter  STR PC, [stack, #-4]!
             BL U_enter
             TST status
             BEQ X_out
             BL V_enter
             TST status
             BEQ X_out
             BL W_enter
             TST status
             BEQ X_out
             ...

so that the tree-structure of the algebra would be reflected by the call-structure of the code. The subroutines would not save the register TXT_PTR used for the current text-pointer. The status register is here represented as being zero for failure and non-zero for success. It might be more efficient to minimize the branch-with-link instructions and instead use inline code

    X_out    LDR PC, [stack], #4
    X_enter  STR PC, [stack, #-4]!
             <U's code>
             TST status
             BEQ X_out
             <V's code>
             TST status
             BEQ X_out
             <W's code>
             TST status
             BEQ X_out
             ...

The same considerations apply to productions of the form

    X = U | V | W ....

which might be coded as

    X_out    LDR PC, [stack], #4
    X_enter  STR PC, [stack, #-4]!
             MOV START, TXT_PTR
             BL U_enter
             TST status
             BNE X_out
             MOV TXT_PTR,START
             BL V_enter
             TST status
             BNE X_out
             MOV TXT_PTR,START
             BL W_enter
             TST status
             BNE X_out
             ...

Here START would be a register for local use, not used by the subroutines, for restoring the text-pointer after failure.

Conversely there may also be situations when it pays for the compiler to introduce more non-terminals in order to simplify complex expressions.

back

contents