Regular Expressions and Recognizer Automata

Peter S. Housel

November 2001


Copyright

This is the Monday regular library, a set of facilities for representing regular expressions and constructing recognizer automata for them.

Copyright ©1997–2001 Peter S. Housel.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA


Introduction

Automata theory defines the concept of a regular expression, a formula describing a language called a regular set, recognizable by a finite-state automaton. Regular expressions provide a convenient way to describe lexical patterns, and are often used for search patterns in text editors and other text processing tools. They are also used to describe lexical tokens for computer languages.

This library provides a collection of object classes for representing regular expressions, and for constructing finite-state automata that recognize the languages they generate.

<Module Dylan-user>=

define library regular
  use common-dylan;
  export regular-expression;
end library;
          
define module regular-expression
  use common-dylan;
  <Exports from the regular-expression module>
end module;
          

Regular Expressions

Here we define a node structure for representing regular expressions.

<Exports from the regular-expression module>=

export
  <regular-expression>,
  copy-regular-expression;
          

<regular-expression>

[Primary Abstract Class]


Class for representing regular expressions.

Superclasses:

<object>

Init-keywords:

None.

Description:

The class of regular expressions.

Definition:

<Module regular-expression>=

define primary abstract class <regular-expression> (<object>)
  <Slots for >
end class;
              

copy-regular-expression

[Generic Function]


Returns a copy of a regular expression.

Signature

copy-regular-expression regexp1regexp2

Arguments:
regexp1
An instance of <regular-expression>.
Values:
regexp2
An instance of <regular-expression>.
Description:

Returns a (deep) copy of a given regular expression.

Definition:

<Module regular-expression>+=

define generic copy-regular-expression
    (regex1 :: <regular-expression>)
 => (regex2 :: <regular-expression>);
              

The rules for regular expressions define the following constructs:

The following classes represent these various regular expression types.

<Exports from the regular-expression module>+=

export
  <epsilon-regular-expression>,
  <symbol-regular-expression>,
  regular-expression-symbol,
  <symbol-set-regular-expression>,
  regular-expression-symbol-set,
  <union-regular-expression>,
  regular-expression-union1,
  regular-expression-union2,
  <concatenation-regular-expression>,
  regular-expression-head,
  regular-expression-tail,
  <closure-regular-expression>,
  regular-expression-enclosed,
  <accept-regular-expression>;
          

<epsilon-regular-expression>

[Class]


Class for representing epsilon regular expressions.

Superclasses:

<regular-expression>

Init-keywords:

None.

Description:

The class of epsilon regular expressions.

Definition:

<Module regular-expression>+=

define class <epsilon-regular-expression>
    (<regular-expression>)
  // no additional slots
end class;
              
define method copy-regular-expression
    (regex1 :: <epsilon-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<epsilon-regular-expression>);
end method;
              

<symbol-regular-expression>

[Class]


Class for representing symbol regular expressions.

Superclasses:

<regular-expression>

Init-keywords:
symbol:
An instance of <object>.
Description:

The class of regular expressions matching a single symbol.

Definition:

<Module regular-expression>+=

define class <symbol-regular-expression>
    (<regular-expression>)
  constant slot regular-expression-symbol :: <object>,
    required-init-keyword: symbol:;
end class;
              
define method copy-regular-expression
    (regex1 :: <symbol-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<symbol-regular-expression>,
         symbol: regex1.regular-expression-symbol);
end method;
              

<symbol-set-regular-expression>

[Class]


Class for representing regular expressions matching a set of distinct symbols.

Superclasses:

<regular-expression>

Init-keywords:
symbol-set:
An instance of <collection>.
Description:

The class of regular expressions matching a one of a set of distinct symbols.

Definition:

<Module regular-expression>+=

define class <symbol-set-regular-expression> (<regular-expression>)
  constant slot regular-expression-symbol-set :: <collection>,
    required-init-keyword: symbol-set:;
end class;
              
define method copy-regular-expression
    (regex1 :: <symbol-set-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<symbol-set-regular-expression>,
         symbol-set: regex1.regular-expression-symbol-set);
end method;
              

<union-regular-expression>

[Class]


Class for representing union regular expressions.

Superclasses:

<regular-expression>

Init-keywords:
union1:
An instance of <regular-expression>.
union2:
An instance of <regular-expression>.
Description:

The class of regular expressions matching one of two alternative subexpressions.

Definition:

<Module regular-expression>+=

define class <union-regular-expression>
    (<regular-expression>)
  constant slot regular-expression-union1,
    required-init-keyword: union1:;
  constant slot regular-expression-union2,
    required-init-keyword: union2:;
end class;
              
define method copy-regular-expression
    (regex1 :: <union-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<union-regular-expression>,
         union1: copy-regular-expression(regex1.regular-expression-union1),
         union2: copy-regular-expression(regex1.regular-expression-union2));
end method;
              

<concatenation-regular-expression>

[Class]


Class for representing concatenation regular expressions.

Superclasses:

<regular-expression>

Init-keywords:
head:
An instance of <regular-expression>.
tail:
An instance of <regular-expression>.
Description:

The class of regular expressions matching one regular expression followed by another.

Definition:

<Module regular-expression>+=

define class <concatenation-regular-expression>
    (<regular-expression>)
  constant slot regular-expression-head,
    required-init-keyword: head:;
  constant slot regular-expression-tail,
    required-init-keyword: tail:;
end class;
              
define method copy-regular-expression
    (regex1 :: <concatenation-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<concatenation-regular-expression>,
         head: copy-regular-expression(regex1.regular-expression-head),
         tail: copy-regular-expression(regex1.regular-expression-tail));
end method;
              

<closure-regular-expression>

[Class]


Class for representing Kleene closure regular expressions.

Superclasses:

<regular-expression>

Init-keywords:
of:
An instance of <object>.
Description:

The class of regular expressions matching the Kleene closure of a regular expression.

Definition:

<Module regular-expression>+=

define class <closure-regular-expression>
    (<regular-expression>)
  constant slot regular-expression-enclosed :: <object>,
    required-init-keyword: of:;
end class;
              
define method copy-regular-expression
    (regex1 :: <closure-regular-expression>)
 => (regex2 :: <regular-expression>);
  make(<closure-regular-expression>,
         of: copy-regular-expression(regex1.regular-expression-enclosed));
end method;
              

We also need an additional type of regular expression node to mark the end of a regular expression. Semantic actions associated with the matching of a regular expression can be attached to these nodes.

<accept-regular-expression>

[Open Class]


Class for representing a marker regular expression indicating that an expression has been matched.

Superclasses:

<regular-expression>

Init-keywords:

None.

Description:

The class of regular expressions indicating a successful match.

Definition:

<Module regular-expression>+=

define open class <accept-regular-expression>
    (<regular-expression>)
  // no additional slots
end class;
              

The Glushkov Automaton

One important purpose for building a representation of a regular expression is, of course, to build a deterministic state machine that recognizes it. One of the ways we can do this is to first construct a nondeterministic finite automaton (NFA) called a Glushkov automaton, and then convert this automaton to a deterministic finite automaton using the subset construction algorithm. This method is described in section 3.9 of .

The Glushkov automaton is constructed by treating each <symbol-regular-expression> (or <symbol-set-regular-expression>) node as a position in a path taken through the regular expression during the matching of a string. The states of the of the automaton correspond to sets of positions possible at that point, and its transitions correspond to paths between symbols that can be adjacent on a path through the expression. The construction algorithm makes use of a a function on regular expression nodes we'll call regular-expression-following-positions; this function in turn is constructed making use of the functions regular-expression-nullable? (which returns true for a regular expression if the empty string can match it), regular-expression-first-positions (which returns the set of symbol nodes whose characters can appear at the beginning of a matching string), and regular-expression-last-positions (which returns the set of leaf nodes whose characters can appear at the end of a matching string).

The nifty thing is that we can compute these functions while the <regular-expression> representation is being built, making a separate pass unnecessary. We'll define slots to store the function values in the base class, and compute the values within the initialize methods of the subclasses (effectively a depth-first traversal).

<Slots for >=

slot regular-expression-nullable? :: <boolean> = #f;
slot regular-expression-first-positions :: <list> = #();
slot regular-expression-last-positions :: <list> = #();
slot regular-expression-follow-positions :: <list> = #();
          

We'll go through the <regular-expression> subclasses one by one. The first type, for epsilon regular expressions, is the easiest because epsilon is trivially nullable, and the first and last sets are empty.

<Module regular-expression>+=

define sealed method initialize
   (instance :: <epsilon-regular-expression>, #next next-method, #key)
  next-method();
  instance.regular-expression-nullable? := #t;
end method initialize;
        

Symbol nodes and symbol set nodes are not nullable, and the only node that could match at the beginning or the end of such an expression is the node itself.

<Module regular-expression>+=

define sealed method initialize
    (instance :: <symbol-regular-expression>, #next next-method, #key)
  next-method();
  instance.regular-expression-first-positions
    := instance.regular-expression-last-positions
    := list(instance);
end method initialize;
        
define sealed method initialize
    (instance :: <symbol-set-regular-expression>,
     #next next-method, #key)
  next-method();
  instance.regular-expression-first-positions
    := instance.regular-expression-last-positions
    := list(instance);
end method initialize;
        

Accept nodes are the same as symbol nodes in this respect.

<Module regular-expression>+=

define method initialize
    (instance :: <accept-regular-expression>, #next next-method, #key)
  next-method();
  instance.regular-expression-first-positions
    := instance.regular-expression-last-positions
    := list(instance);
end method initialize;
        

Union nodes are nullable if either alternative is, and the first and last sets are the unions of the sets of the alternatives. Note that these are both unions of non-overlapping sets.

<Module regular-expression>+=

define sealed method initialize
    (instance :: <union-regular-expression>, #next next-method, #key)
  next-method();
  let union1 = instance.regular-expression-union1;
  let union2 = instance.regular-expression-union2;
  instance.regular-expression-nullable?
    := regular-expression-nullable?(union1)
         | regular-expression-nullable?(union2);
  instance.regular-expression-first-positions
    := concatenate(regular-expression-first-positions(union1),
                   regular-expression-first-positions(union2));
  instance.regular-expression-last-positions
    := concatenate(regular-expression-last-positions(union1),
                   regular-expression-last-positions(union2));
end method initialize;
        

Concatenation nodes are nullible only if both of the children are. The regular-expression-first-positions set of the concatenation node is the same as that of of its first child, unless the first child is nullable, in which case the set is the union of the two. The opposite is the case for the regular-expression-last-positions set.

<Module regular-expression>+=

define sealed method initialize
    (instance :: <concatenation-regular-expression>,
     #next next-method, #key)
  next-method();
  let re-head = instance.regular-expression-head;
  let re-tail = instance.regular-expression-tail;
  instance.regular-expression-nullable?
    := regular-expression-nullable?(re-head)
         & regular-expression-nullable?(re-tail);
  instance.regular-expression-first-positions
    := if(regular-expression-nullable?(re-head))
         concatenate(re-head.regular-expression-first-positions,
                     re-tail.regular-expression-first-positions);
       else
         re-head.regular-expression-first-positions;
       end if;
  instance.regular-expression-last-positions
    := if(regular-expression-nullable?(re-tail))
         concatenate(re-head.regular-expression-last-positions,
                     re-tail.regular-expression-last-positions);
       else
         re-tail.regular-expression-last-positions;
       end if;
  <Compute the regular-expression-follow-positions set for the children of a <concatenation-regular-expression>>
end method initialize;
        

For each of the nodes in the regular-expression-last-positions set of the first child of the concatenate-node, we know that its regular-expression-follow-positions set includes all of the nodes in the regular-expression-first-positions set of the second child. This is also a disjoint union.

<Compute the regular-expression-follow-positions set for the children of a <concatenation-regular-expression>>=

let followers :: <list> = re-tail.regular-expression-first-positions;
for (node in re-head.regular-expression-last-positions)
  node.regular-expression-follow-positions
    := concatenate(node.regular-expression-follow-positions, followers);
end for;
        

Lastly, we do the computations for closure nodes. Closure nodes are nullable, since they represent zero or more repetitions. The regular-expression-first-positions and regular-expression-last-positions sets are not affected by the repetition.

<Module regular-expression>+=

define method initialize
    (instance :: <closure-regular-expression>, #next next-method, #key)
  next-method();
  instance.regular-expression-nullable? := #t;
  instance.regular-expression-first-positions 
    := instance.regular-expression-enclosed.regular-expression-first-positions;
  instance.regular-expression-last-positions 
    := instance.regular-expression-enclosed.regular-expression-last-positions;
  <Compute the regular-expression-follow-positions set for the child of a <closure-regular-expression>>
end method;
        

Since the star-node is a repetition, then for each of the nodes in the regular-expression-last-positions set of the starred expression, its regular-expression-follow-positions set includes all of the nodes in the regular-expression-first-positions of the same expression. This is not a disjoint union.

<Compute the regular-expression-follow-positions set for the child of a <closure-regular-expression>>=

let enclosed = instance.regular-expression-enclosed;
let followers :: <list> = enclosed.regular-expression-first-positions;
for (node in enclosed.regular-expression-last-positions)
  node.regular-expression-follow-positions
    := union(node.regular-expression-follow-positions, followers, test: \==);
end for;
          

Deterministic Automata

For efficiently recognizing matches of a regular expression, we need to construct a corresponding deterministic automaton.

<Exports from the regular-expression module>+=

export
  <regular-expression-dfa-state>,
  regular-expression-dfa-state-transitions;
          

<regular-expression-dfa-state>

[Class]


Class of deterministic finite automata states constructed from a <regular-expression> object.

Superclasses:

<object>

Init-keywords:
transitions:
An instance of <collection>.
Description:

Describes states in a deterministic finite automaton constructed from a <regular-expression> object.

Definition:

<Module regular-expression>+=

define open class <regular-expression-dfa-state> (<object>)
  constant slot regular-expression-dfa-state-transitions
      :: <mutable-collection>,
    required-init-keyword: transitions:;
end class;
              

Constructing a DFA from the Glushkov Automaton

It's fairly straightforward to construct a DFA from the Glushkov automaton using the subset construction algorithm.

What we've built with our regular-expression-follow-positions function is a directed graph that amounts to a Nondeterministic Finite Automation (NFA) that recognizes the regular expression. All of the leaf nodes in the regular-expression-first-positions of the root of the regular expression node tree correspond to start states of the NFA. Each of the nodes in the regular-expression-follow-positions of a symbol or accept node correspond to transitions that can be made from each state.

The subset construction algorithm takes an NFA and constructs a corresponding DFA (Deterministic Finite Automation) by mapping each possible set of NFA states into a single DFA state. This theoretically could blow up into an exponential number of DFA states, but for most practical lexical analyzers, this usually doesn't happen.

<Exports from the regular-expression module>+=

export regular-expression-dfa;
          

regular-expression-dfa

[Function]


Returns a deterministic automaton corresponding to a regular expression.

Signature:

regular-expression-dfa regular-expression #key deterministic? transition-collection-class transition-collection-size state-classdfa num-dfa-states

Arguments:
regular-expression
An instance of <regular-expression>.
deterministic?
An instance of <boolean>. The default is #f.
transition-collection-class
An instance of subclass(<mutable-collection>). The default is <object-table>.
transition-collection-size
An instance of <integer>, or #f. The default is #f.
state-class
An instance of subclass(<regular-expression-dfa-state>). The default is <regular-expression-dfa-state>.
Values:
dfa
An instance of <regular-expression-dfa-state>.
num-dfa-states
An instance of <integer>.
Description:

Returns the start state of a deterministic finite automaton corresponding to regular-expression.

If deterministic? is true, then the automaton is required to be deterministic as defined in . If it is not, then an error is signalled.

The transition-collection-class is instantiated to initialize the regular-expression-dfa-state-transitions slot in each state. This allows regular expressions using <integer> symbols to use <vector< instances for transition tables, and for other types to use an <explicit-key-collection>.

Definition:

<Module regular-expression>+=

define function regular-expression-dfa
    (regular-expression :: <regular-expression>,
     #key deterministic? = #f,
          transition-collection-class = <object-table>,
          transition-collection-size :: false-or(<integer>) = #f,
          state-class :: subclass(<regular-expression-dfa-state>)
            = <regular-expression-dfa-state>)
 => (start-state :: <regular-expression-dfa-state>,
     num-dfa-states :: <integer>);
  let worklist :: <deque> = make(<deque>);
  <Define the states table and locate-state>

  let start-state
    = locate-state(regular-expression.regular-expression-first-positions);
  while(~empty?(worklist))
    <Set label and state from the worklist>
    <Process each position in label for state>
    <Finalize the transitions for state>
  end while;
  values(start-state, states.size);
end function;
              

Since we could end up with a fairly large number of states, we need to have a way of locating them quickly. Here we define a table class for DFA states, with the sets of corresponding positions in the Glushkov automaton as keys. Dylan allows us to do this by subclassing <table> and specializing the function table-protocol to return key equality and hash functions.

<Module regular-expression>+=

define class <set-table> (<table>)
  // No additional slots.
end class;
          
define method table-protocol(table :: <set-table>)
  => (test-function :: <function>, hash-function :: <function>);
  values(method(set1 :: <list>, set2 :: <list>)
            => (equivalent?);
           <Compare sets set1 and set2 for equality>
         end method,
         method(set :: <list>, initial-state :: <object>)
            => (hash-id :: <integer>, hash-state :: <object>);
           <Compute the hash value for set>
         end method);
end method;
          

Two sets are equal if they have the same number of elements and every element of the first set is also a member of the second set.

<Compare sets set1 and set2 for equality>=

set1.size = set2.size
  & every?(rcurry(member?, set2), set1);
          

The hash function simply combines the hash values of all of the individual items, insuring that their order within the sequence is irrelevant.

<Compute the hash value for set>=

if (empty?(set))
  values(0, initial-state)
else
  let (hash-id :: <integer>, hash-state :: <object>)
    = object-hash(first(set), initial-state);

  for(item in set, first? = #t then #f)
    if (~first?)
      let (item-hash-id :: <integer>,
           item-hash-state :: <object>)
        = object-hash(item, hash-state);
      hash-id := merge-hash-ids(hash-id, item-hash-id, ordered: #f);
      hash-state := item-hash-state;
    end if;
  end for;

  values(hash-id, hash-state);
end if;
          

Given this, we can define a table for looking up states and the local locate-state method. If locate-state finds a labeled state in the table it returns it, otherwise it creates a new instance and adds it to the table and (along with its label) to the worklist.

<Define the states table and locate-state>=

let states :: <set-table> = make(<set-table>);
local
  method locate-state
       (label :: <list>)
    => (state :: <regular-expression-dfa-state>);
    let result = element(states, label, default: #f);
    if(result)
      result;
    else
      let transitions
        = if (transition-collection-size)
            make(transition-collection-class,
                 size: transition-collection-size);
          else
            make(transition-collection-class);
          end if;
      let state = make(state-class, transitions: transitions);
      states[label] := state;
      push-last(worklist, state);
      push-last(worklist, label);
      state;
    end if;
  end method;
          

We retrieve states from the front end of the worklist.

<Set label and state from the worklist>=

let state = pop(worklist);
let label = pop(worklist);
          

When processing a state, we look at all of the positions in its label, and determine what transitions they engender.

<Process each position in label for state>=

for(position in label)
  do-regular-expression-dfa-state-position(state, position,
                                           deterministic?: deterministic?);
end for;
          

Making do-regular-expression-dfa-state-position an open generic allows applications to customize the handling of accepting states. The method for symbol positions computes state transitions by using the state's regular-expression-dfa-state-transitions as temporary storage for the labels of successor states. If we are requiring that the automaton be deterministic, then only one position is allowed to determine the transition on a given symbol for any DFA state.

<Exports from the regular-expression module>+=

export do-regular-expression-dfa-state-position;
          

<Module regular-expression>+=

define open generic do-regular-expression-dfa-state-position
    (state :: <regular-expression-dfa-state>,
     position :: <regular-expression>,
     #key deterministic?)
 => ();
          
define sealed method do-regular-expression-dfa-state-position
    (state :: <regular-expression-dfa-state>,
     position :: <symbol-regular-expression>,
     #key deterministic? = #f)
 => ();
  let follow = position.regular-expression-follow-positions;
  let symbol = position.regular-expression-symbol;
  let present = element(state.regular-expression-dfa-state-transitions,
                        symbol, default: #f);
  if(present)
    if(deterministic?)
      error("regular expression is non-deterministic on symbol %=", symbol);
    else
      element(state.regular-expression-dfa-state-transitions, symbol)
        := union(present, follow);
    end if;
  else
    element(state.regular-expression-dfa-state-transitions, symbol)
      := follow;
  end;
end method;
          
define sealed method do-regular-expression-dfa-state-position
    (state :: <regular-expression-dfa-state>,
     position :: <symbol-set-regular-expression>,
     #key deterministic? = #f)
 => ();
  let follow = position.regular-expression-follow-positions;
  for(symbol in position.regular-expression-symbol-set)
    let present = element(state.regular-expression-dfa-state-transitions,
                          symbol, default: #f);
    if(present)
      if(deterministic?)
        error("regular expression is non-deterministic on symbol %=", symbol);
      else
        element(state.regular-expression-dfa-state-transitions, symbol)
          := union(present, follow);
      end if;
    else
     element(state.regular-expression-dfa-state-transitions, symbol)
        := follow;
    end;
  end for;
end method;
          

We finalize the state by converting the elements in the transition collection from state labels to states.

<Finalize the transitions for state>=

let transitions = state.regular-expression-dfa-state-transitions;
let (initial-state, limit, next-state, finished-state?, current-key,
     current-element, current-element-setter, copy-state)
  = forward-iteration-protocol(transitions);
for(iteration = initial-state then next-state(transitions, iteration),
    until: finished-state?(transitions, iteration, limit))
  if(current-element(transitions, iteration))
    current-element(transitions, iteration)
      := locate-state(current-element(transitions, iteration));
  end if;
end for;
          

Bibliography