Regular Expressions and Recognizer Automata
This is the Monday regular library, a set of
facilities for representing regular expressions and
constructing recognizer automata for them.
Copyright ©1997–2001 Peter S. Housel.
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Automata theory defines the concept of a regular expression, a formula describing a language called a regular set, recognizable by a finite-state automaton. Regular expressions provide a convenient way to describe lexical patterns, and are often used for search patterns in text editors and other text processing tools. They are also used to describe lexical tokens for computer languages.
This library provides a collection of object classes for representing regular expressions, and for constructing finite-state automata that recognize the languages they generate.
<Module Dylan-user>=
define library regular
use common-dylan;
export regular-expression;
end library;
define module regular-expression
use common-dylan;
<Exports from the regular-expression module>
end module;
Here we define a node structure for representing regular expressions.
<Exports from the regular-expression module>=
export
<regular-expression>,
copy-regular-expression;
|
[Primary Abstract Class] |
Class for representing regular expressions.
<object>
None.
The class of regular expressions.
<Module regular-expression>=
define primary abstract class <regular-expression> (<object>)
<Slots for >
end class;
|
[Generic Function] |
Returns a copy of a regular expression.
copy-regular-expression
regexp1
⇒
regexp2
regexp1<regular-expression>.regexp2<regular-expression>.Returns a (deep) copy of a given regular expression.
<Module regular-expression>+=
define generic copy-regular-expression
(regex1 :: <regular-expression>)
=> (regex2 :: <regular-expression>);
The rules for regular expressions define the following constructs:
The following classes represent these various regular expression types.
<Exports from the regular-expression module>+=
export
<epsilon-regular-expression>,
<symbol-regular-expression>,
regular-expression-symbol,
<symbol-set-regular-expression>,
regular-expression-symbol-set,
<union-regular-expression>,
regular-expression-union1,
regular-expression-union2,
<concatenation-regular-expression>,
regular-expression-head,
regular-expression-tail,
<closure-regular-expression>,
regular-expression-enclosed,
<accept-regular-expression>;
|
[Class] |
Class for representing epsilon regular expressions.
<regular-expression>
None.
The class of epsilon regular expressions.
<Module regular-expression>+=
define class <epsilon-regular-expression>
(<regular-expression>)
// no additional slots
end class;
define method copy-regular-expression
(regex1 :: <epsilon-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<epsilon-regular-expression>);
end method;
|
[Class] |
Class for representing symbol regular expressions.
<regular-expression>
symbol:<object>.The class of regular expressions matching a single symbol.
<Module regular-expression>+=
define class <symbol-regular-expression>
(<regular-expression>)
constant slot regular-expression-symbol :: <object>,
required-init-keyword: symbol:;
end class;
define method copy-regular-expression
(regex1 :: <symbol-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<symbol-regular-expression>,
symbol: regex1.regular-expression-symbol);
end method;
|
[Class] |
Class for representing regular expressions matching a set of distinct symbols.
<regular-expression>
symbol-set:<collection>.The class of regular expressions matching a one of a set of distinct symbols.
<Module regular-expression>+=
define class <symbol-set-regular-expression> (<regular-expression>)
constant slot regular-expression-symbol-set :: <collection>,
required-init-keyword: symbol-set:;
end class;
define method copy-regular-expression
(regex1 :: <symbol-set-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<symbol-set-regular-expression>,
symbol-set: regex1.regular-expression-symbol-set);
end method;
|
[Class] |
Class for representing union regular expressions.
<regular-expression>
union1:<regular-expression>.union2:<regular-expression>.The class of regular expressions matching one of two alternative subexpressions.
<Module regular-expression>+=
define class <union-regular-expression>
(<regular-expression>)
constant slot regular-expression-union1,
required-init-keyword: union1:;
constant slot regular-expression-union2,
required-init-keyword: union2:;
end class;
define method copy-regular-expression
(regex1 :: <union-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<union-regular-expression>,
union1: copy-regular-expression(regex1.regular-expression-union1),
union2: copy-regular-expression(regex1.regular-expression-union2));
end method;
|
[Class] |
Class for representing concatenation regular expressions.
<regular-expression>
head:<regular-expression>.tail:<regular-expression>.The class of regular expressions matching one regular expression followed by another.
<Module regular-expression>+=
define class <concatenation-regular-expression>
(<regular-expression>)
constant slot regular-expression-head,
required-init-keyword: head:;
constant slot regular-expression-tail,
required-init-keyword: tail:;
end class;
define method copy-regular-expression
(regex1 :: <concatenation-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<concatenation-regular-expression>,
head: copy-regular-expression(regex1.regular-expression-head),
tail: copy-regular-expression(regex1.regular-expression-tail));
end method;
|
[Class] |
Class for representing Kleene closure regular expressions.
<regular-expression>
of:<object>.The class of regular expressions matching the Kleene closure of a regular expression.
<Module regular-expression>+=
define class <closure-regular-expression>
(<regular-expression>)
constant slot regular-expression-enclosed :: <object>,
required-init-keyword: of:;
end class;
define method copy-regular-expression
(regex1 :: <closure-regular-expression>)
=> (regex2 :: <regular-expression>);
make(<closure-regular-expression>,
of: copy-regular-expression(regex1.regular-expression-enclosed));
end method;
We also need an additional type of regular expression node to mark the end of a regular expression. Semantic actions associated with the matching of a regular expression can be attached to these nodes.
|
[Open Class] |
Class for representing a marker regular expression indicating that an expression has been matched.
<regular-expression>
None.
The class of regular expressions indicating a successful match.
<Module regular-expression>+=
define open class <accept-regular-expression>
(<regular-expression>)
// no additional slots
end class;
One important purpose for building a representation of a regular expression is, of course, to build a deterministic state machine that recognizes it. One of the ways we can do this is to first construct a nondeterministic finite automaton (NFA) called a Glushkov automaton, and then convert this automaton to a deterministic finite automaton using the subset construction algorithm. This method is described in section 3.9 of .
The Glushkov automaton is constructed by treating each
<symbol-regular-expression> (or
<symbol-set-regular-expression>) node as a
position in a path taken through the regular
expression during the matching of a string. The states of the
of the automaton correspond to sets of positions possible at
that point, and its transitions correspond to paths between
symbols that can be adjacent on a path through the expression.
The construction algorithm makes use of a a function on
regular expression nodes we'll call
regular-expression-following-positions; this
function in turn is constructed making use of the functions
regular-expression-nullable? (which returns true
for a regular expression if the empty string can match it),
regular-expression-first-positions (which returns
the set of symbol nodes whose characters can appear at the
beginning of a matching string), and
regular-expression-last-positions (which returns
the set of leaf nodes whose characters can appear at the end
of a matching string).
The nifty thing is that we can compute these functions
while the <regular-expression>
representation is being built, making a separate pass
unnecessary. We'll define slots to store the function values
in the base class, and compute the values within the
initialize methods of the subclasses (effectively
a depth-first traversal).
<Slots for >=
slot regular-expression-nullable? :: <boolean> = #f;
slot regular-expression-first-positions :: <list> = #();
slot regular-expression-last-positions :: <list> = #();
slot regular-expression-follow-positions :: <list> = #();
We'll go through the
<regular-expression> subclasses one by one.
The first type, for epsilon regular expressions, is the
easiest because epsilon is trivially nullable, and the first
and last sets are empty.
<Module regular-expression>+=
define sealed method initialize
(instance :: <epsilon-regular-expression>, #next next-method, #key)
next-method();
instance.regular-expression-nullable? := #t;
end method initialize;
Symbol nodes and symbol set nodes are not nullable, and the only node that could match at the beginning or the end of such an expression is the node itself.
<Module regular-expression>+=
define sealed method initialize
(instance :: <symbol-regular-expression>, #next next-method, #key)
next-method();
instance.regular-expression-first-positions
:= instance.regular-expression-last-positions
:= list(instance);
end method initialize;
define sealed method initialize
(instance :: <symbol-set-regular-expression>,
#next next-method, #key)
next-method();
instance.regular-expression-first-positions
:= instance.regular-expression-last-positions
:= list(instance);
end method initialize;
Accept nodes are the same as symbol nodes in this respect.
<Module regular-expression>+=
define method initialize
(instance :: <accept-regular-expression>, #next next-method, #key)
next-method();
instance.regular-expression-first-positions
:= instance.regular-expression-last-positions
:= list(instance);
end method initialize;
Union nodes are nullable if either alternative is, and the first and last sets are the unions of the sets of the alternatives. Note that these are both unions of non-overlapping sets.
<Module regular-expression>+=
define sealed method initialize
(instance :: <union-regular-expression>, #next next-method, #key)
next-method();
let union1 = instance.regular-expression-union1;
let union2 = instance.regular-expression-union2;
instance.regular-expression-nullable?
:= regular-expression-nullable?(union1)
| regular-expression-nullable?(union2);
instance.regular-expression-first-positions
:= concatenate(regular-expression-first-positions(union1),
regular-expression-first-positions(union2));
instance.regular-expression-last-positions
:= concatenate(regular-expression-last-positions(union1),
regular-expression-last-positions(union2));
end method initialize;
Concatenation nodes are nullible only if both of the children
are. The regular-expression-first-positions set of
the concatenation node is the same as that of of its first
child, unless the first child is nullable, in which case the set
is the union of the two. The opposite is the case for the
regular-expression-last-positions set.
<Module regular-expression>+=
define sealed method initialize
(instance :: <concatenation-regular-expression>,
#next next-method, #key)
next-method();
let re-head = instance.regular-expression-head;
let re-tail = instance.regular-expression-tail;
instance.regular-expression-nullable?
:= regular-expression-nullable?(re-head)
& regular-expression-nullable?(re-tail);
instance.regular-expression-first-positions
:= if(regular-expression-nullable?(re-head))
concatenate(re-head.regular-expression-first-positions,
re-tail.regular-expression-first-positions);
else
re-head.regular-expression-first-positions;
end if;
instance.regular-expression-last-positions
:= if(regular-expression-nullable?(re-tail))
concatenate(re-head.regular-expression-last-positions,
re-tail.regular-expression-last-positions);
else
re-tail.regular-expression-last-positions;
end if;
<Compute the regular-expression-follow-positions set for the children of a <concatenation-regular-expression>>
end method initialize;
For each of the nodes in the
regular-expression-last-positions set of the first
child of the concatenate-node, we know that its
regular-expression-follow-positions set includes
all of the nodes in the
regular-expression-first-positions set of the second
child. This is also a disjoint union.
<Compute the regular-expression-follow-positions set for the children of a <concatenation-regular-expression>>=
let followers :: <list> = re-tail.regular-expression-first-positions;
for (node in re-head.regular-expression-last-positions)
node.regular-expression-follow-positions
:= concatenate(node.regular-expression-follow-positions, followers);
end for;
Lastly, we do the computations for closure nodes. Closure
nodes are nullable, since they represent zero or more
repetitions. The
regular-expression-first-positions and
regular-expression-last-positions sets are not
affected by the repetition.
<Module regular-expression>+=
define method initialize
(instance :: <closure-regular-expression>, #next next-method, #key)
next-method();
instance.regular-expression-nullable? := #t;
instance.regular-expression-first-positions
:= instance.regular-expression-enclosed.regular-expression-first-positions;
instance.regular-expression-last-positions
:= instance.regular-expression-enclosed.regular-expression-last-positions;
<Compute the regular-expression-follow-positions set for the child of a <closure-regular-expression>>
end method;
Since the star-node is a repetition, then for each of the
nodes in the regular-expression-last-positions
set of the starred expression, its
regular-expression-follow-positions set includes
all of the nodes in the
regular-expression-first-positions of the same
expression. This is
<Compute the regular-expression-follow-positions set for the child of a <closure-regular-expression>>=
let enclosed = instance.regular-expression-enclosed;
let followers :: <list> = enclosed.regular-expression-first-positions;
for (node in enclosed.regular-expression-last-positions)
node.regular-expression-follow-positions
:= union(node.regular-expression-follow-positions, followers, test: \==);
end for;
For efficiently recognizing matches of a regular expression, we need to construct a corresponding deterministic automaton.
<Exports from the regular-expression module>+=
export
<regular-expression-dfa-state>,
regular-expression-dfa-state-transitions;
|
[Class] |
Class of deterministic finite automata states constructed
from a <regular-expression> object.
<object>
transitions:<collection>.Describes states in a deterministic finite automaton
constructed from a <regular-expression>
object.
<Module regular-expression>+=
define open class <regular-expression-dfa-state> (<object>)
constant slot regular-expression-dfa-state-transitions
:: <mutable-collection>,
required-init-keyword: transitions:;
end class;
It's fairly straightforward to construct a DFA from the Glushkov automaton using the subset construction algorithm.
What we've built with our
regular-expression-follow-positions function is a
directed graph that amounts to a Nondeterministic Finite
Automation (NFA) that recognizes the regular expression. All
of the leaf nodes in the
regular-expression-first-positions of the root of
the regular expression node tree correspond to start states of
the NFA. Each of the nodes in the
regular-expression-follow-positions of a symbol
or accept node correspond to transitions that can be made from
each state.
The subset construction algorithm takes an NFA and constructs a corresponding DFA (Deterministic Finite Automation) by mapping each possible set of NFA states into a single DFA state. This theoretically could blow up into an exponential number of DFA states, but for most practical lexical analyzers, this usually doesn't happen.
<Exports from the regular-expression module>+=
export regular-expression-dfa;
|
[Function] |
Returns a deterministic automaton corresponding to a regular expression.
regular-expression-dfa
regular-expression
#key
deterministic?
transition-collection-class
transition-collection-size
state-class
⇒
dfa
num-dfa-states
regular-expression<regular-expression>.deterministic?<boolean>. The
default is #f.transition-collection-classsubclass(<mutable-collection>).
The default is <object-table>.transition-collection-size<integer>, or #f.
The default is #f.state-classsubclass(<regular-expression-dfa-state>).
The default is
<regular-expression-dfa-state>.dfa<regular-expression-dfa-state>.num-dfa-states<integer>.Returns the start state of a deterministic finite
automaton corresponding to
regular-expression.
If deterministic? is true, then the
automaton is required to be deterministic as defined in
. If it is not, then an error is
signalled.
The transition-collection-class is
instantiated to initialize the
regular-expression-dfa-state-transitions slot
in each state. This allows regular expressions using
<integer> symbols to use
<vector< instances for transition
tables, and for other types to use an
<explicit-key-collection>.
<Module regular-expression>+=
define function regular-expression-dfa
(regular-expression :: <regular-expression>,
#key deterministic? = #f,
transition-collection-class = <object-table>,
transition-collection-size :: false-or(<integer>) = #f,
state-class :: subclass(<regular-expression-dfa-state>)
= <regular-expression-dfa-state>)
=> (start-state :: <regular-expression-dfa-state>,
num-dfa-states :: <integer>);
let worklist :: <deque> = make(<deque>);
<Define the states table and locate-state>
let start-state
= locate-state(regular-expression.regular-expression-first-positions);
while(~empty?(worklist))
<Set label and state from the worklist>
<Process each position in label for state>
<Finalize the transitions for state>
end while;
values(start-state, states.size);
end function;
Since we could end up with a fairly large number of states,
we need to have a way of locating them quickly. Here we
define a table class for DFA states, with the sets of
corresponding positions in the Glushkov automaton as keys.
Dylan allows us to do this by subclassing
<table> and specializing the function
table-protocol to return key equality and hash
functions.
<Module regular-expression>+=
define class <set-table> (<table>)
// No additional slots.
end class;
define method table-protocol(table :: <set-table>)
=> (test-function :: <function>, hash-function :: <function>);
values(method(set1 :: <list>, set2 :: <list>)
=> (equivalent?);
<Compare sets set1 and set2 for equality>
end method,
method(set :: <list>, initial-state :: <object>)
=> (hash-id :: <integer>, hash-state :: <object>);
<Compute the hash value for set>
end method);
end method;
Two sets are equal if they have the same number of elements and every element of the first set is also a member of the second set.
<Compare sets set1 and set2 for equality>=
set1.size = set2.size
& every?(rcurry(member?, set2), set1);
The hash function simply combines the hash values of all of the individual items, insuring that their order within the sequence is irrelevant.
<Compute the hash value for set>=
if (empty?(set))
values(0, initial-state)
else
let (hash-id :: <integer>, hash-state :: <object>)
= object-hash(first(set), initial-state);
for(item in set, first? = #t then #f)
if (~first?)
let (item-hash-id :: <integer>,
item-hash-state :: <object>)
= object-hash(item, hash-state);
hash-id := merge-hash-ids(hash-id, item-hash-id, ordered: #f);
hash-state := item-hash-state;
end if;
end for;
values(hash-id, hash-state);
end if;
Given this, we can define a table for looking up states and
the local locate-state method. If
locate-state finds a labeled state in the table
it returns it, otherwise it creates a new instance and adds it
to the table and (along with its label) to the worklist.
<Define the states table and locate-state>=
let states :: <set-table> = make(<set-table>);
local
method locate-state
(label :: <list>)
=> (state :: <regular-expression-dfa-state>);
let result = element(states, label, default: #f);
if(result)
result;
else
let transitions
= if (transition-collection-size)
make(transition-collection-class,
size: transition-collection-size);
else
make(transition-collection-class);
end if;
let state = make(state-class, transitions: transitions);
states[label] := state;
push-last(worklist, state);
push-last(worklist, label);
state;
end if;
end method;
We retrieve states from the front end of the worklist.
<Set label and state from the worklist>=
let state = pop(worklist);
let label = pop(worklist);
When processing a state, we look at all of the positions in its label, and determine what transitions they engender.
<Process each position in label for state>=
for(position in label)
do-regular-expression-dfa-state-position(state, position,
deterministic?: deterministic?);
end for;
Making
do-regular-expression-dfa-state-position an open
generic allows applications to customize the handling of
accepting states. The method for symbol positions computes
state transitions by using the state's
regular-expression-dfa-state-transitions as
temporary storage for the labels of successor states. If we
are requiring that the automaton be deterministic, then only
one position is allowed to determine the transition on a given
symbol for any DFA state.
<Exports from the regular-expression module>+=
export do-regular-expression-dfa-state-position;
<Module regular-expression>+=
define open generic do-regular-expression-dfa-state-position
(state :: <regular-expression-dfa-state>,
position :: <regular-expression>,
#key deterministic?)
=> ();
define sealed method do-regular-expression-dfa-state-position
(state :: <regular-expression-dfa-state>,
position :: <symbol-regular-expression>,
#key deterministic? = #f)
=> ();
let follow = position.regular-expression-follow-positions;
let symbol = position.regular-expression-symbol;
let present = element(state.regular-expression-dfa-state-transitions,
symbol, default: #f);
if(present)
if(deterministic?)
error("regular expression is non-deterministic on symbol %=", symbol);
else
element(state.regular-expression-dfa-state-transitions, symbol)
:= union(present, follow);
end if;
else
element(state.regular-expression-dfa-state-transitions, symbol)
:= follow;
end;
end method;
define sealed method do-regular-expression-dfa-state-position
(state :: <regular-expression-dfa-state>,
position :: <symbol-set-regular-expression>,
#key deterministic? = #f)
=> ();
let follow = position.regular-expression-follow-positions;
for(symbol in position.regular-expression-symbol-set)
let present = element(state.regular-expression-dfa-state-transitions,
symbol, default: #f);
if(present)
if(deterministic?)
error("regular expression is non-deterministic on symbol %=", symbol);
else
element(state.regular-expression-dfa-state-transitions, symbol)
:= union(present, follow);
end if;
else
element(state.regular-expression-dfa-state-transitions, symbol)
:= follow;
end;
end for;
end method;
We finalize the state by converting the elements in the transition collection from state labels to states.
<Finalize the transitions for state>=
let transitions = state.regular-expression-dfa-state-transitions;
let (initial-state, limit, next-state, finished-state?, current-key,
current-element, current-element-setter, copy-state)
= forward-iteration-protocol(transitions);
for(iteration = initial-state then next-state(transitions, iteration),
until: finished-state?(transitions, iteration, limit))
if(current-element(transitions, iteration))
current-element(transitions, iteration)
:= locate-state(current-element(transitions, iteration));
end if;
end for;