Introduction
The attached code is an implementation of the VF graph isomorphism algorithm. This algorithm is available at the VF Graph Comparing library, and there are other programs which form a wrapper to call into this library from, for instance, Python. The same could certainly be done for C#, but the code here implements the algorithm entirely in C#, bypassing the C++ library altogether. There were a few reasons for this. First, this offers the algorithm in an entirely managed environment. Second, I found the original C++ code to be not so efficient and not terribly well written. Third, the VF library was actually coded to do comparisons between various algorithms so you get those algorithms in tow whenever you use the library, and finally, it just sounded like fun.
Background
The graph isomorphism problem accepts a pair of directed graphs as an input, and returns true if one graph is just a "relabeling" of the other. The VF algorithm will also perform a subgraph isomorphism in which the there is a subgraph of the first graph which is isomorphic to the second graph. It also allows for retrieving the mapping after a match has been made. It allows for context checking in which nodes can carry attributes, and those attributes can be tested against the corresponding attributes for the isomorphic node, and the matching can be rejected based on the outcome of such tests. At its simplest, this allows the user to "color" nodes and only allow nodes of the same color to be matched.
The VF algorithm essentially builds up the isomorphism one match at a time, and does an educated backtracking search for the matches. There are various criteria when determining if the addition of a match is acceptable. Some of these are merely to boost performance, but the most important criterion is that the newly added match maintains the isomorphism of the group gathered thus far. In this way, it's pretty obvious that when you've handled all the nodes in the second graph, you've achieved the isomorphism you're looking for.
The details for all this can be found in the paper: A (sub)graph isomorphism algorithm for matching large graphs. My implementation follows this paper pretty exactly. About the only difference is that I use a sorted list for some of the internal sets of nodes, which means I'm more liable to come onto a viable match earlier. It also means that I can stop searches earlier than in the original algorithm.
Using the code
The code is implemented as a class library. There are two primary classes required to interface with the code. The first is the Graph
class which is used to create/manipulate your graph. The only constructor for Graph
is the default one which takes no parameters. Once you have a graph, you can insert nodes using InsertNode()
and edges using InsertEdge()
. Normally, InsertNode
is called with no parameters, in which case it returns the integer ID for the inserted node. These integer IDs can then be used in the InsertEdge(int node1, int node2)
call to insert edges between nodes. An alternative form of InsertNode
takes a single object
parameter which represents the attribute for the inserted node. The attributes can be arbitrary, or they can represent an implementation of the IContextCheck
interface. The IContextCheck
is very simple, and looks like this:
interface IContextCheck
{
bool FCompatible(IContextCheck icc);
}
As shown below, the match can be set up to use these context check attributes, in which case any two nodes which are examined as a possible match have the context check called on them. If the context check returns false
, then no match is allowed. In this case, all attributes in the graph must implement IContextCheck
. Any nodes with no attribute will match with any other node. Attributes can also be attached to edges via the InsertEdge(iNode1, iNode2, object attr)
, but they (currently) can't be used to affect the match as the attributes attached to nodes are.
Graph graph = Graph();
graph.InsertNodes(4)
graph.InsertEdge(0,1);
graph.InsertEdge(1,2);
graph.InsertEdge(2,3);
graph.InsertEdge(3,0);
There are also methods to delete nodes (DeleteNode(int ID)
) and edges (DeleteEdge(int idFrom, idTo)
). Please note that we are entering directed edges so that InsertEdge(1,0)
is a different edge from the one inserted with InsertEdge(1,0)
.
Once you have created a pair of graphs which you wish to check for isomorphism, you can create a VfState
object with them using the constructor VfState(Graph graph1, Graph graph2)
. The VfState
can then be used to do the actual check, as follows:
...
VfState vfs = new VfState(graph1, graph2);
bool fIsomorphic = vfs.FMatch();
if (fIsomorphic)
{
int[] mapping = vfs.Mapping1To2;
...
There is also a Mapping2To1
with the mapping from graph2
to graph1
. In subgraph isomorphism (which is used by default), the algorithm looks for a subgraph of graph1
which is isomorphic to graph2
. The VfState
can take two boolean
s as either VfState(graph gr1, graph gr2, boolean fIso)
or VfState(graph gr1, graph gr2, boolean fIso, boolean fContextCheck)
. If fIso
is true
, then the algorithm searches for a strict isomorphism. This is more efficient than searching for a subgraph isomorphism. If fContextCheck
is true
, then all node attributes are context checked against the counterparts in the mapping, as outlined above.
Update: In addition to the above two signatures for the VfState
constructor, there is now another which takes a third boolean
: VfState(graph gr1, graph gr2, boolean fIso, boolean fContextCheck, boolean fFindAll)
. fFindAll
is false
, by default. If true
, then after the match, you can get a list of all mappings of all isomorphisms. Each isomorphism is represented in the list by a FullMapping
structure, which just holds the 1 to 2 and 2 to 1 mappings in two int[]
arrays, as in the following example.
...
VfState vfs = new VfState(graph1, graph2, false, false, true);
bool fIsomorphic = vfs.FMatch();
if (fIsomorphic)
{
foreach(FullMapping fm in vfs.Mappings)
{
int[] mapping1to2 = fm.arinodMap1To2;
int[] mapping2to1 = fm.arinodMap2To1;
...
Algorithm overview
As mentioned above, the algorithm is, basically, a search over the space of matches for an isomorphism between graph2
and some subgraph of graph1
. By "match", I mean a mapping from a single node of graph1
to a single node of graph2
. Each step in the search algorithm proposes adding a new match to the isomorphism as developed so far. If we can extend these matches to include all of graph2
, then we're finished. If not, we have to backtrack and attempt different matches. The top level algorithm of this search looks like this (since the 4/29 update, this is really pseudo code, with the stuff for multiple isomorphisms left out for clarity, but the basic idea is still intact):
bool FMatchRecursive()
{
Match mchCur;
CandidateFinder cf;
if (FCompleteMatch())
{
return true;
}
cf = new CandidateFinder(this);
while ((mchCur = cf.NextCandidateMatch()) != null)
{
if (FFeasible(mchCur))
{
BacktrackRecord btr = new BacktrackRecord();
AddMatchToSolution(mchCur, btr);
if (FMatchRecursive())
{
_fSuccessfulMatch = true;
return true;
}
btr.Backtrack(this);
}
}
return false;
}
I should point out that this is a recursive version which is easier to understand, but ifdef
ed out in the actual code in favor of a non-recursive version. You can ifdef
it back in and verify that it works via the NUnit code, but it may blow the stack with sufficiently large graphs and will be less efficient. The method uses a CandidateFinder
to locate potential matches, runs through them, and uses a feasible method to do more careful checking. This checking includes ensuring that the match doesn't break the isomorphism we've established thus far. BacktrackRecord
allows us to undo the effects of the call to AddMatchToSolution
in case this match doesn't work out. We provisionally add the match to the current isomorphism, extending its range by one more node in each graph, and we recursively check to see if we can extend this new isomorphism to a complete mapping over graph2
. If so, we're done and return true
. If not, we move back to the state before we added the match, and look for other potential matches in the main loop.
The real key functions here are NextCandidateMatch
and FFeasible
. These two functions depend very much on a division of the nodes into four sets on each graph. The first set, which we'll refer to as I1/I2 on graph1/graph2, are the nodes which are participating in the isomorphism produced at each stage of the algorithm. The second set, T1/T2 for graph1/graph2, is the set of nodes which have at least one edge pointing towards a node in I1/I2. The third set, F1/F2 for graph1/graph2, is the set of nodes which have at least one edge pointing to them from a node in I1/I2. Note that if a node has both predecessors and successors in I1/I2, then it is a member of both F1/F2 and T1/T2. Finally, nodes which are disconnected from any node in the isomorphism are in sets D1/D2.
These groups are maintained by the use of an enum:
[Flags]
enum Groups
{
ContainedInMapping = 1,
FromMapping = 2,
ToMapping = 4,
Disconnected = 8
}
Each node maintains a field with this enum in it stating which group it belongs to. In addition, the VfState
class keeps a list of each of these groups, sorted by degree of each node.
Obviously, if we are to maintain the isomorphism by adding a match between a node N1
in graph1
and N2
in graph2
, certain structural rules must hold, and this is how our search manages to whittle down the candidates to a select few and weed out incorrect matches fairly quickly. For instance, the CandidateFinder
ensures that either both nodes come from the T sets, or both from the F sets, or both from the D sets. The case for the T sets stems from the observation that if N1
points into the current isomorphism, then so must N2
. Similarly for the F and D sets. If there are no nodes in either T1 or T2, the candidate finder looks to the F sets, and after that, the D sets. When CandidateFinder
searches through a particular group for matches, it does so by decreasing degree since that is the way the lists are sorted. This has a couple of implications. First of all, it gives us a better shot at matching the correct nodes earlier rather than later. Secondly, since N1
must have at least as many edges as N2
(remember that graph2
can map into a subgraph of graph1
), we can give up as soon as we find a node in the T1 list with degree less than N2
since all nodes thereafter will have still lower degree. Actually, if a true isomorphism is being sought, then we can stop if the first node of T1 has a different degree than the first node of T2. There are many situations where we have an equality failure condition in the case of a true isomorphism or a >= failure condition in the case of a subgraph isomorphism. For that reason, we have a delegate in the VfState
class, fnComp
, which performs one or the other of these tests, and we set it once when the VfState
is constructed and then use fnComp
throughout the code thereafter for these tests.
Once the CandidateFinder
has selected a viable match, we do further checking with FFeasible()
. There are essentially four tests that a match must pass to be allowed through FFeasible()
. The first and foremost, simply verifies that the match, when added to the current isomorphism, doesn't wreck that isomorphism. To be precise, every node in the isomorphism which is connected to N1
must map to a node which is similarly connected to N2
. In addition, if context checking is being done, it is also done here in the feasibility test, and matches can be rejected if their attributes conflict. This is really the only absolute requirement. As long as this is true, as we add nodes, we will always maintain a valid isomorphism of the nodes in the I sets so that when all nodes of graph2
are in I2, we will have our completed isomorphism. However, using only this criteria would allow many matches which would have to be manually backtracked over eventually, so we want to prune our choices here, and that is what the other three requirements of FFeasible
are all about.
The other three requirements are really just versions of the same rule applied to the T, F, and D groups, so I'll just discuss it for the T groups, but the same logic applies to the other two groups. I'm still assuming that we are adding a match between nodes N1
of graph1
and N2
of graph2
. We require that the count of In edges from nodes in T1 to N1
exceed the count of In edges from nodes in T2 to N2
. Ditto for the Out edges. This is another situation where we use fnComp
as described above to do an equality test for true isomorphism or inequality for subgraph isomorphism. This same test is applied to the F sets and D sets. The code to do the first test described here (In edges from T1 nodes compared to In edges from T2 nodes) looks like this:
if (!fnCmp(
GetGroupCountInList(lstIn1, _vfgr1, Groups.FromMapping),
GetGroupCountInList(lstIn2, _vfgr2, Groups.FromMapping)))
{
return false;
}
Here, GetGroupCountInList()
just counts the nodes in the list of the first argument which are in the correct group for the graph given by _vfgr1
(graph1
) or _vfgr2
(graph2
). The lstIn1
is just the list of nodes which have an edge pointing towards N1, and similarly for lstIn2
.
When we add the matching to the isomorphism, this potentially moves several other nodes into other groups. The AddMatchToSolution()
method keeps track of all this movement in the backtrack record so everything can be moved back if the match fails. The backtrack records are just linked lists of BacktrackAction
s, each of which is either a movement into the isomorphism or a movement from one group to another.
This completes the overview of the algorithm. I'm not including the conversion between Graph
s and VfGraph
s. The main difference here is to sort the nodes in the order of increasing degree in the VfGraph
and the setup of various lists of nodes in preparation for the VF algorithm. All the nodes are initially placed in the D group, and the I group is empty. We also keep the mapping between the unsorted nodes in the original Graph
structure and the nodes in the VfGraph
structure so that we can sort it all out in the end and return mappings via Mapping2To1
and Mapping1To2
, which refer to the node IDs in the original Graph
objects rather than the IDs used in the VfGraph
s. This is all really more in the matter of bookkeeping than any part of the algorithm proper.
Performance
When I started looking at the code, I wanted to find out how many matches were getting through the candidate finder, how many were getting through the feasibility checker, and how many made it through both but were ultimately backtracked over. I got my random graph creator to create a random graph with 1000 nodes, and found to my surprise that no nodes were being backtracked - the combination of the candidate finder and the feasibility checking was acting as a perfect oracle of what would end up in the final isomorphism. That seemed impossible, and I suspected a bug. To test this, I commented out the call to backtrack, and ran all the NUnit code. Two tests failed, which made me feel much better - my backtrack code was actually useful. After I started thinking about this, it all began to make much more sense. In the case of a random graph, there are probably very few nodes which have maximum degree. Of those with the same maximum degree, in order to pass the feasibility test, they must have the same number of Out edges and In edges. More than likely, this uniquely identifies the first match to be entered into the isomorphism. From there on, as the isomorphism builds up, the chances that the isomorphism itself is wrong go rapidly to essentially zero, since to assume otherwise is to assume that there are two isomorphically identical portions in the random graph. Ditto for adding new matches, since they must be attached to the current isomorphism and more than likely can only fit in one place of the isomorphism. So, we have a good chance of getting the first matches correct, and after those, the chances go up rather than down. Consequently, it began to dawn on me that the algorithm is going to work very well on random graphs. In fact, if we assume no backtracking and a constant average degree, I believe that the algorithm is O(n^2), which is great.
Of course, at this point, I wanted to try to find much worse cases. It seems obvious that it's going to have more backtracks if the graph is "highly automorphic", by which I mean that large portions of the graph are isomorphic to other portions. It seemed to me that a really bad case of this was a grid - nodes at the intersections of the gridlines, and edges between them along the grid lines. When I tried this, sure enough, there were lots and lots of backtracks. I realized that it ought to be worse when there was no isomorphism, so I hooked two opposite "corners" of the grid together with an edge in one graph which wasn't present in the second. I had to cancel out of this run after a long wait, because I think it's pretty close to the worst case. Since the corners had lower degrees than the middle nodes, they were going to be examined after all the middle nodes had been checked for isomorphism, and since they were the only non-isomorphic part, it was going to try zillions of backups. To verify this, instead of one graph between these two nodes, I inserted several so their higher degree caused the algorithm to examine them first. This caused the runtime to be almost immediate. Of course, I was trying this with a 30x30 grid, and it probably would have run reasonably well for smaller versions. One possibility which would have eliminated this particular problem is to have just done a comparison of node degrees. Since I'm already sorting nodes by degree, this should be easy. It will also work for subgraphs where I have to establish a one to one mapping between nodes in the second graph and nodes with a larger degree in the first graph.
Another potential problem area might be comparing a graph with much higher degrees with one with much lower degrees. This could eliminate the advantage of sorting the nodes, tending to bring isomorphic nodes together in the candidate finder. I don't think it would be too awful a problem since the original algorithm described in the links at the start of this article never took advantage of graph ordering at all. Of course, if the degrees got high enough, it might actually be easier since nodes will match to other nodes more easily. In the extreme, any 1-1 mapping will do if the first graph is complete.
So, the lesson is that the worst case can be pretty awful, but smaller graphs or more random graphs are pretty safe to expect a O(n^2) performance.
I also did some very non-disciplined timing (I didn't try to eliminate any background processing, for instance), and came up with the following for the random graphs in my test:
Note that the time/node has been multiplied by 4000 so that it scales nicely on the graph. The edge load factor for the 1000 node graph was 0.1 so that each node has an average of about 100 In edges and 100 Out edges. I wanted to hold the degrees the same for all the graphs, so for the other sized graphs, the edge load factor is inversely proportional to the count of nodes. Thus, the load factor for the 2000 node graph was 0.05.
Points of interest
A few caveats should be pointed out. First of all, when we speak of subgraph isomorphism, the definition of subgraph here is a graph obtained by removing nodes and their attached edges. This means that you can't remove edges directly in the subgraph unless they are attached to a removed node. I don't think this is any big problem to allow for, but haven't incorporated it yet.
The code has NUnit testing included. I like NUnit, and recommend you get it and the excellent TestDriven.NET Visual Studio add-in. They're both free and very worthwhile, and I think you'll be happy you did in the end. However, if you can't bring yourself to take these simple steps, then you can just eliminate the NUnit.FrameWork
reference in the project and take the NUNIT
definition out of the debug configuration. Even in this case, you can probably understand the code a bit better by looking at the NUnit code. There is no test or demo program in the code since it's been tested entirely through NUnit.
One last point - I would assume that the base code runs just fine on Mono, but I'm not set up to test it. I'm not sure whether NUnit runs there, so it might have to be removed also.
History
- 1-23-08 - First version.
- 1-24-08 - Algorithm overview added, and (whoops!) forgot originally to make
FMatch public
. Added back in an ifdef
ed out recursive call, and made a name change or two. - 1-30-08 - Corrected the first link, added a section on performance, and added a quick one time degree compatibility check between graphs.
- 4-29-08 - Added in the ability to get a list of all isomorphisms rather than the first one found.