Introduction
Without giving a Computer Science 101 lecture, I'll provide you with 2 pictures that says a thousand words. In short, DFA stands for Deterministic Finite Automation (or Automata). It's cousin is NFA, non-deterministic FA. An example of a NFA is looping through a big list looking for a word. The word might be found quickley, or it might only find it in the last entry. This is said to be non-deterministic. So we can then say for a DFA we will know exactly how long a search for a word (as per the NFA example) will take.
Data Samples
The "raw" data (in no particular order) ready for NFA processing (you know the lazy way).
TABLE A
leppie can swim
leppie can talk
leppie looks good
leppie looks left
leppie goes home
john looks fine
john can walk
john wants nothing
leppie wants food
leppie can code
leppie can code C
leppie can code C#
leppie can code C# well
leppie can code C++ poorly
leppie lives in stellenbosch
leppie lives with a flatmate
leppie drinks plenty coke
leppie drinks coffee
leppie eats many hotdogs
leppie eats food
leppie knows alot
leppie knows alot about code
leppie knows a little about girls
john can swim
john can talk too
TABLE B
stop
stopper
stood
step
standard
stank
stance
authors
automatic
back
back's
backed
background
backing
backing's
backs
backwards
bad
badly
balance
balance's
ball
ball's
chain
chain's
chair
chair's
chairman
chairman's
chance
chance's
chances
change
changed
changes
changing
channel
channel's
channels
chaos
chaos's
chapter
chapter's
char
charge
charged
charges
fall
fallen
falling
falls
false
familiar
family
family's
famous
fan
fan's
fancy
far
farm
DFA representation of Table A. A circle with an outer circle denotes an endstate.
DFA representation of Table B. A circle with an outer circle denotes an endstate.
You might ask what that System.Object is doing there, well nothing. UPDATE: I have now removed the root object from the generated graph. As you can see, it is not an endstate, so it will not affect anything. But it does serve a purpose. It acts as the very first node that is created regardless of the key type. I can remove it, but then I would have to check for null everytime and that would be too costly. So dont worry about it!
Base Implementation: AtomicState
public class AtomicState
{
protected AtomicState();
protected AtomicState(object key);
protected bool Accepts(Array stack);
public Array AcceptStates(Array stack);
protected bool Add(Array stack);
public AtomicState GetAtomicStateAt(Array stack);
protected Array Match(Array stack);
protected bool Remove(Array stack);
internal protected virtual void RenderStateNodeAttributes attr);
internal protected virtual void RenderTransistion(EdgeAttributes attr);
public override string ToString();
public int AcceptStateCount { get; }
public int TotalStateCount { get; }
protected object key;
}
This base implementation for the DFA state machine. I have provided several implementations for typesafe Int32, String, Char, Boolean, Float, Object and Type. Feel free to add your own and see comments I have meda in the code. Arrays of Arrays have terrible casting issues. I have also included a Combination class that I did many moons back, and never had a place to put. Combinations are very good to build DFA's. Also included is a utiliy class to generate all these pretty pictures (yes, this is aimed directly at you, Marc Clifton ;P) you can see here (you will need GraphViz for this). UPDATE: The GraphViz utilities has been greatly improved, and from above you can see you can now specify rendering options on a State level. The implementation is similar to ASP.NET's custom control Rendering. In other words, if you need to change something, just override RenderXXX. See TypeState's implementation for a good example.
The class consists mostly of non-human comprehensable code (almost every function is either running in a while(true) loop or recusively). This is most mostly a generic port of a C project that was done for Computer Science. From the 2 count functions's code, you will see the "multi dimensional linked list" is rather easily "walkable" with recursion (I had some stack issues, but havent been able to replicate them).
Purpose
So you may ask yourself why or how do I use this? The way the machine has been setup, is to take a "path" at a time, so adding a sentence or a word, get automatically laid out in the machine. No other intervention is required. One of the most difficult things to do is changing a NFA to a DFA, this is all done for you. Then all you need to do is query it. Some functions like Match() can take longer with "wildcards" (set these up as null, see the implementations).
You might ask me now why I dont just use an ArrayList or a HashTable? Simple. An ArrayList is good for batches of data, but poor at searching. A HashTable is excellent for lookups, but fails to be useful for pattern matching or "path" finding. In fact you should be using either of the afore mentioned in most cases. Examples of what this is useful for:
- Spellchecker
- Word/code completion
- Code parser
- Path finding
- Any many more things most people only dream of
Performance
As with any DFA performance is key. This implementation is comparable to a Hashtable. In fact, if boxing was removed, lookups would be even nearer (2x instead of 3x in my experiments). Here is a simple console output from a 500 000 wordlist loaded in a char[]:
Extracting all entries
Time: 5752.794559ms
CharState Testing 10000 random lookups from 502069 entries
Time: 133.561947ms
Avg time: 0.013356ms
HashTable Testing 10000 random lookups from 502069 entries
Time: 37.712056ms
Avg time: 0.003771ms
As one can see, lookups are 3 times slower than a Hashtable, but remember that a HashTable is incredibly fast, but does not have the same characteristics. Note the slowness of extracting all the entries. Not bad if you think that the DFA is converted to raw data again.
Thanks to the person who wrote the HP timer class I used for the timing (sorry, cant remember your name now).
Walkthrough
Say for example you wanted a quick inheritance graph of an assembly. You would simple do this:
TypeState typeroot = new TypeState();
Type[] types = Assembly.GetAssembly(typeof(AtomicState)).GetTypes();
foreach (Type type in types)
{
ArrayList arr = new ArrayList();
Type etype = type;
do
{
arr.Insert(0, etype);
etype = etype.BaseType;
}
while (etype != null);
typeroot.Add( (Type[]) arr.ToArray(typeof(Type)));
}
TextWriter writer = File.OpenText("file.dot");
GraphViz.Generate(typeroot, writer);
Now we have a string we can pass to Dot (the GraphViz executable) and we end up with:
Pretty isn't it? But hey, that isnt even the purpose! Luckily GraphViz fits like a glove and its making many more things possible. If any one wants to send some network trace logs, I'll put up some renderings. UPDATE: Here they are! I have cropped the graph somewhat.
Observations
A question might arise why things like "leppie can eat" and "john can eat" does flow back into each other. This is infact possible, but a problem arises when you add "leppie can eat meat". Does this mean john can eat meat too? It depends how accurate you want to be. In my case, I have opted for accuracy. Adding this facility would only increase load time and to make it 100% accurate will require an NFA pass or a major change to the algorhythm. Suggestions are welcome.
Conclusion
I hope you enjoy using this, and feel free to comment as usual. The test project, basically contains just some tests I have run, and will probably cause an exception on your system. Look at it however to see how its meant to be used.
Changelog
1.1
- Totally redone GraphViz utility and added rendering on a State level. Both transition (edges) and state (node) attributes can be tweaked. I have kept the naming as near as possible to GraphViz's syntax for those that are fimiliar with the product.
- Uploaded new version with working tests (as you see them here) and pictures.
1.2
- Fix the performance bug.
- Made the pictures prettier :)
- Built a NET callable dll for Dot. This can be downloaded seperately. The compiling and linking and sorting out MC++ issues was a pain, but it finally works now. If you have problems I can send you a project file, but be warned, its a mess too.
Future Plans
Now, so all of you will come back and back again, I'm listing some plans I have for the future (viable suggestions will be added too):
- Ability to add state and transistion data, for the ultimate rendering experience.
- Perhaps, build a .NET wrapper for GraphViz (if someone can send me a solution file for the source, it would appreciated as the source is a mess). UPDATE : Built the dll, now for the wrapper. Has a small interface to call from .NET.
- I thought perhaps at a later stage I could implement the renderning GDI. Perhaps, later.