Source Code
The C and C# code discussed in this article can be found here:
https://github.com/zippy/ceptr
A good blog post (including a couple videos!) on the concept of Ceptr and Semtrex can be found here:
http://ceptr.org/2015/04/06/semantic-trees/
Introduction
In this article, I would like to introduce you to the work of Eric Harris-Braun and Arthur Brock. They have been working on a concept that they call "Semtrex" - a semantic tree regular expression matcher, as part of a larger project called "Ceptr."
Two cornerstone of the Ceptr concept are:
- data always carries with it its semantics, in other words, the qualities that give the data meaning
- the meaning (semantics) of something is very naturally expressed as tree structures
As a result of these premises, it became clear that, similar to how we have a regular expression (Regex) matcher for one-dimensional strings, a tree expression matcher is needed for trees of two or more dimensions. Furthermore, because the trees incorporate meaning (semantics), the tree expression matcher must be a semantic tree regular expression matcher. Here we see another divergence from Regex -- Semtrex matches on the semantics of the data as well as literal values, whereas Regex, having no concept of the semantics on which it is matching, can match only on the literal values.
This introduces some new and interesting behaviors to pattern matching -- matching not just the literal values of the data but also the semantic meaning of the data. Thus, the while "42" matches with the literal "42", "42 years old" does not match with "42 is the answer to the ultimate question of life, the universe, and everything."
The Nature of Nature
Let's consider why semantic trees are important and, to be rather bold about it, actually necessary for the future evolution of computing, communication, and data exchange. There are both philosophical and concrete, computational reasons, each guiding the other in understanding. We will consider how everything in nature (though our focus is on data) has two characteristics: meaning and structure (relationship.)
The Nature of Meaning
"Semantics is the study of meaning."1 In order for information to have value, that information must have meaning. The term "value" can be interpreted as meaning both "valuable" and "computable." Information is not valuable unless it is meaningful. Information has values on which to base computations. So, by value, we mean in part that computations can be performed on that value or that the value is itself part of a computation involving many other values, and that we find both the values and the computations valuable. Take, for example, the Semantic Web3 which "provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."3 Note that the term itself was coined by Tim Berners-Lee4 for a web of data that can be processed by machines.5
The Meaning of Nature
Nature is relational -- from the most fundamental theoretical models of quantum physics to largest macrocosmic scale of physical galaxies, we constantly see that "things" are in relationship with other "things." Furthermore, these relationships have structure which we humans, with great frequency, map into a hierarchical, tree-like space. We do this with organizational charts, phylogeny, and family trees. We use parse trees7 for parsing a context-free grammar. Compilers take our linear code and parse it into BNF8 trees to generate (again linear) machine code.
Structure (the relationship between values) is also a necessary component of meaning and therefore of computation. Just as you cannot perform a computation on a value without at least an implicit semantic understanding of what that value is, you cannot perform computations on groups of values without having a defined structure that relates those values to each other. Furthermore, that structure (the defining of relationships) itself creates new meaning.
Meaning: Symbols and Structures
Eric Harris-Braun: "Meaning comes from having a social context in which symbols are applied to embodied forms (structures!)"
Arthur Brock: "Meaning comes from understanding the use of a symbol in a specific context. A context means a collection of symbols used in a particular way."
For example, here is an ottoman used as a char, a footstool, and a table.
(Arthur Brock working, relaxing, and play chess with Eric Harris-Braun)
The context provides essential additional meaning to the symbol "ottoman."
Structure (the defining of relationships) creates new meaning is a key point, and while it is (and in fact must be) self-reflective or "fractal", it explains why we explicitly (or implicitly) create hierarchical structures. For example, given something we want to measure, we assign a symbol to it--inches, feet, kilograms, light-years. This gives the measurement a semantic context. We see this "thing" in relationship with other "things" that also have semantic meaning, and so we create a structure that maps the relationship of those "things." That structure is itself assigned a semantic meaning, and so on, recursing in a fractal manner either up or down the hierarchies that we are creating, allowing us to express more and more complex and interesting concepts and relationships between those concepts.
Eric Harris-Braun: "This is an important pattern in the real world and in programming ("has a" relationship). This pattern of alternation is a fundamental pattern (programming languages, real world, etc) what we're trying to do in Ceptr is make that fundamental pattern explicit and have some self knowledge of what we're doing."
Arthur Brock: "What we're doing in Ceptr ... when you program, the programmer has to deal with these things as meaningful units (variable names) but then when we compile it, it all goes away, collapsed down to structured data in memory addresses. We then have difficulty sharing the meaning, we have to then construct a semantic layer where we "hydrate" the meaning back into the structure."
Living Meaning
Meaning is not static and this is a major contributor to why software becomes obsolete: software is incapable of adapting, beyond a certain limit, to changes in meaning.
- Values change -- they are created, destroyed, and changed. This happens with the greatest frequency and is the only change that modern day software typically handles well (without programmer intervention.)
- Relationships between values change -- old relationships become "meaningless", and new relationships are necessary between new values. While this happens slower, it could be argued to be the primary reason for why software needs to be updated and eventually replaced. For example, the lifetime of a relational database, which is expressing fixed relationship in tables and foreign keys, is limited by the ability to accommodate new relationships (new tables, changes in foreign keys, etc.)
- Meaning itself changes, albeit very slowly over time. For example, consider how the meaning of the word "gay" has changed over time.
- New structures are formed and given semantic meaning. One can observe how computer languages have evolved as new structures are incorporated into the language, such as lambda expressions, anonymous methods, and C#'s async/await keywords, all of which express older concepts in more efficient semantics.
Semtrex
Realizing that meaning involves semantics and structure, Semtrex is a cornerstone to a much larger "rewrite" of the computing space in which we currently live. At present, our computational space takes semantically structured data, decouples the semantic meaning and structure into raw data streams or storage (with or without a separate schema), and then re-assembles that data into either the same or different semantic structures. Ironically, this is the impetus for a new field of information and computer science called ontology engineering "which studies the methods and methodologies for building ontologies: formal representations of a set of concepts within a domain and the relationships between those concepts."9
For example, the SMTP Specification10 defines, among other things, the structure and semantics of email addresses. When we write "foo@biz.eu", we strip off the structure and semantic meaning and in fact store and transmit the email address with none of its original structure or semantics. When we write a mail server, we have to "parse" the email address, "hydrating" its structure and the semantics of its constituent parts so that we can perform computations on the address such as validation, verification, checking against white lists, black lists, perform billing, and so forth. Furthermore, the system is very rigid. For example, we cannot send an email to a physical address which could be handled by a system that physically prints and mails the content. We cannot send an email to a phone number which would then dial the phone and read the content with text-to-speech synthesis. We cannot send an email to a mobile phone's text messaging service. Why? Because the mail server must assume a specific semantic meaning to the value in the "mail-to" slot--it must be an email address--because the "mail-to" slot does not carry along with it the semantics of its value.
Edge Space
A cornerstone of the Ceptr architecture is that it operates exclusively with semantic data and therefore semantic trees. In the broader picture, Ceptr provides a means for processing semantic data in virtual machine "receptors." Data is moved in and out of receptors as semantic packages, or signals, which embody of course both the data and the semantics corresponding to that data. In this way, receptors only process those things that have semantic meaning to them.
However, the rest of the world does not usually exchange semantic information -- instead, it's often raw binary data or human-readable ASCII strings (like this article.) We see some introduction of semantic meaning in XML and JSON packets, both in their ability to structure the data and associate attribute names (or keys) to the data. However, even with XML and JSON, the attribute and keys are often high level abstractions, like "address" or "full name", or domain-specific, like "domicile" and "maiden name." In the former, a program parsing the data does not know that "address" actually refers, semantically, to a home address, vacation address, or business address (of many options) and in the latter case, again a program parsing the key "domicile" (which may have semantic meaning to the programmer that wrote the XML or JSON) probably requires a "translation" to the semantic meaning "home address."
While this leaves the world with an almost infinite need for programmers to constantly reformat one semantically poor data structure into another, equally poor, semantic data structure, it still leaves us with the real world issue that this problem does indeed exist. This is the "edge space" where "data" in the outside world and semantic data inside of Ceptr is translated.
Starting with an Example
Marc Clifton has built a simple IDE (implemented in C#) around the C code to provide a playground for working with Semtrex expression matching. We take the example of parsing a latitude and longitude from an ASCII string into a Semtrex and performing some match tests on it.
Building the Symbol and Structure Tree
The first step is to define the structures and their symbols. When we fire up the IDE, we're presented with an empty Symbol Tree. Let's create a container for our demo lat/lon symbols & structures:
Click on "Symbol Namespace" and provide a name for the container. Here, I use "latlon demo":
Right-click on the "latlon demo" container and select "Add Symbol" from the popup menu:
We can now define the symbol and structure names in the Properties pane:
We'll assume that this is a "home location", so we give name the symbol "HomeLocation" and specify that the structure is a "latlon":
We also see the symbol and structure listed in the Symbols and Structures panes:
Next, we define the "latlon" structure. This structure consists of two symbols whose structure is "float":
As the above screenshot illustrates, there is a short list of built-in structures from which one can also choose. Keep in mind that we could even define what we mean by the structure "float" -- we could in fact define it as two integers (again structures of bits) with the symbols "mantissa" and "exponent." This process can be repeated down to the atomic level (and further) of capacitance charge in memory cells.
We now have our symbols and structures defined for parsing a latlon string.
Converting a String to an ASCII Tree
Because, Semtrex is a tree expression matcher, any input string must be converted to an ASCII tree. This is by intent, as it keeps the parser code uniform in that it is always working with trees. An "ASCII Tree", which consists of nothing more than a root node and where each of the children is a single character in the string, is necessary to convert strings into a tree structure on which Semtrex can operate. While this sounds silly, it allows us to stay in the unified world of semantic trees as the input for Semtrex. Another way to put it is, we are describing the string semantically with as much meaning as we can give it at the moment: "a string of ASCII characters" which we wrap into an ASCII Tree representation. We don't know what the string is or what it contains, but now that we've got it in a tree structure, we can parse it.
Given a latlon input string:
We can parse this into an ASCII Tree by clicking "Parse to ASCII Tree". The result is displayed in the Semtrex Tree textbox (only a piece is shown here, you get the idea):
Or, visualized with a D3 tree (fragment shown):
The root of the tree is a symbol "ASCII_CHARS" and each element of the tree is a structure-value pair "ASCII_CHAR" and the value of the char.
Matching on the ASCII Tree
Now that we have the string represented as a tree, we can write a Semtrex expression to match against the tree:
/ASCII_CHARS/<HomeLocation:<lat:ASCII_CHAR+>,ASCII_CHAR=',',<lon:ASCII_CHAR+>>
Note how this expression not only matches against the format of the input string but also specifies the symbols associated with the structural components of the tree-ified string.
What the above Semtrex expression is doing is as follows:
- The angle brackets indicate that we are on the fly naming (or assigning a semantic symbol) the matching part of the input string.
- In this case, we're creating a symbol:"lat" which will hold the ASCII_CHARS preceding the comma, and one called "lon" which will hold the symbols after the comma.
- Then "HomeLocation" is a symbol tree with "lat" and "lon" as its two children.
- Those named symbols (assuming there are definitions in the system for how they're structured) can be used in later Semtrex matches on the match output they produce.
We enter the Semtrex expression in the next line and click on Parse:
Why do we do this? Because we've specified a Semtrex expression as a string, which itself must be first converted into a tree! So you can see the pattern here: inputs are semantic trees, expressions are semantic trees, and Semtrex matches input trees against expression trees. In fact, all output from Semtrex is a semantic tree as well.
The result of the parse operation can be viewed as well, shown here in full:
(
SEMTREX_SYMBOL_LITERAL
(
SEMTREX_SYMBOL:ASCII_CHARS
)
(
SEMTREX_GROUP:HomeLocation
(
SEMTREX_SEQUENCE
(
SEMTREX_GROUP:lat
(
SEMTREX_ONE_OR_MORE
(
SEMTREX_SYMBOL_LITERAL
(
SEMTREX_SYMBOL:ASCII_CHAR
)
)
)
)
(
SEMTREX_VALUE_LITERAL
(
ASCII_CHAR:','
)
)
(
SEMTREX_GROUP:NULL_SYMBOL
(
SEMTREX_ONE_OR_MORE
(
SEMTREX_SYMBOL_LITERAL
(
SEMTREX_SYMBOL:ASCII_CHAR
)
)
)
)
)
)
)
Again, as a D3 tree:
Does The Input Match?
Now we can ask, "does the input tree match the expected format as specified by the Semtrex match expression?" We ask this by clicking on the "Matches?" button:
And the answer is True:
Furthermore, we also can also ask Semtrex how it matches, which is also a tree:
(
SEMTREX_MATCH:1
(
SEMTREX_MATCH_SYMBOL:HomeLocation
)
(
SEMTREX_MATCH_PATH:/1
)
(
SEMTREX_MATCH_SIBLINGS_COUNT:11
)
(
SEMTREX_MATCH:3
(
SEMTREX_MATCH_SYMBOL:lat
)
(
SEMTREX_MATCH_PATH:/1
)
(
SEMTREX_MATCH_SIBLINGS_COUNT:5
)
)
(
SEMTREX_MATCH:2
(
SEMTREX_MATCH_SYMBOL:lon
)
(
SEMTREX_MATCH_PATH:/7
)
(
SEMTREX_MATCH_SIBLINGS_COUNT:5
)
)
)
In D3:
Notice how the above match result expression tells us with which symbol the structure and it's data is associated.
Embodiment
We can now "embody" the symbols with their values. We now have a structure with meaning -- a semantic structure. We do this by clicking on the "Embody" button:
and we see the result:
Again, as a D3 tree:
Matching Against An Embodied Tree
All of the steps above were actually necessary simply to convert an input string into a semantic tree. The nifty thing is that this was done using Semtrex itself. However, now we arrive at the meat and potatoes of the matter. Let's say data was always semantic -- in other words, what you received as your input was not a relatively meaningless string (albeit perhaps human readable after some training) but a structure that was meaningful to the machine. This:
is meaningful to the machine. We can now ask the machine to match against it, both structurally and the values it contains. Here's a Semtrex expression that matches:
/%HomeLocation/(lat=42.25,lon=73.25)
And we can test this with the Matches? button, which returns true. The astute reader will say, but wait! That input is a string an must, like the previous match expression, be turned into a Semtrex expression first! And you are correct, which is why we must first parse the (sort of) human readable input string into a Semtrex expression:
(
SEMTREX_WALK
(
SEMTREX_SYMBOL_LITERAL
(
SEMTREX_SYMBOL:HomeLocation
)
(
SEMTREX_SEQUENCE
(
SEMTREX_VALUE_LITERAL
(
lat:42.250000
)
)
(
SEMTREX_VALUE_LITERAL
(
lon:73.250000
)
)
)
)
)
And the last D3 graph rendering:
So here you finally see the whole playground:
Building The Semtrex C Code
Semtrex (the parser, specs, and symbol management) are written in C. Eric chose the C language because it is ubiquitous among a wide variety of platforms, everything from Arduino's to, of course, Windows and Linux.
Compiling the C Code
The simplest way to build the C code is to download Eclipse and open the Eclipse project in the Ceptr GitHub repository. The resulting dll file can be copied into the C# bin\Debug or bin\Release folder.
The C# IDE
The IDE is organized into two projects -- "csharp-ide" is the IDE UI. Readers may be familiar with other articles written by Marc Clifton which take advantage of the same approach:
- WeifenLuo's Docking Manager.
- MycroParser used for the UI definition and event wire-up.
- MVC pattern for separating out the model, view, and controllers for each of the dockable panes.
- XTree - a general purpose tree builder.
The second project ("ceptrlib") is the interface to Eric's C code, which must be built with the "allow unsafe code" option. There are several structures, for example:
[StructLayout(LayoutKind.Sequential, Pack = 1), Serializable]
public unsafe struct Defs
{
public TreeNode *structures;
public TreeNode *symbols;
public TreeNode *processes;
public TreeNode *scapes;
};
These must be marked as unsafe to fix the error "Pointers and fixed size buffers may only be used in an unsafe context."
The C# / C Interface
The C calls are made using the DllImport attribute. There are a few:
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static void def_sys();
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static void sys_free();
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe TreeNode* _t_new_root(SemanticID sid);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe SemanticID _d_declare_symbol(TreeNode* symbols, SemanticID sid, string label, UInt16 context);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe SemanticID _dv_define_structure(TreeNode* structures, [MarshalAs(UnmanagedType.LPStr)] string label, int num_params, __arglist);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe int _t_children(TreeNode* structures);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe void __t_dump(Defs* defs, TreeNode* t, int level, char* buf);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe TreeNode* makeASCIITree(string stx);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe TreeNode* parseSemtrex(Defs* d, string stx);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe int _t_match(TreeNode* semtrex, TreeNode* matchAgainst);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe int _t_matchr(TreeNode* semtrex, TreeNode* matchAgainst, TreeNode** matchResult);
[DllImport("libceptrlib.dll", CallingConvention = CallingConvention.Cdecl)]
extern static unsafe TreeNode* _t_embody_from_match(Defs* d, TreeNode* matchResult, TreeNode* semtrex);
Initialization and Termination
Having the necessary imported 'C' functions above, we can implement all that is necessary for initializing and terminating the 'C' library:
public unsafe void Initialize()
{
def_sys();
}
public unsafe void Terminate()
{
sys_free();
}
Initializing Structure and Symbol Trees
These are stored as separate trees, and we initialize them with:
public void CreateStructureAndSymbolNodes()
{
Structures = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_SYMBOL,
id = (UInt32)SystemSymbolIDs.STRUCTURES_ID
};
Symbols = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_SYMBOL,
id = (UInt32)SystemSymbolIDs.SYMBOLS_ID
};
RootStructuresNode = CreateRootNode(Structures);
RootSymbolsNode = CreateRootNode(Symbols);
}
In the C implementation, a SemanticID consists of:
- A context in which it is defined (16 bits). This may specify a local receptor, the code installed in a system or from a compository of code and receptors.
- Flags are a middle 16 bits reserved for flags about the symbol.
- ID is a unique identifier to that context and type.
In other words, a symbol ID is 64 bits. 16 reserved for identifying context of identifier, 16 reserved for markers about the type of symbol (like is it a process, structure or a symbol, or scape or receptor or something else) and 32 for the identifier unique to that context and type, like a structure ID or symbol ID. Note that one of the confusing things is that there is a symbol SYMBOLS_ID (plural) which holds the list of all SYMBOL_ID's.
From the C# perspective, and not wanting to require unsafe support for every assembly that references this interface assembly, the C pointers are mapped to GUID's:
public unsafe Guid CreateRootNode(SemanticID structures)
{
TreeNode *node = _t_new_root(structures);
Guid guid = RegisterNode(node);
return guid;
}
protected unsafe Guid RegisterNode(TreeNode* node)
{
Guid guid = Guid.NewGuid();
nodes[guid] = (IntPtr)node;
return guid;
}
This way, all other assemblies can reference tree nodes using a Guid rather than a C-style TreeNode*. Any node, given the Guid, can in this assembly be restored to the TreeNode*:
protected unsafe TreeNode* GetNode(Guid id)
{
return (TreeNode*)nodes[id];
}
Internal Types
There is of course a limit to how much one can realistically drill down into the symbol-structure space, so there are several internal value types to represent basic machine-understandable types. Remember, you're look at an interface to C code here, and C has no concept of things like generics or reflection, so all of the meaning (funny how we are talking about meaning at both the semantic level and the code level) has to be expressed in structures and enumerations.
public SemanticID GetFloat()
{
SemanticID sid = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_STRUCTURE,
id = (UInt32)SystemStructureID.FLOAT_ID
};
return sid;
}
public SemanticID GetString()
{
SemanticID sid = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_STRUCTURE,
id = (UInt32)SystemStructureID.CSTRING_ID
};
return sid;
}
public SemanticID GetInteger()
{
SemanticID sid = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_STRUCTURE,
id = (UInt32)SystemStructureID.INTEGER_ID
};
return sid;
}
public SemanticID GetList()
{
SemanticID sid = new SemanticID()
{
context = (UInt16)SemanticContexts.SYS_CONTEXT,
flags = (UInt16)SemanticTypes.SEM_TYPE_STRUCTURE,
id = (UInt32)SystemStructureID.LIST_ID
};
return sid;
}
Declaring Symbol and Structures
Here we declare a symbol:
public unsafe SemanticID DeclareSymbol(Guid symbols, SemanticID st, string label, SemanticContexts sc = SemanticContexts.RECEPTOR_CONTEXT)
{
TreeNode *pnode = (TreeNode*)nodes[symbols];
SemanticID symbol = _d_declare_symbol(pnode, st, label, (UInt16)sc);
return symbol;
}
and a structure:
public unsafe SemanticID DefineStructure(Guid structures, string name, SemanticID[] symbolArray, SemanticContexts sc = SemanticContexts.RECEPTOR_CONTEXT)
{
TreeNode *structs = (TreeNode*)nodes[structures];
_dv_define_structure(structs, name, symbolArray.Length, __arglist(symbolArray));
SemanticID st = new SemanticID() { context = (ushort)sc, flags = (ushort)SemanticTypes.SEM_TYPE_STRUCTURE, id = (uint)_t_children(structs) };
return st;
}
Remember that a structure (like "latlon") is actually a collection of symbols ("lat" and "lon"), therefore we pass in the symbols that comprise the structure.
You should start to see a pattern here -- every piece of symbol and structure is an instance of SemanticID
, whether it is a user defined symbol/structure or a system defined symbol/structure.
Converting a String to an ASCII Tree
This is a straight forward process now:
public unsafe Guid GetTree(string str)
{
TreeNode* node = makeASCIITree(str);
Guid nodeID = RegisterNode(node);
return nodeID;
}
Parse the Semtrex Expression
public unsafe Guid ParseSemtrex(Guid g_symbols, Guid g_structures, string expression)
{
Defs defs = CreateDefs(g_symbols, g_structures);
TreeNode* node = parseSemtrex(&defs, expression);
Guid nodeID = RegisterNode(node);
return nodeID;
}
Embody the Match Results
public unsafe Guid Embody(Guid g_symbols, Guid g_structures, Guid matchID, Guid semtrexID)
{
Defs defs = CreateDefs(g_symbols, g_structures);
TreeNode* match = GetNode(matchID);
TreeNode* semtrex = GetNode(semtrexID);
TreeNode* resultTree = _t_embody_from_match(&defs, match, semtrex);
return RegisterNode(resultTree);
}
Match A Tree
public unsafe Tuple<bool, Guid> Match(Guid semtrexID, Guid treeToMatchID)
{
TreeNode* semtrex = GetNode(semtrexID);
TreeNode* treeToMatch = GetNode(treeToMatchID);
TreeNode* resultTree;
int matchState = _t_matchr(semtrex, treeToMatch, &resultTree);
Guid guid = Guid.Empty;
if (matchState == 1)
{
guid = RegisterNode(resultTree);
}
return new Tuple<bool, Guid>(matchState == 1, guid);
}
Testing the Match
public unsafe bool MatchTest(Guid semtrexID, Guid matchAgainstID)
{
TreeNode* semtrex = GetNode(semtrexID);
TreeNode* matchAgainst = GetNode(matchAgainstID);
int ret = _t_match(semtrex, matchAgainst);
return ret == 1;
}
Putting the Pieces Together
In SemanticUIController.cs, you can see how all of these pieces are put together when handling the UI events.
Parsing A String to an ASCII Tree
public void CreateStructuresAndSymbols()
{
symbolMap = new Dictionary<string, SemanticID>();
structureMap = new Dictionary<string, SemanticID>();
ApplicationController.CeptrInterface.CreateStructureAndSymbolNodes();
foreach (string symbolName in ApplicationModel.SymbolRefCount.Keys)
{
if (!symbolMap.ContainsKey(symbolName))
{
Symbol symbol = FindSymbolInTree(symbolName, ApplicationController.SymbolEditorController.View.TreeView.Nodes);
if (symbol == null)
{
throw new Exception("The symbol " + symbolName + " should have been found in the tree.");
}
SemanticID topStructure = ApplicationController.Recurse(symbol, structureMap, symbolMap);
SemanticID symbolID = ApplicationController.CeptrInterface.DeclareSymbol(
ApplicationController.CeptrInterface.RootSymbolsNode, topStructure, symbolName);
symbolMap[symbolName] = symbolID;
}
}
}
public void ToTree(object sender, EventArgs args)
{
CreateStructuresAndSymbols();
asciiTreeID = ApplicationController.CeptrInterface.GetTree(View.tbInputString.Text);
DumpOutput(asciiTreeID);
}
The above code maps the symbol tree (managed simply be the .NET TreeView
control) into a dictionary of symbols and structures suitable for the C code to digest. Once the symbols and structures have been initialized, the interface function GetTree
is called to return a Guid
of the ASCII tree.
Creating a Semtrex Tree from an ASCII String
public void ToSemtrex(object sender, EventArgs args)
{
parseExprID = ApplicationController.CeptrInterface.ParseSemtrex(
ApplicationController.CeptrInterface.RootSymbolsNode,
ApplicationController.CeptrInterface.RootStructuresNode,
View.tbParseExpr.Text
);
DumpOutput(parseExprID);
}
Matching the ASCII Input String
Using the tree ID's obtained from the methods described above, we match the latlon input string against the parsed Semtrex tree, :
public void Match(object sender, EventArgs args)
{
Tuple<bool, Guid> result = ApplicationController.CeptrInterface.Match(parseExprID, asciiTreeID);
if (result.Item1)
{
View.tbMatchResult.Text = "True";
matchResultTreeID = result.Item2;
DumpOutput(matchResultTreeID);
}
else
{
View.tbMatchResult.Text = "False";
View.tbSemtrexTree.Text = "";
}
}
Embody the Result
We embody the match result tree with the input values:
public void Embody(object sender, EventArgs args)
{
embodyID = ApplicationController.CeptrInterface.Embody(
ApplicationController.CeptrInterface.RootSymbolsNode,
ApplicationController.CeptrInterface.RootStructuresNode,
matchResultTreeID,
asciiTreeID
);
DumpOutput(embodyID);
}
Parsing the Match String
As mentioned earlier, we again need to parse the human-readable string representing our match expression into a Semtrex expression:
public void ToMatchAgainst(object sender, EventArgs args)
{
matchAgainstID = ApplicationController.CeptrInterface.ParseSemtrex(
ApplicationController.CeptrInterface.RootSymbolsNode,
ApplicationController.CeptrInterface.RootStructuresNode,
View.tbMatchAgainst.Text
);
DumpOutput(matchAgainstID);
}
Matching Against the Embodied Semtrex
Finally, in our last step, we use the embodyID
from above along with the matchAgainstID
to match the embodied tree to the Semtrex match expression:
public void MatchAgainstMatches(object sender, EventArgs args)
{
bool ret = ApplicationController.CeptrInterface.MatchTest(matchAgainstID, embodyID);
View.tbMatchResult2.Text=(ret ? "True" : "False");
}
On the 'C' Side
As mentioned in the introduction, Eric implemented the actual algorithms in C for portability and performance. He's also written a large suite of unit tests which we'll give an example of as well.
Semtrex Parsing
Some of the Semtrex parsing was inspired by Russ Cox' article Regular Expression Matching Can Be Simple and Fast, and in particular the algorithm for flattening an non-finite automata (NFA) to a finite one. Here, for example, is the C code that create the finite automata:
char * __stx_makeFA(T *t,SState **in,Ptrlist **out,int level,int *statesP) {
SState *s,*i,*last,*s1,*s2;
Ptrlist *o,*o1;
char *err;
int state_type = -1;
int x;
SemanticID group_symbol;
int group_id;
T *v;
int c = _t_children(t);
Symbol sym = _t_symbol(t);
switch(sym.id) {
case SEMTREX_VALUE_LITERAL_ID:
case SEMTREX_VALUE_LITERAL_NOT_ID:
state_type = StateValue;
s = state(state_type,statesP);
s->data.value.flags = (sym.id == SEMTREX_VALUE_LITERAL_NOT_ID) ? LITERAL_NOT : 0;
v = _t_child(t,1);
if (!v) {
raise_error0("expecting value or SEMTREX_VALUE_SET as first child of SEMTREX_VALUE_LITERAL");
}
if (semeq(_t_symbol(v),SEMTREX_VALUE_SET)) s->data.value.flags |= LITERAL_SET;
s->data.value.values = _t_clone(v);
*in = s;
s->transition = level;
*out = list1(&s->out);
break;
case SEMTREX_SYMBOL_LITERAL_ID:
case SEMTREX_SYMBOL_LITERAL_NOT_ID:
state_type = StateSymbol;
v = _t_child(t,1);
int is_set;
Symbol vsym = _t_symbol(v);
if (!v || !((is_set = semeq(SEMTREX_SYMBOL_SET,vsym)) || semeq(SEMTREX_SYMBOL,vsym))) {
raise_error0("expecting SEMTREX_SYMBOL_SET or SEMTREX_SYMBOL as first child of SEMTREX_SYMBOL_LITERAL");
}
if (c > 2) return "Symbol literal must have 0 or 1 children other than the symbol/set";
s = state(state_type,statesP);
s->data.symbol.flags = (sym.id == SEMTREX_SYMBOL_LITERAL_NOT_ID) ? LITERAL_NOT : 0;
if (is_set) s->data.symbol.flags |= LITERAL_SET;
s->data.symbol.symbols = _t_clone(v);
*in = s;
if (c > 1) {
s->transition = TransitionDown;
err = __stx_makeFA(_t_child(t,2),&i,&o,level-1,statesP);
if (err) return err;
s->out = i;
*out = o;
}
else {
s->transition = level;
*out = list1(&s->out);
}
break;
case SEMTREX_SYMBOL_ANY_ID:
state_type = StateAny;
if (c > 1) return "Symbol any must have 0 or 1 children";
s = state(state_type,statesP);
*in = s;
if (c > 0) {
s->transition = TransitionDown;
err = __stx_makeFA(_t_child(t,1),&i,&o,level-1,statesP);
if (err) return err;
s->out = i;
*out = o;
}
else {
s->transition = level;
*out = list1(&s->out);
}
break;
case SEMTREX_SEQUENCE_ID:
if (c == 0) return "Sequence must have children";
last = 0;
for(x=c;x>=1;x--) {
err = __stx_makeFA(_t_child(t,x),&i,&o,level,statesP);
if (err) return err;
if (last) patch(o,last,level);
else *out = o;
last = i;
*in = i;
}
break;
case SEMTREX_OR_ID:
if (c != 2) return "Or must have 2 children";
s = state(StateSplit,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
err = __stx_makeFA(_t_child(t,2),&i,&o1,level,statesP);
if (err) return err;
s->out1 = i;
*out = append(o,o1);
break;
case SEMTREX_ZERO_OR_MORE_ID:
if (c != 1) return "Star must have 1 child";
s = state(StateSplit,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
patch(o,s,level);
*out = list1(&s->out1);
break;
case SEMTREX_ONE_OR_MORE_ID:
if (c != 1) return "Plus must have 1 child";
s = state(StateSplit,statesP);
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
*in = i;
s->out = i;
patch(o,s,level);
*out = list1(&s->out1);
break;
case SEMTREX_ZERO_OR_ONE_ID:
if (c != 1) return "Question must have 1 child";
s = state(StateSplit,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
*out = append(o,list1(&s->out1));
break;
case SEMTREX_GROUP_ID:
if (c != 1) return "Group must have 1 child";
s = state(StateGroupOpen,statesP);
*in = s;
group_symbol = *(SemanticID *)_t_surface(t);
group_id = ++G_group_id;
s->data.groupo.symbol = group_symbol;
s->data.groupo.uid = group_id;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
s1 = state(StateGroupClose,statesP);
patch(o,s1,level);
s1->data.groupc.openP = s;
*out = list1(&s1->out);
break;
case SEMTREX_DESCEND_ID:
if (c != 1) return "Descend must have 1 child";
s = state(StateDescend,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level-1,statesP);
if (err) return err;
s->out = i;
*out = o;
break;
case SEMTREX_NOT_ID:
if (c != 1) return "Not must have 1 child";
s = state(StateNot,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
*out = append(o,list1(&s->out1));
break;
case SEMTREX_WALK_ID:
if (c != 1) return "Walk must have 1 child";
s = state(StateWalk,statesP);
*in = s;
err = __stx_makeFA(_t_child(t,1),&i,&o,level,statesP);
if (err) return err;
s->out = i;
*out = o;
break;
default:
return "Unknown SEMTREX SYMBOL";
}
return 0;
}
SState * _stx_makeFA(T *t,int *statesP) {
SState *in;
Ptrlist *o;
G_group_id = 0;
char *err = __stx_makeFA(t,&in,&o,0,statesP);
if (err != 0) {raise_error0(err);}
patch(o,&matchstate,0);
return in;
}
Unit Tests
We'll look at the unit test for the above code. First, a couple macros:
#define _sl(t,s) __sl(t,0,1,s)
#define _sln(t,s) __sl(t,1,1,s)
These are helpers for calling a function to create a Semtrex symbols set:
T *__sl(T *p, int not,int count, ...) {
va_list symbols;
T *t = _t_newr(p,not ? SEMTREX_SYMBOL_LITERAL_NOT : SEMTREX_SYMBOL_LITERAL);
T *ss = count > 1 ? _t_newr(t,SEMTREX_SYMBOL_SET) : t;
va_start(symbols,count);
int i;
for(i=0;i<count;i++) {
_t_news(ss,SEMTREX_SYMBOL,va_arg(symbols,Symbol));
}
va_end(symbols);
return t;
}
This is employed in the function that creates the test Semtrex:
T *_makeTestSemtrex1() {
T *s = _sl(0,TEST_STR_SYMBOL);
T *ss = _t_newi(s,SEMTREX_SEQUENCE,0);
T *s1 = _sl(ss,sy1);
T *s11 = _sl(s1,sy11);
T *s111 = _sl(s11,sy111);
T *s2 = _sl(ss,sy2);
T *s3 = _sl(ss,sy3);
return s;
}
Utilized by the test function:
void testMakeFA() {
SState *s1, *s2, *s3, *s4, *s5, *s6;
T *s = _makeTestSemtrex1();
int states = 0;
SState *sa = _stx_makeFA(s,&states);
spec_is_equal(states,6);
spec_state_equal(sa,StateSymbol,TransitionDown,TEST_STR_SYMBOL);
s1 = sa->out;
spec_state_equal(s1,StateSymbol,TransitionDown,sy1);
s2 = s1->out;
spec_state_equal(s2,StateSymbol,TransitionDown,sy11);
s3 = s2->out;
spec_state_equal(s3,StateSymbol,-2,sy111);
s4 = s3->out;
spec_state_equal(s4,StateSymbol,TransitionNextChild,sy2);
s5 = s4->out;
spec_state_equal(s5,StateSymbol,TransitionUp,sy3);
s6 = s5->out;
spec_is_equal(s6->type,StateMatch);
spec_is_ptr_equal(s6->out,NULL);
_stx_freeFA(sa);
_t_free(s);
}
Semtrex Tokens - Detail Section for Regular Expression Geeks
The format for this section describes each:
- Semantic symbol (the sub-header in this section)
- Its textual representation
- What it matches
- Example
- Additional explanation
SEMTREX_SYMBOL_LITERAL
The name of the symbol that it matches
It matches a tree node of a specific symbol
/%HomeLocation/(lat=42.25,lon=73.25)
HomeLocation is the symbol literal
SEMTREX_VALUE_LITERAL
The name of the symbol = a value (quoted from text)
It matches a tree node of the given symbol and a specific value
/%HomeLocation/(lat=42.25,lon=73.25)
lat=42.25 matches on both the symbol "lat" and the value "42.25"
SEMTREX_SYMBOL_LITERAL_NOT
! followed by the symbol name
It matches a tree node of any symbol except the given one
!PUNCTUATION
SEMTREX_VALUE_LITERAL_NOT
The name of the symbol != a value
It matches a tree node of the given symbol and any value that doesn't match it
ASCII_CHAR!=' '+
Matches any ASCII character node whose value doesn't equal space
SEMTREX_SYMBOL_SET
{comma delimited symbols}
It matches a tree node in the given set of symbols
{lat, lon}
Matches a lat or lon node
SEMTREX_VALUE_SET
symbol = { comma delimited values }
symbol != { comma delimited values }
It matches a tree node whose symbol is the given type and whose value matches (or doesn't match) a value in the set
ASCII_CHAR!={'/','?',' '}
Matches an ASCII character with one of /, ?, or space
SEMTREX_ZERO_OR_MORE
The textual representation is a * (asterisk)
It matches zero or more sibling nodes
ASCII_CHAR!={'/','?',' '}*
Matches zero or more ASCII characters nodes whose value is not in the set /, ?, and space
SEMTREX_ONE_OR_MORE
The textual representation is a + (plus)
It matches one or more sibling nodes
ASCII_CHAR!={'/','?',' '}+
Matches one or more ASCII characters nodes whose value is not in the set /, ?, and space
SEMTREX_ZERO_OR_ONE
The textual representation is a ? (questionmark)
It matches zero one sibling nodes
ASCII_CHAR!={'/','?',' '}?
Matches zero or one ASCII characters nodes whose value is not in the set /, ?, and space
SEMTREX_SYMBOL_ANY
The textual representation is a . (dot)
It matches a tree node of any symbol
.*
Zero or more of any symbol
SEMTREX_OR
The textual representation is a | (pipe)
It matches a tree that matches one expression or another express. On either side of the or can be any Semtrex expression
lon=72.25 | lon=68.32
HomeLocation | BusinessLocation
The above illustrates matching two different expressions. The first one is a value literal match, the second one is a symbol match.
SEMTREX_SEQUENCE
The textual representation is a , (comma)
Matches a sibling tree followed by another sibling tree, etc.
ASCII_CHAR='H',ASCII_CHAR='T',ASCII_CHAR='T',ASCII_CHAR='P'
Matches on the four ASCII characters HTTP
SEMTREX_GROUP
<name : expr>
Allows you to associate a symbolic name with what matched or group larger expressions for use as part of other semtrex expressions.
<HomeLocation:<lat:ASCII_CHAR+>,ASCII_CHAR=',',<lon:ASCII_CHAR+>>
This a naming of a sequence of a named group, a comma, and another named group.
SEMTREX_WALK
The textual representation is a % (percent sign)
The following Semtrex expression can match anywhere in the tree rather than the current matching node position.
/%HomeLocation/(lat=42.25,lon=73.25)
This would match on a specific home location found somewhere in the tree rather at the root (in this example.)
Siblings
() Parentheses for disambiguating what's a sibling and what's a child. There is no Semtrex symbol for this because a tree encodes this information directly.
References
1. http://en.wikipedia.org/wiki/Semantics
2. http://en.wikipedia.org/wiki/Great-circle_distance
3. "W3C Semantic Web Activity". World Wide Web Consortium (W3C). November 7, 2011. Retrieved November 26, 2011.
4. http://en.wikipedia.org/wiki/Tim_Berners-Lee
5. Berners-Lee, Tim; James Hendler; Ora Lassila (May 17, 2001). "The Semantic Web". Scientific American Magazine. Retrieved March 26, 2008.
6. http://dictionary.reference.com/browse/Phylogeny
7. http://en.wikipedia.org/wiki/Parse_tree
8. http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form
9. http://en.wikipedia.org/wiki/Ontology_engineering
10. http://tools.ietf.org/html/rfc5321
11. http://json.org/