Introduction
This is the second in a series of three articles describing a project that I undertook to solve a specific problem in my working environment. The first article explained how I was dissatisfied with SQL's Data Definition Language, and decided to create an alternative based on XML. By the end of the article, I had presented an XSD for the new format.
I knew I would need some way of transforming my XML into SQL, so I did some Googling, and in the end I considered two alternatives: XSLT or a bespoke program. Tools for code generation usually feature a template file with some sort of 'special' markup that demarcates areas of embedded code, e.g. in an ASP.NET aspx file:
<%=SomeProperty%>
This markup is replaced by the value of the public property SomeProperty of the page class when the page is viewed. XSLT is more extreme, in that it contains fragments of output surrounded by large volumes of XML query language. Consider an XSLT stylesheet designed to do just what my project does: convert an object model into an SQL schema definition. My complaint with this approach to code generation is that it is essentially write-only code. It is hard to visualize the intended output, because it is obscured by the embedded code. Visual Studio can't recognize the fragments of code as being SQL, so it won't syntax colour them, and if you want to generate code that includes any of XML's special characters, then you have to escape them.
So for these reasons, I decided to write a bespoke command-line program, which performs the following steps (sketched in outline after the list):
- Read command line
- Deserialize database schema
- Build code model
- Generate SQL based on model
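As a rough orientation, Main boils down to something like the sketch below. The method GenerateSql is a placeholder and the command-line handling is deliberately naive; this is just the shape of the program, not the real implementation.

using System;
using System.IO;
using System.Xml.Serialization;
using DatabaseSchemaModel;

class Program
{
    static void Main(string[] args)
    {
        // 1. Read command line (placeholder parsing, no error handling)
        string inputPath = args[0];
        string outputPath = args[1];

        // 2-3. Deserialize the database schema; as described below,
        //      building the code model happens during deserialization
        XmlSerializer serializer = new XmlSerializer(typeof(Database));
        Database database;
        using (StreamReader reader = new StreamReader(inputPath))
        {
            database = (Database)serializer.Deserialize(reader);
        }

        // 4. Generate SQL based on the model
        File.WriteAllText(outputPath, GenerateSql(database));
    }

    static string GenerateSql(Database database)
    {
        throw new NotImplementedException(); // covered in the final article
    }
}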
I will leave everything else for the final article, and for now just discuss what a code model is, why it is a good idea, and how I decided to build mine.
Why Have a Code Model?
In the sense that I am using it, 'code model' just means a representation of a data file format written in some programming language. For example, scripts on a web page use the Document Object Model to inspect and modify web pages dynamically. Of course, C# already has a generic XML code model that I could use to read my file format, and I can even get it to validate the XML against my schema. A couple of lines of code and I have my whole file in an XmlDocument, ready to use.
XmlDocument mySchema = new XmlDocument();
mySchema.Load("mySchema.xml");
However, there are several problems with this simplistic approach. When I am writing code to use my data, I have to know the structure of my XML document. This may not always be obvious, because of quirks and limitations in XML or XSD. I also have to know the names of the elements and attributes in my XML document, and put them somewhere in my code as constant strings. For example:
foreach (XmlNode tableNode in mySchema.DocumentElement["tables"].ChildNodes)
{
    XmlElement tableElement = tableNode as XmlElement;
    if (tableElement != null)
    {
        foreach (XmlNode columnNode in tableElement["columns"].ChildNodes)
        {
            XmlElement columnElement = columnNode as XmlElement;
            if (columnElement != null)
            {
                if (columnElement.FirstChild.Name == "nchar")
                {
                    XmlElement ncharElement = columnElement["nchar"];
                    XmlAttribute ncharLengthAttribute = ncharElement.Attributes["length"];
                    int ncharLength = 1;
                    if (ncharLengthAttribute != null)
                    {
                        ncharLength = int.Parse(ncharLengthAttribute.Value);
                    }
                    if (ncharLength > 10)
                    {
                        // ... do something with the column ...
                    }
                }
            }
        }
    }
}
The next problem illustrated by this example is type conversion. The XML code model can't possibly know the type of attributes at compile time, so everything has to be returned as strings and converted manually. The end result is verbose, obfuscated code littered with assumed knowledge. It will also be hard to make any changes to the file format later, because there is no compile-time checking of these assumptions.
A code model solves these problems, essentially by hiding all the assumed knowledge and keeping it in one place. It can smooth out structural artifacts from the XML and convert all the strings to their proper types. Later, changes to the file format can either be hidden from client code, by keeping the code model the same and processing the new format on load, or exposed to the client code as changes that break at compile time and can be confidently fixed.
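To make this concrete, here is a hypothetical fragment showing the idea (this is not the real Column class from the project, which is built quite differently, as described below). The XML names and the string-to-int conversion live in one place, and client code only ever sees typed properties:

using System.Xml;

// Hypothetical sketch only: the element/attribute names and conversions
// are hidden inside the model class instead of scattered through clients.
public class Column
{
    internal Column(XmlElement element)
    {
        Name = element.GetAttribute("name");
        XmlAttribute length = element.Attributes["length"];
        Length = (length != null) ? int.Parse(length.Value) : 1;
    }

    public string Name { get; private set; }
    public int Length { get; private set; }
}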
Code Model Implementation
Microsoft has provided a handy tool called XSD.exe that will generate, from an XSD file, a code model that can be deserialized directly from any XML file that uses the XSD. The XSD for my file format is embedded as a resource in an assembly (see the project DatabaseSchemaModelResource in the sample code), for reasons which will become clear later on. The post-build event of this project runs XSD.exe and generates the file DatabaseSchemaModel.cs, which is included in another project, DatabaseSchemaModel. DatabaseSchemaModelResource is set as a dependency of DatabaseSchemaModel, so that the generated code file is always up to date before DatabaseSchemaModel is compiled.
Note that XSD.exe may be in different places depending on your setup, so if you have problems building, that may be why. To fix it, you'll have to edit the post-build event of the DatabaseSchemaModelResource project.
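For reference, the post-build event will be a command along these lines. The SDK path, schema file name, and output location here are illustrative, since they depend on your installation and on how the solution is laid out; the /namespace switch matches the DatabaseSchemaModel.Raw namespace used below:

"C:\Program Files\Microsoft SDKs\Windows\v7.0A\Bin\xsd.exe" "$(ProjectDir)DatabaseSchema.xsd" /classes /namespace:DatabaseSchemaModel.Raw /outputdir:"$(SolutionDir)DatabaseSchemaModel"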
The generated code is type-safe where possible and provides a great intermediate set of data to work with; much nicer than poking around directly in the XML DOM. But it still was not good enough for me. For one thing, it still directly corresponded to the structure of the XML, and for another, it didn't handle xs:choice very nicely, just treating it as an object. The clincher was the fact that in the file format, objects refer to each other by name (e.g. a primary key refers to one or more columns). In the code model, it would be much nicer if the objects were linked directly (e.g. if a primary key object had a direct reference to one or more column objects). So I decided to call this generated code my 'raw' code model. I would hide it from the client, but it would still save me a lot of work in parsing and validating, and provide a type-safe foundation.
The classes of the public-facing code model are all in the DatabaseSchemaModel namespace, in the project of the same name. The root class of the model is called Database. When thinking about how the model would be loaded from the file, it seemed natural for this class to be XML serializable directly to and from the file format. As I wanted to use my raw code model to do the heavy lifting, it was clear I would have to implement IXmlSerializable. Here is the declaration of the Database class:
...
[XmlRootAttribute(Namespace = "http://...", IsNullable = false)]
[XmlSchemaProvider("MySchema")]
public class Database : IXmlSerializable
{
...
The XmlRootAttribute tells the serializer how to write the root node of the file, in this case telling it to include the namespace declaration. The XmlSchemaProvider attribute tells the serializer which schema to expect when reading the XML. The schema is actually provided to the framework by a public static method, referenced in the attribute. This is where I use the XSD file that is embedded as a resource in the DatabaseSchemaModelResource assembly:
public static XmlQualifiedName MySchema(XmlSchemaSet xs)
{
    XmlSerializer schemaSerializer = new XmlSerializer(typeof(XmlSchema));
    XmlSchema s = (XmlSchema)schemaSerializer.Deserialize(
        new XmlTextReader(new StringReader(
            DatabaseSchemaModelResource.Resources.Schema)), null);
    xs.XmlResolver = new XmlUrlResolver();
    xs.Add(s);
    return new XmlQualifiedName("database", xmlNameSpace);
}
The implementation of ReadXml deserializes the entire raw model (the code generated by XSD.exe), then uses the objects from the raw model to construct the code model. The deserialization looks like this:
public void ReadXml(System.Xml.XmlReader reader)
{
    string outerXml = reader.ReadOuterXml();
    XmlSerializer serializer = new XmlSerializer(typeof(database));
    database db = (database)serializer.Deserialize(new StringReader(outerXml));
    Name = db.name;
    if (db.ExampleFolderPath != null)
    {
        ExampleFolderPath = db.ExampleFolderPath;
    }
    if (db.SQLServerOutputPath != null)
    {
        SQLServerOutputPath = db.SQLServerOutputPath;
    }
    ...
Notice that the first line reads the entire file as a string, which is then passed to a new serializer instance. This is definitely not very efficient, but I think it is unlikely to matter in real use cases. A bigger issue with this approach is that the deserialization process discards any information about the source file, such as the line number where an element occurs. If any errors occur later, during the construction of the model, it is not possible to describe the location of the problem in the exception. I'll explain later how I deal with this problem; but to continue with the ReadXml method, the next step is to walk the raw code model and construct the objects for the code model:
    ...
    ReadContext context = new ReadContext();
    Tables = new IndexedList<string, Table>();
    foreach (table rawTable in db.tables)
    {
        Tables.Add(rawTable.name, new Table(context, rawTable));
    }
    Procedures = new IndexedList<string, Procedure>();
    foreach (procedure rawProcedure in db.procedures)
    {
        Procedures.Add(rawProcedure.name, new Procedure(rawProcedure));
    }
    ...
As you can see, the tables and procedures are stored in an IndexedList, which is an ordered container with dictionary-style lookup by name (sketched later in this section). Each class in the code model has an internal constructor that takes the corresponding raw model class as a parameter. For example, here is part of the internal constructor for Table:
internal Table(
    ...
    table raw
    )
{
    ...
    Name = raw.name;
    foreach (column rawColumn in raw.columns)
    {
        Columns.Add(rawColumn.name, new Column(..., rawColumn));
    }
    ...
The constructor takes its own description from the raw object, then walks the raw columns and creates code model Columns by passing on the raw data. This pattern repeats in all the code model classes. To see another pattern that is shared by all the internal constructors, I will expand the same example:
internal Table(
    ReadContext context,
    table raw
    )
{
    context.ReadingTable = this;
    context.Stack.Push("Reading Table: " + raw.name);
    Name = raw.name;
    foreach (column rawColumn in raw.columns)
    {
        Columns.Add(rawColumn.name, new Column(context, rawColumn));
    }
    ...
    context.ReadingTable = null;
    context.Stack.Pop();
}
The ReadContext class is passed down the stack through all the nested constructors. By setting the ReadingTable property for the duration of the constructor, all the nested constructors can discover which Table they are part of. Similarly, the ReadContext has a stack for reporting progress, and each nested constructor adds some detail to the stack. If an error is found in the data, the current state of this stack provides a way to report the location of the error to the user, e.g.:
Column does not exist: XXXXXXX
Reading Constraint: Primary
Reading Table: Categories
At the end of the constructor, the ReadingTable property is nulled and the detail on the stack is popped, ready for reading the next Table.
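I won't reproduce ReadContext in full here, but its shape can be inferred from how it is used. A minimal sketch, with assumed member types (the real class in the sample code may differ), looks like this:

using System;
using System.Collections.Generic;

// Inferred sketch of ReadContext; see the sample code for the real class.
internal class ReadContext
{
    public ReadContext()
    {
        Stack = new Stack<string>();
        ResolveTableReferences = new MultiDictionary<string, Action<Table>>();
    }

    // The Table currently being read, so that nested constructors
    // can discover which Table they are part of.
    public Table ReadingTable { get; set; }

    // Progress detail, reported to the user if an error is found.
    public Stack<string> Stack { get; private set; }

    // Table name -> delegates to call once that table has been loaded;
    // used to fix up internal references, as described next.
    public MultiDictionary<string, Action<Table>> ResolveTableReferences { get; private set; }
}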
To finish off this section, I'll return to the problem of fixing up internal references. The problem is that a relationship from one Table may refer to another Table that has not been loaded yet. One way of solving this would be to store the name of the referenced table in the class; once the whole model is loaded, it is easy to look up the table by name as needed, and hide the lookup in a property getter. I prefer not to do this, because I don't like the extra data members hanging around, and I don't like the slightly boilerplate code you end up with. Instead, the ReadContext class is responsible for remembering each Table reference that needs to be resolved during deserialization. For example, the Relationship class has a reference to the 'primary key table' that needs to be resolved. The following code is called during its constructor:
internal Relationship(
    ReadContext context,
    relationship raw
    )
{
    ...
    context.ResolveTableReferences.Add(raw.primaryKeyTable.name,
        delegate(Table table)
        {
            PrimaryKeyTable = table;
            foreach (relationshipColumn rawRelationshipColumn in rawPrimaryKeyTableColumns)
            {
                PrimaryKeyColumns.Add(table.ResolveColumn(stackTrace, rawRelationshipColumn.name));
            }
        });
    ...
The ResolveTableReferences property is a MultiDictionary, which means it can have multiple values for a single key. The key is the name of the table to be resolved, and the value being added here is a delegate that will assign a reference to the table when it is called. The advantage of using an anonymous method is that the compiler automatically handles the fact that I've used properties of the Relationship class, by storing a hidden pointer to the Relationship instance (it creates a closure). Later, after all the tables have been deserialized, the following code is called at the end of Database.ReadXml:
    ...
    foreach (var resolveTableReferenceContainer in context.ResolveTableReferences)
    {
        Table table = Tables[resolveTableReferenceContainer.Key];
        foreach (var resolveTableReference in resolveTableReferenceContainer.Value)
        {
            resolveTableReference(table);
        }
    }
}
Each key is a table name to be resolved, and the corresponding value is a container full of delegates. Once the table has been looked up, each delegate in the container is called and the table is passed as a parameter.
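A brief aside on the container types: IndexedList (like MultiDictionary) is not a standard framework class. Based purely on how it is used in this article, a minimal sketch might look like the following; the real implementation in the sample code may well differ:

using System.Collections.Generic;

// Sketch of IndexedList, inferred from usage: it preserves insertion
// order and supports dictionary-style lookup by key.
public class IndexedList<TKey, TValue>
{
    private readonly List<TValue> ordered = new List<TValue>();
    private readonly Dictionary<TKey, TValue> index = new Dictionary<TKey, TValue>();

    public void Add(TKey key, TValue value)
    {
        index.Add(key, value); // throws if the key is duplicated
        ordered.Add(value);
    }

    public TValue this[TKey key]
    {
        get { return index[key]; }
    }

    public IEnumerable<TValue> Values
    {
        get { return ordered; }
    }
}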
With one exception, all the classes in the code model map to types in the XSD, which ultimately map to elements in a database schema, so I won't go into a lot of detail about them. The exception is the Column class. In the XSD, there is a nested element describing the type-specific attributes of the column (e.g. length). In the code model, I have chosen to 'flatten' these properties into the Column class. This is a debatable decision, because it does mean there are several irrelevant properties in any given Column instance. However, my feeling is that at the point you are using a column's type-specific properties, you already know what they are (e.g. that an nchar column has a length), and so it is a bearable compromise. Here is the code of the internal constructor for Column:
internal Column(
    ReadContext context,
    column raw
    )
{
    Name = raw.name;
    Type = raw.ItemElementName.ToString();
    if (raw.allowNullsSpecified)
    {
        AllowNulls = raw.allowNulls;
    }
    switch (raw.Item.GetType().FullName)
    {
        case "DatabaseSchemaModel.Raw.bigint":
            bigint rawBigInt = raw.Item as bigint;
            if (rawBigInt.defaultSpecified)
            {
                Default = rawBigInt.@default.ToString();
            }
            else
            {
                Default = rawBigInt.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.int":
            @int rawInt = raw.Item as @int;
            if (rawInt.defaultSpecified)
            {
                Default = rawInt.@default.ToString();
            }
            else
            {
                Default = rawInt.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.smallint":
            smallint rawSmallInt = raw.Item as smallint;
            if (rawSmallInt.defaultSpecified)
            {
                Default = rawSmallInt.@default.ToString();
            }
            else
            {
                Default = rawSmallInt.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.tinyint":
            tinyint rawTinyInt = raw.Item as tinyint;
            if (rawTinyInt.defaultSpecified)
            {
                Default = rawTinyInt.@default.ToString();
            }
            else
            {
                Default = rawTinyInt.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.decimal":
            @decimal rawDecimal = raw.Item as @decimal;
            if (rawDecimal.precision != null)
            {
                Precision = int.Parse(rawDecimal.precision);
            }
            if (rawDecimal.scale != null)
            {
                Scale = int.Parse(rawDecimal.scale);
            }
            if (rawDecimal.defaultSpecified)
            {
                Default = rawDecimal.@default.ToString();
            }
            else
            {
                Default = rawDecimal.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.decimalScale0":
            decimalScale0 rawDecimalScale0 = raw.Item as decimalScale0;
            Scale = 0;
            if (rawDecimalScale0.precision != null)
            {
                Precision = int.Parse(rawDecimalScale0.precision);
            }
            if (rawDecimalScale0.defaultSpecified)
            {
                Default = rawDecimalScale0.@default.ToString();
            }
            else
            {
                Default = rawDecimalScale0.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.float":
            @float rawFloat = raw.Item as @float;
            if (rawFloat.mantissaBits != null)
            {
                MantissaBits = int.Parse(rawFloat.mantissaBits);
            }
            if (rawFloat.defaultSpecified)
            {
                Default = rawFloat.@default.ToString();
            }
            else
            {
                Default = rawFloat.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.real":
            @real rawReal = raw.Item as @real;
            if (rawReal.defaultSpecified)
            {
                Default = rawReal.@default.ToString();
            }
            else
            {
                Default = rawReal.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.variablePrecisionTime":
            variablePrecisionTime rawVariablePrecisionTime = raw.Item as variablePrecisionTime;
            if (rawVariablePrecisionTime.fractionalSecondsPrecision != null)
            {
                FractionalSecondsPrecision = int.Parse(rawVariablePrecisionTime.fractionalSecondsPrecision);
            }
            if (rawVariablePrecisionTime.@default != null)
            {
                Default = Quote(rawVariablePrecisionTime.@default);
            }
            else
            {
                Default = rawVariablePrecisionTime.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.char":
            @char rawChar = raw.Item as @char;
            if (rawChar.length != null)
            {
                if (rawChar.length == "max")
                {
                    MaxLength = true;
                }
                else
                {
                    Length = int.Parse(rawChar.length);
                }
            }
            if (rawChar.@default != null)
            {
                Default = Quote(rawChar.@default);
            }
            else
            {
                Default = rawChar.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.nchar":
            nchar rawNChar = raw.Item as nchar;
            if (rawNChar.length != null)
            {
                if (rawNChar.length == "max")
                {
                    MaxLength = true;
                }
                else
                {
                    Length = int.Parse(rawNChar.length);
                }
            }
            if (rawNChar.@default != null)
            {
                Default = Quote(rawNChar.@default);
            }
            else
            {
                Default = rawNChar.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.bit":
            bit rawBit = raw.Item as bit;
            if (rawBit.defaultSpecified)
            {
                Default = rawBit.@default ? "1" : "0";
            }
            else
            {
                Default = rawBit.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.smallmoney":
            smallmoney rawSmallMoney = raw.Item as smallmoney;
            if (rawSmallMoney.defaultSpecified)
            {
                Default = rawSmallMoney.@default.ToString();
            }
            else
            {
                Default = rawSmallMoney.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.money":
            money rawMoney = raw.Item as money;
            if (rawMoney.defaultSpecified)
            {
                Default = rawMoney.@default.ToString();
            }
            else
            {
                Default = rawMoney.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.parameterlessStringType":
            parameterlessStringType rawParameterlessStringType = raw.Item as parameterlessStringType;
            if (rawParameterlessStringType.@default != null)
            {
                Default = Quote(rawParameterlessStringType.@default);
            }
            else
            {
                Default = rawParameterlessStringType.defaultExpression;
            }
            break;
        case "DatabaseSchemaModel.Raw.uniqueidentifier":
            uniqueidentifier rawUniqueIdentifier = raw.Item as uniqueidentifier;
            if (rawUniqueIdentifier.@default != null)
            {
                Default = Quote(rawUniqueIdentifier.@default);
            }
            else
            {
                Default = rawUniqueIdentifier.defaultExpression;
            }
            break;
    }
    withIdentity rawWithIdentity = raw.Item as withIdentity;
    if (rawWithIdentity != null && rawWithIdentity.identity != null)
    {
        Identity = new Identity(rawWithIdentity.identity);
    }
}
As you can see, the switch is used to identify the actual type of the nested element in the raw code model. Type-specific properties are then copied into the Column, and the irrelevant properties are left uninitialised. There is one more thing worth noting at the end of this method. withIdentity is an interface that corresponds to the withIdentity XSD group. XSD.exe does not generate this interface, or acknowledge the relationship between types that use the same group in an XSD. However, it does generate partial classes, which makes it possible to add the interface by hand. This is done in 'DatabaseSchemaModelExtended.cs':
namespace DatabaseSchemaModel.Raw
{
    public interface withIdentity
    {
        identity identity
        {
            get;
        }
    }

    public partial class bigint : withIdentity
    { }

    public partial class tinyint : withIdentity
    { }

    public partial class @int : withIdentity
    { }

    public partial class smallint : withIdentity
    { }

    public partial class decimalScale0 : withIdentity
    { }
}
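One loose end: the Quote helper called from the Column constructor is not shown here. It presumably wraps a string default in single quotes so that it can be emitted into the SQL; a plausible sketch, assuming T-SQL quoting rules, would be:

// Assumed implementation of the Quote helper: wraps a default value in
// single quotes, doubling any embedded quotes per T-SQL string rules.
private static string Quote(string value)
{
    return "'" + value.Replace("'", "''") + "'";
}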
Using the Code Model
In contrast to the raw XML example, loading a database schema with the code model looks like this:
XmlSerializer serializer = new XmlSerializer(typeof(Database));
Database database = (Database)serializer.Deserialize(new StreamReader("mySchema.xml"));
And the (admittedly rather artificial) example of finding an nchar column with a length greater than 10 looks like this:
foreach (Table table in database.Tables.Values)
{
    foreach (Column column in table.Columns.Values)
    {
        if (column.Type == "nchar" && column.Length > 10)
        {
            // ... do something with the column ...
        }
    }
}
I find this more readable, and I believe the intent is clearer.
Conclusion
Although there is a modest up-front investment in building a code model, the resulting client code is cleaner and easier to maintain. In the next and final article in the series, I'll be demonstrating this with some SQL code generation, and pulling together the console app that performs the complete transformation.
History
- 29th November, 2010: Initial post