Introduction
In line-of-business projects, you frequently need to generate complex schemas. This article outlines how to rapidly prototype a high-quality, maintainable schema from sample XML, so that derivative object models generate cleanly on all platforms.
Background
There are many tools for creating and managing schemas, but most are "not fun". Starting from a blank screen in a complex tool is daunting, especially when you're writing in an abstract language like XML Schema (XSD).
It's more productive to create a sample, then use existing tools to generate the schema. The problem with these schema-generating tools is that they nest complex types. Nesting complex types causes three problems:
- It's ugly and hard to maintain
- Generators will build very ugly objects from this kind of schema
- It does not follow general industry practices for XML Schema (msdata namespace)
By Example
I start my data modeling from a sample. In this case, I want to model Cub Scout pinewood derby race data (yes, I have an 8-year-old boy).
<Derby>
  <Racers>
    <Group Name="Den7">
      <Cub Id="1" First="Johny" Last="Racer" Place="1"/>
      <Cub Id="2" First="Stan" Last="Lee" Place="3"/>
    </Group>
    ...
If I run xsd.exe (included in the .NET SDK) on that XML, it generates XSD like:
<xs:schema id="Derby" xmlns=""
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
  <xs:element name="Derby"
      msdata:IsDataSet="true"
      msdata:UseCurrentLocale="true">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="Racers">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Group"
                  minOccurs="0"
                  maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Cub"
                        minOccurs="0"
                        maxOccurs="unbounded">
                      <xs:complexType>
                        ...
Notice all the nesting. When you then run xsd.exe on the generated derby.xsd, it generates objects with names like DerbyRacersGroupCub. Bleck!
The Better Schema
<xs:schema xmlns=""
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Derby" type="DerbyInfo" />
  <xs:complexType name="DerbyInfo">
    <xs:sequence>
      <xs:element name="Racers" type="RacersInfo" />
      <xs:element name="Races" type="RacesInfo" />
    </xs:sequence>
  </xs:complexType>
  ...
Improve Xml2Xsd
So I set out to solve all these problems and built a better/simpler generator.
Algorithm Overview
- Open an XDocument for the sample XML.
- Read all the elements and build a dictionary of XPaths. I used a dictionary, but a List<string> with Distinct() could have worked too.
- From the list of XPaths, walk through every XPath and build the attributes and elements, making sure to reference each new element's type instead of nesting it.
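The first two steps above can be sketched in isolation. This is a minimal, simplified variant and not the project's RecurseAllXPaths: the class name XPathCollector and its Collect method are hypothetical, and unlike the article's code each element's own name is included in its path. It uses a sorted set, so duplicates are dropped and a parent path always sorts before its children.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the sample project.
public static class XPathCollector
{
    // Returns the distinct element and attribute paths found in the sample.
    // Ordinal sorting places a parent path before every path that extends it.
    public static List<string> Collect(XDocument doc)
    {
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        Visit(string.Empty, doc.Root, paths);
        return paths.ToList();
    }

    private static void Visit(string parentPath, XElement elem, ISet<string> paths)
    {
        // The set silently ignores paths it has already seen.
        var path = parentPath + "/" + elem.Name.LocalName;
        paths.Add(path);

        // One path per attribute, skipping xmlns declarations.
        foreach (var attr in elem.Attributes().Where(a => !a.IsNamespaceDeclaration))
            paths.Add(path + "/@" + attr.Name.LocalName);

        // Recurse into child elements with the extended path.
        foreach (var child in elem.Elements())
            Visit(path, child, paths);
    }
}
```

Calling XPathCollector.Collect(XDocument.Load("Derby.xml")) on the derby sample would yield one entry per distinct path, such as /Derby/Racers/Group/Cub and /Derby/Racers/Group/Cub/@Id, no matter how many Cub elements the sample contains.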
High Level Static Method
public static XDocument Generate(XDocument content, string targetNamespace)
{
    // Reset the static state left over from any previous run.
    xpaths.Clear();
    elements.Clear();
    recurseElements.Clear();

    // Collect every distinct XPath in the sample, starting at the root element.
    RecurseAllXPaths(string.Empty, content.Elements().First());
    target = XNamespace.Get(targetNamespace);

    // Build schema nodes in sorted XPath order; attribute paths
    // return null and are filtered out.
    var compTypes = xpaths.Select(k => k.Key)
        .OrderBy(o => o)
        .Select(k => ComplexTypeElementFromXPath(k))
        .Where(q => null != q).ToArray();

    // Replace the first entry with the element declaration held in its sequence.
    compTypes[0] = compTypes.First().Element(xs +
        "sequence").Element(xs + "element");

    return new XDocument(new XElement(target + "schema",
        new XAttribute("elementFormDefault", "qualified"),
        new XAttribute("targetNamespace", targetNamespace),
        new XAttribute(XNamespace.Xmlns + "xs",
            "http://www.w3.org/2001/XMLSchema"),
        compTypes));
}
Recurse All XPaths
For each element, check whether its XPath is new, look for repeated element names (recursively defined elements), and track them.
static void RecurseAllXPaths(string xpath, XElement elem)
{
    var missingXpath = !xpaths.ContainsKey(xpath);
    var lclName = elem.Name.LocalName;
    var hasLcl = elements.ContainsKey(lclName);

    // A local name already seen, arriving via a new XPath, is tracked
    // as a repeating (possibly recursively defined) element.
    if (hasLcl && missingXpath)
        recurseElements.Add(lclName);
    else if (!hasLcl)
        elements.Add(lclName, true);

    if (missingXpath)
        xpaths.Add(xpath, null);

    // Record one XPath per attribute.
    elem.Attributes().ToList().ForEach(attr =>
    {
        var xpath1 = string.Format("{0}/@{1}", xpath, attr.Name);
        if (!xpaths.ContainsKey(xpath1))
            xpaths.Add(xpath1, null);
    });

    // Recurse into the children, extending the path with this element's name.
    elem.Elements().ToList().ForEach(fe => RecurseAllXPaths(
        string.Format("{0}/{1}", xpath, lclName), fe));
}
Generating Schema From XPaths
Now that we have a list of XPaths, we need to generate the appropriate schema for them.
private static XElement ComplexTypeElementFromXPath(string xp)
{
    // The last path segment names the node; a leading '@' marks an attribute.
    var parts = xp.Split('/');
    var last = parts.Last();
    var isAttr = last.StartsWith("@");
    var parent = ParentElementByXPath(parts);
    return (isAttr) ? BuildAttributeSchema(xp, last, parent) :
        BuildElementSchema(xp, last, parent);
}
BuildAttributeSchema
private static XElement BuildAttributeSchema(string k,
    string last, XElement parent)
{
    // Every attribute is declared with type "string".
    var elem0 = new XElement(xs + "attribute",
        new XAttribute("name", last.TrimStart('@')),
        new XAttribute("type", "string"));
    if (null != parent)
        parent.Add(elem0);
    xpaths[k] = elem0;

    // Attributes live inside their parent type, so there is no new type to return.
    return null;
}
BuildElementSchema
This one is not as straightforward as BuildAttributeSchema; we have to make sure the appropriate type references are made on the parent node. It's a little hairy, but it works nicely.
private static XElement BuildElementSchema(string k,
    string last, XElement parent)
{
    XElement seqElem;
    if (null != parent)
    {
        // Reuse the parent's sequence, creating it on first use.
        seqElem = parent.Element(xs + "sequence");
        if (null == seqElem)
            parent.AddFirst(seqElem = new XElement(xs + "sequence"));
    }
    else
    {
        // No parent: hold the element declaration in a standalone sequence.
        seqElem = new XElement(xs + "sequence");
    }

    // Reference a named type ("<name>Info") instead of nesting it.
    var lastInfo = last + "Info";
    var elem0 = new XElement(xs + "element",
        new XAttribute("name", last),
        new XAttribute("type", lastInfo));
    seqElem.Add(elem0);

    // Create the named complexType and remember it for this XPath.
    return xpaths[k] = new XElement(xs + "complexType",
        new XAttribute("name", lastInfo));
}
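To see the "reference, don't nest" idea on its own, here is a small self-contained sketch using LINQ to XML. It is an illustration under assumptions rather than the project's code: the class FlatSchemaSketch and its AddChild and Build methods are hypothetical names, the element names are hard-coded from the derby sample, and attributes and occurrence constraints are left out.

```csharp
using System.Xml.Linq;

// Hypothetical illustration; not part of the sample project.
public static class FlatSchemaSketch
{
    static readonly XNamespace xs = "http://www.w3.org/2001/XMLSchema";

    // Adds <xs:element name="X" type="XInfo"/> to the parent's sequence and
    // returns a new, empty complexType named "XInfo" for the caller to place
    // at the top level of the schema.
    public static XElement AddChild(XElement parentType, string childName)
    {
        var seq = parentType.Element(xs + "sequence");
        if (seq == null)
            parentType.AddFirst(seq = new XElement(xs + "sequence"));

        var typeName = childName + "Info";
        seq.Add(new XElement(xs + "element",
            new XAttribute("name", childName),
            new XAttribute("type", typeName)));

        return new XElement(xs + "complexType",
            new XAttribute("name", typeName));
    }

    // Builds a three-level schema in which every complexType is a direct
    // child of xs:schema.
    public static XElement Build()
    {
        var derby = new XElement(xs + "complexType",
            new XAttribute("name", "DerbyInfo"));
        var racers = AddChild(derby, "Racers");
        var group = AddChild(racers, "Group");

        return new XElement(xs + "schema",
            new XAttribute(XNamespace.Xmlns + "xs", xs.NamespaceName),
            new XElement(xs + "element",
                new XAttribute("name", "Derby"),
                new XAttribute("type", "DerbyInfo")),
            derby, racers, group);
    }
}
```

Printing FlatSchemaSketch.Build() shows DerbyInfo, RacersInfo and GroupInfo as sibling types, each referring to the next by name, which is the shape that lets class generators emit short, readable type names.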
Using the Code
- Download the sample project
- Build it in VS2010 or Express
- Press F5 to run the solution in Debug
- Open Derby.Xsd in bin/Debug to see the result
If you're still reading, I strongly recommend stepping through the project with F10/F11 to get into the details. Have fun!
Enhancements
- Elements without children (a.k.a. value elements)
- Derive data types from the contents of the sample XML (integer, boolean, DateTime, etc.)
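As one possible approach to the data type item above, the sketch below guesses an XSD built-in type from sample values. It is an assumption-laden illustration and not the project's implementation: XsdTypeGuesser, Guess and GuessAll are hypothetical names, only a handful of types are recognised, and only one xs:dateTime layout (no fractional seconds or time zone) is accepted.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper for illustration; not part of the sample project.
public static class XsdTypeGuesser
{
    // Guesses a type for a single sample value, falling back to xs:string.
    public static string Guess(string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return "xs:string";
        value = value.Trim();
        var inv = CultureInfo.InvariantCulture;

        long l;
        if (long.TryParse(value, NumberStyles.AllowLeadingSign, inv, out l))
            return "xs:integer";

        decimal d;
        if (decimal.TryParse(value,
                NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
                inv, out d))
            return "xs:decimal";

        // "1" and "0" are also legal xs:boolean values, but they are
        // classified as integers by the check above.
        if (value == "true" || value == "false") return "xs:boolean";

        DateTime dt;
        if (DateTime.TryParseExact(value, "yyyy-MM-dd'T'HH:mm:ss", inv,
                DateTimeStyles.None, out dt))
            return "xs:dateTime";

        return "xs:string";
    }

    // A type is used only when every sample value agrees on it;
    // mixed or missing evidence falls back to xs:string.
    public static string GuessAll(IEnumerable<string> values)
    {
        var guesses = values.Select(Guess).Distinct().ToList();
        return guesses.Count == 1 ? guesses[0] : "xs:string";
    }
}
```

In the derby sample, the Id and Place attributes would come out as xs:integer, while First and Last would stay xs:string. Falling back to xs:string on any disagreement is a deliberately conservative choice, since a sample rarely covers every value the real data will contain.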
Future Improvements
- Make recursively defined elements work
History