
XmlToXsd - A Better Schema Generator

7 Dec 2010 | CPOL | 2 min read
Build better schema for rapid data model prototyping.

Introduction

In line-of-business projects, you frequently need to generate complex schemas. This article shows how to rapidly prototype a high-quality, maintainable schema from sample XML, so that object models derived from it generate cleanly on all platforms.

Background

There are many tools for creating and managing schemas, and most are "not fun". Starting from a blank screen in a complex tool is daunting, especially when you're writing in an abstract language like XML Schema (XSD).

It's much easier to create a sample first, then use existing tools to generate a schema from it. The problem with these schema-generating tools is that they nest complex types. Nesting complex types causes three problems:

  • It's ugly/hard to maintain
  • Generators will build very ugly objects from this kind of schema
  • It does not follow general industry practices for XML Schema (msdata namespace)

By Example

I started my data modeling using a sample; in this case, I want to model Cub Scout pinewood derby race data (yes, I have an 8yo boy).

XML
<Derby>
    <Racers>
        <Group Name="Den7">
            <Cub Id="1" First="Johny" Last="Racer" Place="1"/>
            <Cub Id="2" First="Stan" Last="Lee" Place="3"/>
        </Group>
        ...

If I run xsd.exe (included in the .NET SDK) on that XML, it generates an XSD like this:

XML
<xs:schema id="Derby" xmlns="" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
  <xs:element name="Derby" 
       msdata:IsDataSet="true" 
       msdata:UseCurrentLocale="true">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="Racers">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Group" 
                   minOccurs="0" 
                   maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Cub" 
                         minOccurs="0" 
                         maxOccurs="unbounded">
                      <xs:complexType>
                      ...

Notice all the nesting. When you then run xsd.exe on the generated derby.xsd, it generates classes with names like DerbyRacersGroupCub. Bleck!

The Better Schema

XML
<xs:schema xmlns="" 
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Derby" type="DerbyInfo" />
  <xs:complexType name="DerbyInfo">
    <xs:sequence>
      <xs:element name="Racers" type="RacersInfo" />
      <xs:element name="Races" type="RacesInfo" />
    </xs:sequence>
  </xs:complexType>
  ...

Improve Xml2Xsd

So I set out to solve all these problems and built a better/simpler generator.

Algorithm Overview

  • Open an XDocument for the sample XML.
  • Read all the elements and build a dictionary of XPaths. I used a dictionary, but a List<string> with Distinct() could have worked too.
  • Walk through the list of XPaths and build the attributes and elements, making sure each new element references a named type instead of nesting it.
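
The first two steps above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from the download: the class name XPathCollector and the method Collect are hypothetical, and it uses a sorted set where the article uses a dictionary. It also assumes each element's path includes its own local name.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper (not part of the download): collects the distinct
// element and attribute paths that occur in a sample document.
public static class XPathCollector
{
    public static SortedSet<string> Collect(XDocument doc)
    {
        // Ordinal comparison keeps the ordering culture-independent.
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        if (doc.Root != null)
            Visit(string.Empty, doc.Root, paths);
        return paths;
    }

    private static void Visit(string parentPath, XElement elem, ISet<string> paths)
    {
        // Assumption: an element's path ends with its own local name.
        var path = parentPath + "/" + elem.Name.LocalName;
        paths.Add(path); // a set silently ignores duplicates

        foreach (var attr in elem.Attributes())
        {
            // Skip xmlns declarations; they are not data attributes.
            if (!attr.IsNamespaceDeclaration)
                paths.Add(path + "/@" + attr.Name.LocalName);
        }

        foreach (var child in elem.Elements())
            Visit(path, child, paths);
    }
}
```

Called on a document with one Group holding two Cub elements, Collect returns each path once, so the repeated Cub contributes a single /Derby/Racers/Group/Cub entry plus one entry per distinct attribute.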

High Level Static Method

C#
public static XDocument Generate(XDocument content, string targetNamespace)
{
    xpaths.Clear();
    elements.Clear();
    recurseElements.Clear();

    RecurseAllXPaths(string.Empty, content.Elements().First());

    target = XNamespace.Get(targetNamespace);

    var compTypes = xpaths.Select(k => k.Key)
        .OrderBy(o => o)
        .Select(k => ComplexTypeElementFromXPath(k))
        .Where(q => null != q).ToArray();

    // The first one is our root element... it needs to be extracted and massaged
    compTypes[0] = compTypes.First().Element(xs + 
                     "sequence").Element(xs + "element");

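    // Warning: namespaces are tricky here, be careful.
    // The root must be in the XML Schema namespace (xs). The default xmlns is
    // set to the target namespace so that unprefixed type references such as
    // "DerbyInfo" resolve to it.
    return new XDocument(new XElement(xs + "schema",
        new XAttribute("xmlns", targetNamespace),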
        // Why 'qualified'?
        // All "qualified" elements and
        // attributes are in the targetNamespace of the
        // schema and all "unqualified"
        // elements and attributes are in no namespace.
        //  All global elements and attributes are qualified.
        new XAttribute("elementFormDefault", "qualified"),

        // Specify the target namespace,
        // you will want this for schema validation
        new XAttribute("targetNamespace", targetNamespace),
                
        // hint to xDocument that we want
        // the xml schema namespace to be called 'xs'
        new XAttribute(XNamespace.Xmlns + "xs", 
                       "http://www.w3.org/2001/XMLSchema"),
                       compTypes));
}
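
The namespace handling is the subtle part of this method. The standalone sketch below shows only the LINQ to XML mechanics involved: putting the root in the XML Schema namespace, hinting the xs prefix, and declaring a default namespace. The class name SchemaRootSketch, the method BuildRoot, and the sample element inside it are assumptions made for illustration, not code from the download.

```csharp
using System;
using System.Xml.Linq;

// Illustrative only: builds a schema root with one global element.
public static class SchemaRootSketch
{
    static readonly XNamespace Xs = "http://www.w3.org/2001/XMLSchema";

    public static XElement BuildRoot(string targetNamespace)
    {
        return new XElement(Xs + "schema",
            // Hint that the schema namespace should be written with the 'xs' prefix.
            new XAttribute(XNamespace.Xmlns + "xs", Xs.NamespaceName),
            // Default namespace = target namespace, so an unprefixed
            // type="DerbyInfo" refers to a type defined by this schema.
            new XAttribute("xmlns", targetNamespace),
            new XAttribute("targetNamespace", targetNamespace),
            new XAttribute("elementFormDefault", "qualified"),
            new XElement(Xs + "element",
                new XAttribute("name", "Derby"),
                new XAttribute("type", "DerbyInfo")));
    }
}
```

Passing a made-up namespace such as "urn:example:derby" to BuildRoot gives a root whose name is in the XML Schema namespace while the default namespace is the target namespace, which is the combination a validator needs to resolve both xs:element and DerbyInfo.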

Recurse All XPaths

For each element, check whether its XPath has been seen before, look for element names that repeat at a different level (recursively defined elements), and track them.

C#
static void RecurseAllXPaths(string xpath, XElement elem)
{
    var missingXpath = !xpaths.ContainsKey(xpath);
    var lclName = elem.Name.LocalName;

    var hasLcl = elements.ContainsKey(lclName);

    // Check for recursion in the element name (same name different level)
    if (hasLcl && missingXpath)
        recurseElements.Add(lclName);
    else if (!hasLcl)
        elements.Add(lclName, true);

    // if it's not in the xpath, then add it.
    if (missingXpath)
        xpaths.Add(xpath, null);

    // add xpaths for all attributes
    elem.Attributes().ToList().ForEach(attr =>
        {
            var xpath1 = string.Format("{0}/@{1}", xpath, attr.Name);
            if (!xpaths.ContainsKey(xpath1))
                xpaths.Add(xpath1, null);
        });

    elem.Elements().ToList().ForEach(fe => RecurseAllXPaths(
        string.Format("{0}/{1}", xpath, lclName), fe));
}

Generating Schema From XPaths

Now that we have a list of XPaths, we need to generate the appropriate schema for them.

C#
private static XElement ComplexTypeElementFromXPath(string xp)
{
    var parts = xp.Split('/');
    var last = parts.Last();
    var isAttr = last.StartsWith("@");
    var parent = ParentElementByXPath(parts);

    return (isAttr) ? BuildAttributeSchema(xp, last, parent) : 
        BuildElementSchema(xp, last, parent);
}
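
ParentElementByXPath is not shown in the article. One plausible way to do the lookup, sketched here under the assumption that xpaths maps each path to the schema element already built for it, is to drop the last path segment and look up what remains. The names ParentLookupSketch, ParentPath and FindParent are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical stand-in for the article's ParentElementByXPath.
public static class ParentLookupSketch
{
    // "/Derby/Racers/@Name" split on '/' gives ["", "Derby", "Racers", "@Name"];
    // the parent path is everything except the last segment.
    public static string ParentPath(string[] parts)
    {
        if (parts.Length <= 2)
            return null; // "/Derby" is top level and has no parent
        return string.Join("/", parts, 0, parts.Length - 1);
    }

    // Returns the schema element built for the parent path, or null if the
    // path is top level or its parent has not been built yet.
    public static XElement FindParent(IDictionary<string, XElement> built, string[] parts)
    {
        var parentPath = ParentPath(parts);
        XElement parent;
        return parentPath != null && built.TryGetValue(parentPath, out parent)
            ? parent
            : null;
    }
}
```

This relies on parents being processed before their children. Sorting the paths, as Generate does with OrderBy, gives that order because a path always sorts before any longer path that starts with it.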

BuildAttributeSchema

C#
private static XElement BuildAttributeSchema(string k, 
               string last, XElement parent)
{
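    // Use the prefixed name: an unprefixed "string" would not resolve to the
    // built-in XML Schema type.
    var elem0 = new XElement(xs + "attribute",
        new XAttribute("name", last.TrimStart('@')),
        new XAttribute("type", "xs:string"));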
            
    if (null != parent)
        parent.Add(elem0);

    xpaths[k] = elem0;

    return null;
}

BuildElementSchema

This one is not as straightforward as BuildAttributeSchema: we have to make sure the parent node gets the appropriate type reference to the new element. It's a little hairy, but it works nicely.

C#
private static XElement BuildElementSchema(string k, 
               string last, XElement parent)
{
    XElement seqElem = null;
    if (null != parent)
    {
        seqElem = parent.Element(xs + "sequence");

        // Add a new sequence if one doesn't already exist
        if (null == seqElem)
            // Note: add sequence to the start,
            //  because sequences need to come before any 
            //  attributes in XSD syntax
            parent.AddFirst(seqElem = new XElement(xs + "sequence"));
    }
    else
    {
        // In this case, there's no existing parent
        seqElem = new XElement(xs + "sequence");
    }

    var lastInfo = last + "Info";

    var elem0 = new XElement(xs + "element",
            new XAttribute("name", last),
            new XAttribute("type", lastInfo));
    seqElem.Add(elem0); // add the ref to the existing sequence

    return xpaths[k] = new XElement(xs + "complexType",
        new XAttribute("name", lastInfo));
}
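
The core idea, referencing a named type from the parent instead of nesting it, can be isolated in a small standalone sketch. The class FlatTypeSketch and its two methods are illustrative names, not part of the download. The sketch follows the same "Info" suffix convention as the code above.

```csharp
using System;
using System.Xml.Linq;

// Illustrative only: shows the flat, type-referencing style in isolation.
public static class FlatTypeSketch
{
    static readonly XNamespace Xs = "http://www.w3.org/2001/XMLSchema";

    // Returns the complex type's <xs:sequence>, creating it if needed.
    public static XElement GetOrAddSequence(XElement complexType)
    {
        var seq = complexType.Element(Xs + "sequence");
        if (seq == null)
        {
            seq = new XElement(Xs + "sequence");
            // AddFirst keeps the sequence ahead of any xs:attribute children,
            // which is the order XSD requires inside a complexType.
            complexType.AddFirst(seq);
        }
        return seq;
    }

    // Adds <xs:element name="X" type="XInfo"/> to the parent type and returns
    // a new, empty complexType named "XInfo" to be placed at the top level.
    public static XElement AddChildReference(XElement parentType, string childName)
    {
        var typeName = childName + "Info";
        GetOrAddSequence(parentType).Add(
            new XElement(Xs + "element",
                new XAttribute("name", childName),
                new XAttribute("type", typeName)));
        return new XElement(Xs + "complexType", new XAttribute("name", typeName));
    }
}
```

Because the child type is returned rather than added under the parent, the caller can append it as a sibling at the schema root, which is what keeps generated class names short (CubInfo rather than DerbyRacersGroupCub).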

Using the Code

  • Download the sample project
  • Build in VS2010 or Express
  • Press F5 to run the solution in the Debug configuration
  • Open Derby.Xsd in bin/Debug to see the result

If you're still reading, I strongly recommend stepping through the project with F10/F11 to get into the details. Have fun!

Enhancements

  • Elements without children (a.k.a. value elements)
  • Derive data types from the contents of the sample XML (integer, boolean, DateTime, etc.)
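
As a rough idea of what deriving data types from sample content could look like, here is a minimal sketch. It is not the article's implementation: the class XsdTypeGuesser and the method InferXsdType are hypothetical, only a handful of built-in types are covered, and the returned names assume the xs prefix is bound to the XML Schema namespace.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: guesses a built-in XSD type from one sample value.
public static class XsdTypeGuesser
{
    public static string InferXsdType(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return "xs:string";
        var v = value.Trim();
        var inv = CultureInfo.InvariantCulture;

        long n;
        if (long.TryParse(v, NumberStyles.AllowLeadingSign, inv, out n))
            return "xs:integer";

        // "1" and "0" are also valid xs:boolean values, but they were
        // already classified as integers above; only the words remain.
        if (v == "true" || v == "false")
            return "xs:boolean";

        decimal d;
        if (decimal.TryParse(v, NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint, inv, out d))
            return "xs:decimal";

        DateTime dt;
        if (DateTime.TryParseExact(v, "yyyy-MM-dd'T'HH:mm:ss", inv, DateTimeStyles.None, out dt))
            return "xs:dateTime";
        if (DateTime.TryParseExact(v, "yyyy-MM-dd", inv, DateTimeStyles.None, out dt))
            return "xs:date";

        // Anything else stays a string.
        return "xs:string";
    }
}
```

A single sample value is weak evidence. A more careful version would look at every value seen for a given XPath and pick the narrowest type that fits all of them, falling back to xs:string on any disagreement.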

Future Improvements

  • Make recursively defined elements work

History

  • 12/04/2010 - Created.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)