XSLT to transform Excel XML worksheet to named nodes

Anamera

5.00/5 (5 votes)

10 Jun 2014CPOL4 min read

44.1K

568

This XSLT will output an Excel worksheet containing a CSV type of record collection in a more usable XML format.

Download Excel2NamedXML.zip - 3.7 KB

Introduction

Many third party applications output sequential files (rows of "records") in Excel format for data sharing. Unfortunately, compared to normal "CSV" files, Excel's native XML format is not very practical to use, because the columns/record fields (= Excel cells) don't have distinct element names:

<Row>
 <Cell>
   <Data ss:Type="String">123-44-5678</Data>
  </Cell>
  <Cell>
    <Data ss:Type="String">John</Data>
  </Cell>
  <Cell>
    <Data ss:Type="String">Doe</Data>
  </Cell>
  ...
</Row>

This parameterized XSLT will transform Excel's native XML format, by either using the column names from the first row (equivalent to the CSV header row), or from "named column" information, and outputting those as the node names for each record "field".

<row number="1">
        <SSN>123-44-5678</SSN>
        <First_Name>John</First_Name>
        <Last_Name>Doe</Last_Name>
        ...
</row>

By setting a XSL parameter, the first record/row can be skipped if it contains nothing but the column names, or be included if it contains data in "named cells".

Background

The CSV file format remains popular for exchanging data between applications. Typically, the first row of a CSV file contains the column headers - making the file somewhat "self documenting" and allowing for flexible processing based on column names, rather than column number.

Many applications also output their rows of records in Excel. For this specific purpose, the first record (=row) often contains the field names (=column headers) to remain "compatible" with CSV files. If you receive a regular spreadsheet (.xlsx file), use File | Save As | Other Formats | XML Spreadsheet 2003 to create an XML file compatible with this stylesheet.

The problem is that Excel is not aware that the data in the first row is meant to be column names. Consequently it's native XML format stores every "field" in every "record" with the same node name "Cell/Data", which is not practical for processing. While one could "hard code" column numbers, this obviously lacks complete scalability, e.g. if the column order changes or if ever columns are inserted or deleted.

Using the code

Place the XSL file Excel2NamedXML.xsl on your disk (per example the location of your Excel XML File).
Either edit your Excel XML file by hand to include the reference to the XSL file, like so:

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="excel2NamedXML.xsl"?>

<!-- anycode line displayed below here is a bug in the codeproject web site and should be ignored -->

or use the provided JavaScript source code "Transform_Book.wsf" as a basis for your batch script or IIS web site that uses DOM methods to dynamically transform the Excel (input) XML document to a new output document.

oXMLin.transformNodeToObject( oXSL, oXMLout );

The JavaScript source code can easily be customized by changing the values in first few lines:

var strExcelFile = "Book.xml";
var strXSLFile = "Excel2NamedXML.xsl";

var aParameters = {
        'nSkipRows' :    '1',
        'nWorksheet':    '1',
	'nSkipEmpty':    '1'
        };

strExcelFile is path to our input document;
strXSLfile is the path to the Excel2NamedXML.xsl file that is included with this article.
If your Excel file does not have a "header" row to skip, set nSkipRows to "0".
If your data is not the first worksheet in your workbook, you can use nWorksheet to specify which sheet in the book to use.
By default empty rows (no columns have any data) will be skipped, which is particularlarly useful at the end of the spreadsheet. To include empty rows set nSkipEmpty to "0".

Points of Interest

While digging for an existing solution to (what I assumed) was a frequently encountered problem, I did run into several related samples that each dealt with a different variation of Excel XML files:

Worksheet nodes with or without parent Workbook nodes (referenced by "rigid" XPATH references that only dealt with that precise structure)
Worksheets that used Excel's "named cell" feature to define the column names, rather than as a textual first row (NamedCell/Name nodes)
Multi-word column names with spaces, that can't be used as node names

The attached style sheet demonstrates how to add flexibility to an XSL template with clever use of generic XPATH references and XSL Parameters. "Field" names (the output node names) are always taken only from the first row, no matter if present as name cells or cell content. This prevents naming inconsistencies in case some rows have cells with different/additional cell names. Also, only data from a single worksheet (in a multi-sheet workbook) is processed, to prevent mixing data from one sheet with unrelated information from a different sheet.

In addition, while the standards define how to declare and use parameters in the stylesheet, they leave the details up to each implementation on how external parameter values may be passed from the outside. The attached script demonstrates how to pass parameters via MSXML's DOM extensions. (I lost an entire day debugging why the Processor interface "output" seemingly was always empty - before I finally caught a small footnote in the Microsoft documentation that referencing the "output" string also clears it!)

Credits

"jemodurn" had posted a related article XSLT to transform Excel XML into CSV files. I am reusing his "book.xml" as input.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)