Introduction
This article shows you how to create an Excel file using Openxml SDK 2.5. There are many articles out there that show you how to create the file using DOM, but I will show you how to create the file using another approach, which is a lot faster and way more memory efficient. This is a complete guide as it shows you how to use this SAX like approach to create a style, a shared string
table, it uses multi-sheet and it shows you how to write different datatypes into the File including the datetime
format.
Background
In one of the projects I have worked on recently, the system uses a 3rd party excel creator that comes with a bunch of other tools that we use and the excel creator is really slow. It creates 15k rows in 17 minutes. During 1 weekend, I explored another alternative by creating Excel file using OpenXml, and came across different solutions. This article will incorporate the stuff that I have learned with hours of Googling about the subject.
There are, of course, 3rd party tools out there that will make the task easier, like ClosedXML
, but I want to focus on creating the Excel from scratch using OpenXml only because I think this will be the speediest way of creating the file. My main focuses are:
- Speed
- Memory usage
Using the Code
This is a single Console project in the solution. The solution is a VS 2012 solution with .NET Framework 4. I have added a nuget package for DocumentFormat.OpenXml
into the project, it adds the DocumentFormat.OpenXml
for me automatically, I have to manually add a reference to the WindowsBase
.
All you have to do is unzip the file and run the console application. It will create an Excel file called test.xlsx for you on your desktop. You might need to manually change the path if you encounter permission problem or start Visual Studio as Administrator.
string path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
var fileName = Path.Combine(path, @"test.xlsx");
if (File.Exists(fileName))
{
File.Delete(fileName);
}
I will create 2 sheets for the Excel file. The first sheet will demonstrate writing different datatypes to the file. I have to map .NET Types to Excel types. The .NET Types being created are bool
, int
, DateTime
and string
. The Excel Types will be SharedString
, InlineString
, Boolean
, Number
and Date
(which is really Number
with formatting). I also created 2 sheets to show you how to split data into multiple sheet. Excel 2010 has a maximum of 1,048,576 rows. So if you have data larger than that, you can split the data into multiple sheets.
I defined .NET arrays with different types at the beginning of the Main()
then set up different parts for the Excel file. An Excel file contains only 1 SharedStringTablePart
and 1 WorkbookStylesPart
. We will create them at a different time, we will first create the WorkbookStylesPart
using the helper class.
OpenXmlWriterHelper.SaveCustomStylesheet(WorkbookPart);
SaveCustomStylesheet
calls the CreateDefaultStylesheet()
credit to this article at http://polymathprogrammer.com/2009/11/09/how-to-create-stylesheet-in-excel-open-xml/ to create a default style sheet that is generated by an empty Excel file. From what I Googled, you most likely do not want to change anything created by this default style sheet because Excel reserves certain stuff for default values. You could, however, add formatting or Styles to the default style sheets, just don't delete parts created by the default style sheet.
For example, I have a light blue background for my header rows, to achieve that, I added a Fill
to the stylesheet
.
var fill = new Fill();
var patternFill = new PatternFill();
patternFill.PatternType = PatternValues.Solid;
patternFill.ForegroundColor = new ForegroundColor { Rgb = HexBinaryValue.FromString("C8EEFF") };
fill.PatternFill = patternFill;
fills.AppendChild(fill);
fills.Count = (uint)fills.ChildElements.Count;
I also need to display Date
in a certain format so I added this numbering format to it.
var nfs = stylesheet.NumberingFormats;
uint iExcelIndex = 165;
NumberingFormat nf;
nf = new NumberingFormat();
nf.NumberFormatId = iExcelIndex++;
nf.FormatCode = @"[$-409]m/d/yy\ h:mm\ AM/PM;@";
nfs.Append(nf);
nfs.Count = (uint)nfs.ChildElements.Count;
In order for your cell to reference these special format or Style, you will need to create a different CellFormat
to identify the different NumberingFormat
, Font
, Fill
, Border
, Format
... so that your Cell could reference these CellFormat
using StyleIndex
. StyleIndex
is 0
based, with 0
dedicated to the default style. Any custom style you define starts with 1
.
var cfs = stylesheet.CellFormats;
var cf = new CellFormat();
cf.NumberFormatId = nf.NumberFormatId;
cf.FontId = 0;
cf.FillId = 0;
cf.BorderId = 0;
cf.FormatId = 0;
cf.ApplyNumberFormat = true;
cfs.Append(cf);
cf = new CellFormat();
cf.NumberFormatId = 0;
cf.FontId = 0;
cf.FillId = 2;
cf.ApplyFill = true;
cf.BorderId = 0;
cf.FormatId = 0;
cfs.Append(cf);
cfs.Count = (uint)cfs.ChildElements.Count;
Then, I simply save the Style
using the simple DOM approach because stylesheet
should be relatively small compared to the cell values.
var workbookStylesPart = workbookPart.AddNewPart<WorkbookStylesPart>();
var style = workbookStylesPart.Stylesheet = stylesheet;
style.Save();
After I create the Stylesheet
, I create a Workbook
and add the Sheets
element to it, then I begin adding Sheet1
to the Sheets
class. So far, everything is DOM.
var workbook = workbookPart.Workbook = new Workbook();
var sheets = workbook.AppendChild<Sheets>(new Sheets());
var worksheetPart = workbookPart.AddNewPart<WorksheetPart>();
var sheet = new Sheet() { Id = workbookPart.GetIdOfPart(worksheetPart), SheetId = 1, Name = "Sheet1" };
sheets.Append(sheet);
Now comes the fun, I create an OpenXmlWriter
using the WorksheetPart
, then begin writing XML elements into the file. Before I go into details, I need to explain what is an xlsx file. An xlsx file is basically a zip file containing a bunch of XML files in a certain folder structure along a bunch of uninteresting things. You can actually right click an xlsx file and unzip the file into a folder. Then you'll see the a folder called xl. Inside this folder, you might find the following:
styles.xml contains all the styles and formatting for the cells in the xlsx file, sharedStrings.xml contains all the cell values for cell that are specified as shared string
s.
Inside the worksheets folder, you will find 0 or more XML files; each file represents a worksheet. Most of essential information is defined in these files.
You may create a simple Excel file and examine the content of these files. You might find the following content in the part of the file:
<sheetData>
<row>
<c t="b"><v>1</v></c>
<c t="inlineStr"><is><t>this is an inline string</t></is></c>
<c t="s"><v>0</v></c>
<c t="n"><v>2</v></c>
<c s="1"><v>42053</v></c>
</row>
</sheetData>
These elements actually represent a worksheet with 1 row and 5 cells.
- The 1st cell is of
DataType
of Boolean
and the value is TRUE
. - The 2nd cell is an inline
string
with the value of "this is an inline string
". - The 3rd cell is a shared
string
type and the value is stored in the 0
index in the shared string
table. - The 4th cell is a number type and the value is
2
. - The 5th cell is a cell with style index of
1
and with the value of 42053
, the style index and the value translate this into a date time value.
This is what the OpenXmlWriter
is all about. It has the ability to translate strongly typed OpenXmlElement
into XML files and tags.
To create XML and the tags without loading an entire DOM, you will need to create an OpenXmlWriter
. I create the instance of it, then add the necessary tags using WriteStartElement()
or WriteElement()
.
using (var writer = OpenXmlWriter.Create(worksheetPart))
{
writer.WriteStartElement(new Worksheet());
writer.WriteStartElement(new SheetData());
To create a Row
Tag, you can simply do:
writer.WriteStartElement(new Row());
writer.WriteEndElement();
There are different overloads for the WriteStartElement
, you can actually pass in the tag <Row>
instead of using new Row()
if you do not want to create so many objects at run time. This could possibly reduce the memory foodprint of your application. Feel free to play around with it if you have time.
I like to look at code without having to download anything, that's why I'm posting important pieces of my code here. The code below writes out the cell (<c>
) and its children differently depending on what CellValues
you specify. It takes the OpenXmlWriter
instance, a CellValues datatype
and a list of OpenXmlAttribute
.
public void WriteCellValueSax(OpenXmlWriter writer, string cellValue,
CellValues dataType, List<OpenXmlAttribute> attributes = null)
{
switch (dataType)
{
case CellValues.InlineString:
{
if (attributes == null)
{
attributes = new List<OpenXmlAttribute>();
}
attributes.Add(new OpenXmlAttribute("t", null, "inlineStr"));
writer.WriteStartElement(new Cell(), attributes);
writer.WriteElement(new InlineString(new Text(cellValue)));
writer.WriteEndElement();
break;
}
case CellValues.SharedString:
{
if (attributes == null)
{
attributes = new List<OpenXmlAttribute>();
}
attributes.Add(new OpenXmlAttribute("t", null, "s"));
writer.WriteStartElement(new Cell(), attributes);
if (!_shareStringDictionary.ContainsKey(cellValue))
{
_shareStringDictionary.Add(cellValue, _shareStringMaxIndex);
_shareStringMaxIndex++;
}
writer.WriteElement(new CellValue(_shareStringDictionary[cellValue].ToString()));
writer.WriteEndElement();
break;
}
case CellValues.Date:
{
if (attributes == null)
{
writer.WriteStartElement(new Cell() { DataType = CellValues.Number });
}
else
{
writer.WriteStartElement(new Cell() { DataType = CellValues.Number }, attributes);
}
writer.WriteElement(new CellValue(cellValue));
writer.WriteEndElement();
break;
}
case CellValues.Boolean:
{
if (attributes == null)
{
attributes = new List<OpenXmlAttribute>();
}
attributes.Add(new OpenXmlAttribute("t", null, "b"));
writer.WriteStartElement(new Cell(), attributes);
writer.WriteElement(new CellValue(cellValue == "True" ? "1" : "0"));
writer.WriteEndElement();
break;
}
default:
{
if (attributes == null)
{
writer.WriteStartElement(new Cell() { DataType = dataType });
}
else
{
writer.WriteStartElement(new Cell() { DataType = dataType }, attributes);
}
writer.WriteElement(new CellValue(cellValue));
writer.WriteEndElement();
break;
}
}
}
OpenXmlAttribute
is used to write out attribute for the XML element. For example, to write out the "t
" attribute in <c t="inlineStr">
, you can do:
attributes.Add(new OpenXmlAttribute("t", null, "inlineStr"));
writer.WriteStartElement(new Cell(), attributes);
Attribute
is used in my case to specify formatting for the cell, e.g., specify a Cell to have certain StyleIndex
in which you have defined in the custom stylesheet
. I use it in the project to indicate if a cell should have certain background color or should have certain datetime
formatting.
String vs InlineString vs SharedString
Some of you might not know the difference between these CellValues enum
values, so I'm going to throw this part in also.
String
should be used to represent a formula in the cell (I didn't really handle this case) InlineString
will be treated as rich text SharedString
represents a string
that is shared between all cell/sheet in an Excel file. According to MSDN: A workbook can contain thousands of cells containing string
(non-numeric) data. Furthermore, this data is very likely to be repeated across many rows or columns. The goal of implementing a single string
table that is shared across the workbook is to improve performance in opening and saving the file by only reading and writing the repetitive information once.
I have tested creating 1 million rows using InlineString
vs SharedString
and find that if the data length is small and if it doesn't have many repeated values, the Excel file generated using InlineString
is a lot smaller, 19.7 MB vs 26 MB. However, if the data repeats itself many times, using SharedString
is actually more efficient. In the case that I have tested, 19.7 MB vs 15MB using 149000 unique 5 letters string for 4 millions cells. You will see the benefit of using shared string
if the string
s are long and if they are repetitive.
For the SharedString
implementation, I simply used a Dictionary<string, int>
where the key is the value of the cell and the value of the dictionary is the index the value contains in the SharedStringTable
.
private readonly Dictionary<string, int> _shareStringDictionary = new Dictionary<string, int>();
private int _shareStringMaxIndex = 0;
Microsoft has a different way of implementing inserting SharedString
into the SharedStringTable
, but I believe it will be a lot slower than the way I implemented it here. The only downside of my approach is memory usage of the Dictionary
. Ultimately, you can choose to use Microsoft's approach instead if memory is an issue for you.
Since all the cell Values are kept in the Dictionary
, we need to write out all the values into the sharedStringTable
after we have finished writing to the sheet. I chose to use OpenXmlWriter
again for this part.
public void CreateShareStringPart(WorkbookPart workbookPart)
{
if (_shareStringMaxIndex > 0)
{
var sharedStringPart = workbookPart.AddNewPart<SharedStringTablePart>();
using (var writer = OpenXmlWriter.Create(sharedStringPart))
{
writer.WriteStartElement(new SharedStringTable());
foreach (var item in _shareStringDictionary)
{
writer.WriteStartElement(new SharedStringItem());
writer.WriteElement(new Text(item.Key));
writer.WriteEndElement();
}
writer.WriteEndElement();
}
}
}
Conclusion
Depending on the configuration, on my machine, it takes about 11 seconds to 19 seconds to write out 4 million cell values. I would say that's fast.
Additional credits go to: