Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

How to Create Large Excel File using Openxml

4.92/5 (12 votes)
19 Feb 2015CPOL8 min read 92.2K   4.5K  
Create Excel File using Openxml without loading the entire DOM

Introduction

This article shows you how to create an Excel file using Openxml SDK 2.5. There are many articles out there that show you how to create the file using DOM, but I will show you how to create the file using another approach, which is a lot faster and way more memory efficient. This is a complete guide as it shows you how to use this SAX like approach to create a style, a shared string table, it uses multi-sheet and it shows you how to write different datatypes into the File including the datetime format.

Background

In one of the projects I have worked on recently, the system uses a 3rd party excel creator that comes with a bunch of other tools that we use and the excel creator is really slow. It creates 15k rows in 17 minutes. During 1 weekend, I explored another alternative by creating Excel file using OpenXml, and came across different solutions. This article will incorporate the stuff that I have learned with hours of Googling about the subject.

There are, of course, 3rd party tools out there that will make the task easier, like ClosedXML, but I want to focus on creating the Excel from scratch using OpenXml only because I think this will be the speediest way of creating the file. My main focuses are:

  1. Speed
  2. Memory usage

Using the Code

This is a single Console project in the solution. The solution is a VS 2012 solution with .NET Framework 4. I have added a nuget package for DocumentFormat.OpenXml into the project, it adds the DocumentFormat.OpenXml for me automatically, I have to manually add a reference to the WindowsBase.

Image 1

All you have to do is unzip the file and run the console application. It will create an Excel file called test.xlsx for you on your desktop. You might need to manually change the path if you encounter permission problem or start Visual Studio as Administrator.

C#
string path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
var fileName = Path.Combine(path,  @"test.xlsx");
if (File.Exists(fileName))
{
    File.Delete(fileName);
}

I will create 2 sheets for the Excel file. The first sheet will demonstrate writing different datatypes to the file. I have to map .NET Types to Excel types. The .NET Types being created are bool, int, DateTime and string. The Excel Types will be SharedString, InlineString, Boolean, Number and Date (which is really Number with formatting). I also created 2 sheets to show you how to split data into multiple sheet. Excel 2010 has a maximum of 1,048,576 rows. So if you have data larger than that, you can split the data into multiple sheets.

I defined .NET arrays with different types at the beginning of the Main() then set up different parts for the Excel file. An Excel file contains only 1 SharedStringTablePart and 1 WorkbookStylesPart. We will create them at a different time, we will first create the WorkbookStylesPart using the helper class.

C#
OpenXmlWriterHelper.SaveCustomStylesheet(WorkbookPart);

SaveCustomStylesheet calls the CreateDefaultStylesheet() credit to this article at http://polymathprogrammer.com/2009/11/09/how-to-create-stylesheet-in-excel-open-xml/ to create a default style sheet that is generated by an empty Excel file. From what I Googled, you most likely do not want to change anything created by this default style sheet because Excel reserves certain stuff for default values. You could, however, add formatting or Styles to the default style sheets, just don't delete parts created by the default style sheet.

For example, I have a light blue background for my header rows, to achieve that, I added a Fill to the stylesheet.

C#
//header fills background color
var fill = new Fill();
var patternFill = new PatternFill();
patternFill.PatternType = PatternValues.Solid;
patternFill.ForegroundColor = new ForegroundColor { Rgb = HexBinaryValue.FromString("C8EEFF") };
//patternFill.BackgroundColor = new BackgroundColor() { Indexed = 64 };
fill.PatternFill = patternFill;
fills.AppendChild(fill);
fills.Count = (uint)fills.ChildElements.Count;

I also need to display Date in a certain format so I added this numbering format to it.

C#
// *************************** numbering formats ***********************
var nfs = stylesheet.NumberingFormats;
//number less than 164 is reserved by excel for default formats
uint iExcelIndex = 165;
NumberingFormat nf;
nf = new NumberingFormat();
nf.NumberFormatId = iExcelIndex++;
nf.FormatCode = @"[$-409]m/d/yy\ h:mm\ AM/PM;@";
nfs.Append(nf);

nfs.Count = (uint)nfs.ChildElements.Count;

In order for your cell to reference these special format or Style, you will need to create a different CellFormatto identify the different NumberingFormat, Font, Fill, Border, Format... so that your Cell could reference these CellFormat using StyleIndex. StyleIndex is 0 based, with 0 dedicated to the default style. Any custom style you define starts with 1.

C#
//************************** cell formats ***********************************
var cfs = stylesheet.CellFormats;//this should already contain a default StyleIndex of 0
       
var cf = new CellFormat();// Date time format is defined as StyleIndex = 1
cf.NumberFormatId = nf.NumberFormatId;
cf.FontId = 0;
cf.FillId = 0;
cf.BorderId = 0;
cf.FormatId = 0;
cf.ApplyNumberFormat = true;
cfs.Append(cf);
       
cf = new CellFormat();// Header format is defined as StyleINdex = 2
cf.NumberFormatId = 0;
cf.FontId = 0;
cf.FillId = 2;
cf.ApplyFill = true;
cf.BorderId = 0;
cf.FormatId = 0;
cfs.Append(cf);
      
cfs.Count = (uint)cfs.ChildElements.Count;

Then, I simply save the Style using the simple DOM approach because stylesheet should be relatively small compared to the cell values.

C#
var workbookStylesPart = workbookPart.AddNewPart<WorkbookStylesPart>();
var style = workbookStylesPart.Stylesheet = stylesheet;
style.Save();

After I create the Stylesheet, I create a Workbook and add the Sheets element to it, then I begin adding Sheet1 to the Sheets class. So far, everything is DOM.

C#
var workbook = workbookPart.Workbook = new Workbook();
var sheets = workbook.AppendChild<Sheets>(new Sheets());


// create worksheet 1
var worksheetPart = workbookPart.AddNewPart<WorksheetPart>();
var sheet = new Sheet() { Id = workbookPart.GetIdOfPart(worksheetPart), SheetId = 1, Name = "Sheet1" };
sheets.Append(sheet);

Now comes the fun, I create an OpenXmlWriter using the WorksheetPart, then begin writing XML elements into the file. Before I go into details, I need to explain what is an xlsx file. An xlsx file is basically a zip file containing a bunch of XML files in a certain folder structure along a bunch of uninteresting things. You can actually right click an xlsx file and unzip the file into a folder. Then you'll see the a folder called xl. Inside this folder, you might find the following:

Image 2

styles.xml contains all the styles and formatting for the cells in the xlsx file, sharedStrings.xml contains all the cell values for cell that are specified as shared strings.

Inside the worksheets folder, you will find 0 or more XML files; each file represents a worksheet. Most of essential information is defined in these files.

You may create a simple Excel file and examine the content of these files. You might find the following content in the part of the file:

XML
<sheetData>
<row>
<c t="b"><v>1</v></c>
<c t="inlineStr"><is><t>this is an inline string</t></is></c>
<c t="s"><v>0</v></c>
<c t="n"><v>2</v></c>
<c s="1"><v>42053</v></c>
</row>
</sheetData>

These elements actually represent a worksheet with 1 row and 5 cells.

  • The 1st cell is of DataType of Boolean and the value is TRUE.
  • The 2nd cell is an inline string with the value of "this is an inline string".
  • The 3rd cell is a shared string type and the value is stored in the 0 index in the shared string table.
  • The 4th cell is a number type and the value is 2.
  • The 5th cell is a cell with style index of 1 and with the value of 42053, the style index and the value translate this into a date time value.

This is what the OpenXmlWriter is all about. It has the ability to translate strongly typed OpenXmlElement into XML files and tags.

To create XML and the tags without loading an entire DOM, you will need to create an OpenXmlWriter. I create the instance of it, then add the necessary tags using WriteStartElement() or WriteElement().

C#
using (var writer = OpenXmlWriter.Create(worksheetPart))
{
    writer.WriteStartElement(new Worksheet());
    writer.WriteStartElement(new SheetData());

To create a Row Tag, you can simply do:

C#
writer.WriteStartElement(new Row()); //<Row>
//...... write your cell tags using OpenXmlWriter
writer.WriteEndElement();  //this will close the Row Tag : </Row>

There are different overloads for the WriteStartElement, you can actually pass in the tag <Row> instead of using new Row() if you do not want to create so many objects at run time. This could possibly reduce the memory foodprint of your application. Feel free to play around with it if you have time.

I like to look at code without having to download anything, that's why I'm posting important pieces of my code here. The code below writes out the cell (<c>) and its children differently depending on what CellValues you specify. It takes the OpenXmlWriter instance, a CellValues datatype and a list of OpenXmlAttribute.

C#
public void WriteCellValueSax(OpenXmlWriter writer, string cellValue, 
CellValues dataType, List<OpenXmlAttribute> attributes = null)
{
    switch (dataType)
    {
        case CellValues.InlineString:
            {
                if (attributes == null)
                {
                    attributes = new List<OpenXmlAttribute>();
                }
                attributes.Add(new OpenXmlAttribute("t", null, "inlineStr"));
                writer.WriteStartElement(new Cell(), attributes);
                writer.WriteElement(new InlineString(new Text(cellValue)));
                writer.WriteEndElement();
                break;
            }
        case CellValues.SharedString:
            {
                if (attributes == null)
                {
                    attributes = new List<OpenXmlAttribute>();
                }
                attributes.Add(new OpenXmlAttribute("t", null, "s"));//shared string type
                writer.WriteStartElement(new Cell(), attributes);
                if (!_shareStringDictionary.ContainsKey(cellValue))
                {
                    _shareStringDictionary.Add(cellValue, _shareStringMaxIndex);
                    _shareStringMaxIndex++;
                }

                //writing the index as the cell value
                writer.WriteElement(new CellValue(_shareStringDictionary[cellValue].ToString()));

                writer.WriteEndElement();//cell

                break;
            }
        case CellValues.Date:
            {
                if (attributes == null)
                {
                    writer.WriteStartElement(new Cell() { DataType = CellValues.Number });
                }
                else
                {
                    writer.WriteStartElement(new Cell() { DataType = CellValues.Number }, attributes);
                }

                writer.WriteElement(new CellValue(cellValue));

                writer.WriteEndElement();

                break;
            }
        case CellValues.Boolean:
            {
                if (attributes == null)
                {
                    attributes = new List<OpenXmlAttribute>();
                }
                attributes.Add(new OpenXmlAttribute("t", null, "b"));//boolean type
                writer.WriteStartElement(new Cell(), attributes);
                writer.WriteElement(new CellValue(cellValue == "True" ? "1" : "0"));
                writer.WriteEndElement();
                break;
            }
        default:
            {
                if (attributes == null)
                {
                    writer.WriteStartElement(new Cell() { DataType = dataType });
                }
                else
                {
                    writer.WriteStartElement(new Cell() { DataType = dataType }, attributes);
                }
                writer.WriteElement(new CellValue(cellValue));

                writer.WriteEndElement();

                break;
            }
    }
}

OpenXmlAttribute is used to write out attribute for the XML element. For example, to write out the "t" attribute in <c t="inlineStr">, you can do:

C#
attributes.Add(new OpenXmlAttribute("t", null, "inlineStr"));
writer.WriteStartElement(new Cell(), attributes);

Attribute is used in my case to specify formatting for the cell, e.g., specify a Cell to have certain StyleIndex in which you have defined in the custom stylesheet. I use it in the project to indicate if a cell should have certain background color or should have certain datetime formatting.

String vs InlineString vs SharedString

Some of you might not know the difference between these CellValues enum values, so I'm going to throw this part in also.

  • String should be used to represent a formula in the cell (I didn't really handle this case)
  • InlineString will be treated as rich text
  • SharedString represents a string that is shared between all cell/sheet in an Excel file. According to MSDN: A workbook can contain thousands of cells containing string (non-numeric) data. Furthermore, this data is very likely to be repeated across many rows or columns. The goal of implementing a single string table that is shared across the workbook is to improve performance in opening and saving the file by only reading and writing the repetitive information once.

I have tested creating 1 million rows using InlineString vs SharedString and find that if the data length is small and if it doesn't have many repeated values, the Excel file generated using InlineString is a lot smaller, 19.7 MB vs 26 MB. However, if the data repeats itself many times, using SharedString is actually more efficient. In the case that I have tested, 19.7 MB vs 15MB using 149000 unique 5 letters string for 4 millions cells. You will see the benefit of using shared string if the strings are long and if they are repetitive.

For the SharedString implementation, I simply used a Dictionary<string, int> where the key is the value of the cell and the value of the dictionary is the index the value contains in the SharedStringTable.

C#
/// <summary>
/// contains the shared string as the key, and the index as the value.  index is 0 base
/// </summary>
private readonly Dictionary<string, int> _shareStringDictionary = new Dictionary<string, int>();
private int _shareStringMaxIndex = 0;

Microsoft has a different way of implementing inserting SharedString into the SharedStringTable, but I believe it will be a lot slower than the way I implemented it here. The only downside of my approach is memory usage of the Dictionary. Ultimately, you can choose to use Microsoft's approach instead if memory is an issue for you.

Since all the cell Values are kept in the Dictionary, we need to write out all the values into the sharedStringTable after we have finished writing to the sheet. I chose to use OpenXmlWriter again for this part.

C#
/// <summary>
/// write out the share string xml.  Call this after writing out all shared string values in sheet
/// </summary>
/// <param name="workbookPart"></param>
public void CreateShareStringPart(WorkbookPart workbookPart)
{
    if (_shareStringMaxIndex > 0)
    {
        var sharedStringPart = workbookPart.AddNewPart<SharedStringTablePart>();
        using (var writer = OpenXmlWriter.Create(sharedStringPart))
        {
            writer.WriteStartElement(new SharedStringTable());
            foreach (var item in _shareStringDictionary)
            {
                writer.WriteStartElement(new SharedStringItem());
                writer.WriteElement(new Text(item.Key));
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
    }
}

Conclusion

Depending on the configuration, on my machine, it takes about 11 seconds to 19 seconds to write out 4 million cell values. I would say that's fast.

Additional credits go to:

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)