Introduction
When I was writing one of the projects (multi-format data parser) I had to deal with Microsoft Excel files.
First of all I tried to read data via COM interop, but this way was very slow (and sometimes input files have up to 100,000 rows and even more) and required Excel to be installed on the target machine.
Then I tried to read data via Excel's OleDb driver. When DataReader ran through data rows without reading them (i.e. while (rd.Read());
) it was quite fast, but retrieving data was still very slow.
When I had tried out one of commercial products (TMS FlexCel Studio), it showed rather good results (several seconds for loading 17 MB file, 2 sheets, ~ 80 000 rows), but access to them was not very fast (at least not suitable for displaying data using DataGridView in virtual mode).
So, I decided to write my own reader for Excel files.
Background
First of all I decided that the reader will produce a DataSet
object, with sheets represented as DataTable
s within it. Access to DataTable
is rather fast, and its TableName
property is suitable for sheet title.
To read an Excel file, we need to get BIFF stream from xls OLE2 container. Working with OLE2 container is implemented in the following classes:
XlsHeader
(file header representation) XlsFat
(support for file system in OLE2 container) XlsRootDirectory
(directory tree representation) XlsDirectoryEntry
, and XlsStream
(file stream)
Note that implementation of OLE2 container object model is limited (e.g. I didn't write any MiniFAT support code) and Excel-specific (XlsHeader
supports only MS Excel header field values, any other will produce InvalidHeaderException
).
Now we can parse BIFF stream with spreadsheet data. For handy reading data from stream we have a XlsBiffStream
class derived from XlsStream
. It provides a set of methods for reading XlsBiffRecord
common structures. And then, each structure is converted to the appropriate type. Note that only a few record types are supported. All other structures are read as basic XlsBiffRecord
and are ignored.
My implementation of reader ignores any text formatting, graphics, embedded objects and charts. Only cell values are read. When a cell contains a formula, the last calculated formula value is used as the cell value. Also, only indexed sheets are read (i.e. sheets with index data) - I didn't write not-indexed sheets support due to my laziness (I didn't come across any file without an index yet).
Using the code
Now, it's very easy to use the parser. All you need is a Stream
with an Excel file.
FileStream fs = new FileStream(@"c:\file.xls",
FileMode.Open, FileAccess.Read);
ExcelDataReader rd = new ExcelDataReader(fs);
fs.Close();
DataSet data = rd.WorkbookData;
Points Of Interest
Latest versions of Excel use SST (Shared String Table) to store string information, and text cells only reference string by index. Also, the size of one BIFF section is limited, and SST can be rather big, so it uses the so-called CONTINUE sections. Every string in SST can be saved as Unicode (2 byte) or as ANSI (single byte per symbol). But when the string is broken with the continue section, it can change its encoding from one to another! This took me a lot of debugging, and I wish nobody has to do the same again...
Results
This implementation of reader is rather nonoptimal (I was pressed for time), but it showed a good speed. It surely beats any interop or OLE method and even some commercial analogs.