Introduction
This article is in response to an under documented feature of the Microsoft XLSB file format regarding reading string data. This article sheds light on how the user can understand how to write code to read the XLSB string data. The purpose of this article is to show the reader how to calculate the record size
of a record type in the XLSB file format.
Background
This is a very advanced topic. It requires that the user has already written a substantial amount of code that can handle the XLSB file format. The XLSB file format is really a ZIP archive that contains many files with many types of information. For this article we are only
interested in the worksheet file information.
Using the code
This article doesn't use code or even pseudo-code. It simply sheds light on the data structure of the string record.
Points of Interest
This article is for anyone who has ever had to write readers or converters of the Excel file format. Currently I have written or worked on XLS 95, XLS 97-2003, XLSX, and XLSB. I have these formats under my belt and consider myself an expert on many aspects of the file format though I am not much of a user or expert of the software itself. In my day to day employment I have to support the Excel file format.
The data structure
A record in the XLSB file format is a smart data structure. It allows for future record types and support for legacy record types. I find the record format quite advanced and smartly designed. "Each binary record is a variable-length sequence of bytes. A binary record consists of three components: a record type, a record size, and the record data that is specific to that record type." -- From the documentation XLSB documentation.
- The record type is variable length up to 2 bytes and is combined as 7 bit numbers.
- The record size is also variable length up to 4 bytes and is combined as 7 bit numbers.
- The record data follows and its length, in bytes, is the record size.
When reading the data one can skip all the record types of the data they are not interested in and parse those records they are interested in. String data comes in many formats. It can exist in a string table or as a stand alone record. Because the string data is variable length in nature it is extremely important to be able to precisely calculate the record size in order to correctly read the record data. At the end of the article a formula is given that can be used to calculate any record length of any record type. This article will examine this with the context of string data.
The documentation gives a poor description and example of how to process a small string and they give the formula B2 * 128 + B1. B1 is the first byte but using
only the first 7 bits plus the second byte multiplied by 128, but only using the first 7 bits of the data. With this formula we can only read strings totaling 127 * 128 + 127 in length.
This works out to 16,383 bytes of data. Since the strings are stored as Unicode data this length is really halved and the string is limited to 8,191
Unicode characters.
This is probably sufficient for most cells of data.
What do we do when we have a string longer than that? If we follow the documentation we come away wanting more examples and documentation. However, we are left wanting.... What are we to do? In my case I had to debug a crashing application because it was trying to read in strings that are larger than this size. Fortunately I was able to reverse engineer the process and package it up in a nice article for future generations of coders who want to read Excel data.
A string comes in many different record types but underneath it all is a basic string representation. For this example we are looking at record type 62. This record is formally known as
the BRTCELLSTRING
and it is one of the record types that Excel uses when transferring clipboard data between applications. The record contains cell and string information. This record is given by specifications 2.4.282, 2.5.9, and 2.5.128 of the file format. In each image the left most "0" represents the first bit and the rightmost "1" represents the 32 bit of the data structure. The components are stacked to logical order and size. Some bits are not used and are marked as reserved.
Let's look at the crashing application and the string it was trying to read. We know that the format of the string. See
BrtCellRString
above. The first portion is a cell. See Cell above. The next portion is the value. See
RichStr
above.
Now we need to read the data dwSizeStrRun
in the
RichStr
(2.5.128) structure which tells us the length of the string. Theoretically it should be possible to reverse engineer the overarching record size using just the string length element since the rest of the data is a fixed length.
Looking at the structures again we see that the Cell is always 8 bytes, 4 bytes for the column and 4 for the style. The value is variable length with 1 byte for information, 4 bytes for the length
dwSizeStrRun
, and then the string data.
Now we can read the string length and compute the record size and make sense of the record size bytes.
In the crash I was analyzing
dwSizeStrRun
contained the value 12402. Using the knowledge of the record and knowing this string is not phonetic we can compute the value that the record size should tell us.
Starting by knowing that the string is a Unicode string we must double this amount to get the length in bytes. Therefore 12402 * 2 = 24804. Now we add 13 to this to account for the size of the fields containing
dwSizeStrRun
(4 bytes), A and B (1 byte), iStyleRef
and A (4 bytes), and column (4 bytes). We get an expected record size of 24817 bytes.
Now, in my sample data the record size of the overarching file is given by the 3 bytes 241, 193, and 1. Once again how did we know we only have 3 bytes to determine the record size? We know this because any byte of the record size which is greater than or equal to 128 means that another byte follows. We read, based on this logic, for up to 4 bytes. In the example the 3rd byte has the value of 1 so we stop.
How in the world do the bytes 241, 193, and 1 get translated into the record size of 24817? It's easy if you know how...
- Step 1 - The documentation tells us that only the first 7 bits are important in the value so for values greater than or equal to 128 we subtract 128. This gives us the values 113, 65, and 1
- Step 2 - We are told that the first byte is the least significant byte and the last byte is the most significant byte so we reverse the order. This gives us the sequence of 1, 65, and 113.
- Step 3 - We are told that these are 7 bit numbers so let's look at the bits, in order of the sequence above
1 = 0000001, 65 = 1000001, 113 = 1110001.
Combined they are 000000110000011110001
Now let's look at the binary value of 24817.
24817 = 110000011110001
Let's
compare our two binary numbers: 110000011110001 == 110000011110001
Voila! They are the same so now we know how to manipulate the record size bytes to match with the string length. Those record size bytes, once manipulated form a BCD value. BCD is a binary coded decimal. Let's turn this information into a formula. The formula must ignore the 8th bit and must concatenate the (at most) 4 bytes together to make a record length. The 8th bit is 128 and all the other bits of interest
combine to the value of 127. We can use "And" logic to mask out the high bit by only looking at the bits of interest. We can use "bit shift" logic to position our bits for our binary coded decimal value. We can use "Or" logic to stitch everything back together.
Here goes our formula:
Record Size = ((B4 & 127) << 21) | ((B3 & 127) << 14) | ((B2 & 127) << 7) | (B1 & 127)
I doubt you will get the same gratification out of reverse engineering a file format to fill in the gaps of documentation that I needed to do for my work. However, I hope this formula really helps someone and my purpose for writing this article was to document an under documented but extremely critical feature of the XLSB file format.
Cheers! - Andy
History
- 11/7/2013 - The article was penned by me.
- 11/8/2013 - Edited to add more text and fix the formula because I missed a parenthesis.