Introduction
I used to have problems with some Access 97 databases that went corrupt. Also I had instalation problems with the JET Engine for MsAccess. I found MdbTools, and could repair databases or at least get data out. Then I ported MdbTools to MFC and .NET (only read, not write support). I updated the sources to the latest version of mdbtools.
The MsAccess file structure
Pages
All the info of the mdb file is organized as pages. A page is a region in the file of 2k (Jet3) and 4k (Jet4).
So, the file len of MsAccess databases is a multiple of 2,048 (Jet3) or 4,096 (Jet4). For example, the Nwind.mdb file sample contains 1,548,288 / 2048 = 756 pages.
All pages are stored secuentially, and, as they are all the same size, page n starts at n * 2048 + 1 or n * 4096 + 1 in the file (considering n starting at 0).
For example, for a Jet3 database containing 5 pages (including 0 which is the Database Definition page):
Positions differ from Jet3 to Jet4, so all tables will be referred to Jet3.
Page Number | Starts at position (dec) | Size (dec) |
0 | 0 | 2048 |
1 | 2048 | 2048 |
2 | 4096 | 2048 |
3 | 6144 | 2048 |
4 | 8192 | 2048 |
Page 0 - Database Definition Page
There is not much information about this page. What is known:
From (hex) | To (hex) | Len (dec) | Format | Description |
0x0 | 0x0 | 1 | Byte | Page type - For the database definition page type is 0 |
0x1 | 0x13 | 19 | | Unknown |
0x14 | 0x17 | 4 | LE-Int | Jet version of the database (0 for Jet3, 1 for Jet4, 2 for Jet5, 3 for Access 2010 |
0x18 | 0x3D | 38 | | Unknown |
0x3E | 0x41 | 4 | LE-Int | RC4 Key used to encrypt pages (if the database is encrypted) |
0x42 | 0x50 | 14 | Byte | Password (masked) |
0x51 | 0x7FF | 1967 | | Unknown |
LE-Int: Refers to Little Endian Int which means that the number is stored from the least significant bit to the most significant bit (LSB-MSB). For example, the number 0xA1B2C3D4 is stored as: D4 C3 B2 A1.
The first byte of each page identifies the page type:
- 0x00 Database definition page. (Always page 0)
- 0x01 Data page
- 0x02 Table definition
- 0x03 Intermediate Index pages
- 0x04 Leaf Index pages
- 0x05 Page Usage Bitmaps (extended page usage)
Page 1
Keeps track of allocated pages in the database.
Page 2 - Table definition page for MSysObjects
MsAccess databases include system tables that contain information about the database itself. They all start with the prefix MSys.
Page 2 contains the Table definition (Page type 0x02) of the table MSysObjects. This table contains all the objects names and (more important), the page number location of them.
The structure of a Table Definition page is:
From (hex) | To (hex) | Len (dec) | Format | Description |
0x0 | 0x0 | 1 | Byte | Page type - For the table definition page type is 2 |
0x1 | 0x7 | 8 | | Unknown |
0x8 | 0x9 | 2 | LE-Int | Page len |
0xC | 0xF | 4 | LE-Int | Number of rows |
0x17 | 0x18 | 2 | LE-Int | Number of variable columns |
0x19 | 0x1A | 2 | LE-Int | Number of columns |
0x1B | 0x1E | 4 | LE-Int | Number of indexes |
0x1F | 0x22 | 4 | LE-Int | Number of index entries |
0x23 | 0x26 | 4 | | Pages usage bitmask |
0x27 | 0x2A | 4 | LE-Int | Pages with free space bitmask |
Column definitions start at 0x2B + (Number of index entries * 8). Column properties go first. Then go column names (in the same order that properties).
Column properties have a fixed size of 18 bytes.
Assuming "Number of index entries" = 2 (for MSysObjects), column properties start at 0x3B. The number of columns was retrieved before, so you must read 18 * numcols bytes.
From (hex) | To (hex) | Len (dec) | Format | Description |
0x3B | 0x3B | 1 | Byte | Column type |
0x3C | 0x3C | 1 | Byte | Column Number |
0x3D | 0x3E | 2 | LE-Int | Variable col number |
0x3F | 0x40 | 2 | LE-Int | Row col number |
0x41 | 0x45 | 5 | | Unknown |
0x46 | 0x46 | 1 | Byte | Precision |
0x47 | 0x47 | 1 | Byte | Scale |
0x48 | 0x48 | 1 | Byte | Flags. Include isfixed, is long autonumeric, uuid auto |
0x49 | 0x4A | 2 | LE-Int | Column fixed offset |
0x4B | 0x4C | 2 | LE-Int | Column size offset |
After the properties, the name of the columns appear. The first byte contains the size and then you read that amount of bytes (the name). So, column names canĀ“t be bigger than 256 chars
Column names are stored in compressed unicode.
For example, for MSysObjects there are 17 columns, so 0x3B adds to 17 * 18 = 16D The first column is Id (len 2)
From (hex) | To (hex) | Len (dec) | Format | Description |
0x16D | 0x16D | 1 | Byte | Column name size |
0x16E | 0x16F | 2 | Unicode Text | Column Name |
After the columns, the indexes definition follow.
Data allocation
Data is stored in data pages (page type 1) with the structure:
From (hex) | To (hex) | Len (dec) | Format | Description |
0x0 | 0x0 | 1 | Byte | Page Type (1 for this page) |
0x1 | 0x1 | 1 | Byte | Unknown |
0x2 | 0x3 | 2 | LE-Int | Free space in this page |
0x4 | 0x7 | 4 | LE-Int | Page pointer to page definition |
0x8 | 0x9 | 2 | LE-Int | Number of records |
After that information, there are offsets of 2 bytes to records (there are 2 * num_records bytes of this items). After that, records start.
To know record len you substract next offset to previous offset.
In the table definition there is a field called
Pages usage bitmask:
The 3 first bytes of that item are the page and the last byte the row of the page usage bitmap.
For example, in the sample database, the bitmap is located in Page 6, row 0.
Bitmap data is 133 bytes long and contain:
From (hex) | To (hex) | Len (dec) | Format | Description |
0x0 | 0x0 | 1 | Byte | Map type (0 or 1) |
0x1 | 0x4 | 4 | Byte | Base page for which this map applies |
0x5 | 0x85 | 128 | Byte | Each bit contains allocation of this table (boolean). Allocation can be set to 128 pages after base page |
So, this basic allocation cannot hold more than 128 pages. As pages are of size 2048, that gives: 128 * 8 * 2048 = 2Mb. For bigger databases the field
Pages usage bitmask points to a record containing an map identifier (1) of 1 byte and secuentially pointers of 4 bytes to pages of type 5 that contain bitmaps.
The code
Solution structure
The solution contains 5 projects:
- MdbToolsLib: Contains the GLib (C Libraries) and libmdb (Access). It also contains classes to interface with projects MFCMdbTools and NETMdbTools. It uses STL.
- MFCMdbTools: The idea of this Dll is to act as a layer between MFC and MdbToolsLib. Most of the code is CString conversion
- NETMdbTools: The idea of this Dll is to act as a layer between .NET and MdbToolsLib. It uses C++/CLI.
- MFCMdbToolsTestApp: MFC test app that uses MFCMdbTools.
- NETMdbToolsTestApp: C# test app that uses NETMdbTools. It includes export functionality.
Naming conventions
There are 3 projects which share the same functionality (MdbToolsLib, MFCMdbTools, NETMdbTools) for different libraries (C/C++, MFC, .NET).
So, there are 3 prefixes, one for each library:
- MdbToolsLib: The prefix is MdbLib. For example, MdbLibDatabase
- MFCMdbTools: The prefix is MFCMdb. For example, MFCMdbDatabase
- NETMdbTools: The prefix is NETMdb. For example, NETMdbDatabase
MdbToolsLib
CMdbLibDatabase() | Initializes the MdbTools handler as 0 so that you can check if a database is loaded. |
void* mdb; | Private variable that contains the MdbHandle (struct) of the database. It is declared as void so that you don't need the MdbTools.h header. |
int Open(const char* strname); | Opens the database and assigns Mdbhandle. Returns a 0 for error, other for success. |
void Close(); | If there is an open database, it closes all tables handles and closes the Mdbhandle database. |
void LoadTables(); | Loads the database catalog and creates CMdbLibTable objects for each one in the tables array. |
map<char*, CMdbLibTable, cmp_str> tables; | This is the array of tables. cmp_str is a comparison to find items sorted by char* |
CMdbLibTable() | Initializes the MdbTableDef object (tdef) and the name (char*) as 0. |
char* Name; | Contains the name of the table. It is allocated and assigned in CMdbLibDatabase::LoadTables() and deleted in ClearTableColumns() |
map<char*, CMdbLibColumn, cmp_str> columns; | This is the array with the columns of the table. It is loaded in LoadTableColumns() |
map<char*, CMdbLibValue, cmp_str> cvalues; | This is the array with the values for iterating the table rows. It is loaded and allocated in LoadTableColumns() but values are assigned in NextItem() |
void LoadTableColumns(); | Loads columns and values (columns and cvalues arrays). Allocates memory for both. Binds cvalue data to col so that it is assigned when row changes |
void ClearTableColumns(); | Frees the MdbTableDef tdef variable, and all cvalues allocated space for row data. |
int NextItem(); | Executes the mdb_fetch_row and returns if eof. Values are assigned inside mdb_fetch_row by the bind_ptr. |
void MoveFirst(); | Executes the mdb_rewind_table so that record cursor moves to the first row. |
void* mdb; | MdbHandle variable of the database. Declared as void to avoid Mdbtools.h dependency. |
void* tdef; | This is the MdbTools handle for the table. It is assigned in LoadTableColumns(). |
BOOL IsSystem; | If the table is system table (name starts with MSys). |
void* GetEntryByName(char* name); | Finds the entry in the catalog by table name. This entry is neccesary to perform the mdb_read_table. |
enum MdbLibColType | Contain the column type constants. It mimics MdbTools column type constants. They exist to provide independence from MdbTools.h. |
char* Name; | Column name. It is allocated and assigned in CMdbLibTable::LoadTableColumns(). |
MdbLibColType Type; | Column type. |
int Size; | Column fiexd size. |
int IsInt(); | Boolean used to group all integer types for formatting porpuses. |
int IsNumDec(); | Boolean used to group all float types for formatting porpuses. |
int IsBinary(); | Boolean used to group all binary (memo, OLE, etc) types for formatting porpuses. |
int IsDate(); | Returns if the column type is date (for formatting porpuses). |
char* value; | This is the buffer where the column value is stored when iterating rows. It is assigned to bind_ptr in CMdbLibTable::LoadTableColumns(). |
int len; | This is real len of the data stored in value. It is assigned to len_ptr in CMdbLibTable::LoadTableColumns(). |
int GetIntValue(); | Returns the value stored in "value" variable as int. |
double GetDoubleValue(); | Returns the value stored in "value" variable as double. To do so it eliminates the exponent and multiplies accordingly. |
void GetTimeValue(SYSTEMTIME* st); | Returns the date in SYSTEMTIME format. |
Libraries
The original MdbTools uses GLib, so, there are dependencies and libglib is needed to run the application.
I had the idea of replacing GLib with STL and native windows functions, but then, it would be a lot of work and besides, when MdbTools sourcecode improves, I would have a difficult time upgrading.
The project is in VS 2010 and uses the latest MFC and .NET 4 but it can be easily ported the older versions of Visual Studio.
To compile MFC and .NET apps, first compile GLib and copy the .lib in the glib folder of the app.
History
- 2012-01-13: Improve article with some MsAccess format detail.
- 2011-12-20:
- Made a common .lib interface (including GLib) for both .NET and MFC and put it all into one single project.
- Added support for Binary fields (OLE). Improved support for numbers and dates
- In the .NET app added export to DAO database to repair corrupt databases
- 2011-11-21. GLib library now links into dlls.
- 2011-11-14. First version that uses glib dlls.
Licence and sourcecode
The licence for this project is LGPL which basically means that you can use the library in any project (even commercial apps) but if you make modifications to the library itself you must publish them. The sources were taken from https://github.com/brianb/mdbtools.
To do
- Add support for other database objects (queries, modules, forms, etc).
- Add ATL/COM interface for MdbTools.
- Add write support.
Acknowledgements
To the authors of mdbtools: Brian Bruns, Karl Nyberg, Georg Bauer, Carl Seutter, Trevor Harrison, Brent Johnson, Tim Nelson, David Mansfield, Jeff Smith, Steve Langasek, Rene Engelhard, Vincent Fourmond, Tim Retout, Nirgal Vourgere