Table of Contents
- Introduction
- The Design
- Implementation
I ran into a problem where I needed to store millions of records of arbitrary data, ranging from 1KB to 8KB each, in such a way that I can find, search, read, and iterate through my data as quickly as possible, just like native disk access.
While SQLite is very fast at first, I dislike using it due to its inefficient use of space, and it slows down dramatically over time when a lot of data is written and/or deleted. SQLite also mixes data and indexes into the same file, making linear iteration a horrible experience; when records get deleted, they leave gaps that never get filled until you call the vacuum() command. See more about SQLite problems in this Mozilla article.
So, I want to make a database that is just as fast as SQLite, but does not slow down over time, reuses deleted space better, and does not require vacuum().
The fully implemented source code can be downloaded here on GitHub (however, I recommend going through the article before looking at the code).
Goals
- I set a very specific goal to make things easier: I made a database to store cows, where the CowModel is defined as:
public class CowModel
{
    public Guid Id { get; set; }
    public string Breed { get; set; }
    public int Age { get; set; }
    public string Name { get; set; }
    public byte[] DnaData { get; set; }
}
As you can see, the sizes of the Breed, Name, and DnaData properties are unknown, making this project more challenging as the database needs to handle variable-length data.
- The deliverable will be a single C# DLL that we can hook straight into any main project, with the following interface that allows us to interact with the database. Note that all read/search/delete operations are indexed (i.e., they use indexes rather than full scans):
public interface ICowDatabase
{
    void Insert (CowModel cow);
    void Delete (CowModel cow);
    void Update (CowModel cow);
    CowModel Find (Guid id);
    IEnumerable<CowModel> FindBy (string breed, int age);
}
Scope
This is a simple database that requires only 24-32 hours of work; the main goals are performance and learning how a database works, so you can craft your own.
Features such as ACID compliance, transactions/atomic writes, and multi-user support are not included; however, the design is open for these to be implemented later.
The Design
We will store our data in one single file. Indexes are stored separately, one per file.
For the database to be able to reuse the space of deleted data, instead of treating the database file as a stream, we treat it as a series of individual blocks of equal size. This guarantees a minimum unit of allocation, making unused space reusable.
The size of a block is your choice, but the typical file system block size is 4KB, where the OS reads and writes data in 4KB blocks, so have your block size be 256B, 512B, 1024B, etc., to make sure your database blocks stay aligned with the underlying file system blocks.
We call this BlockStorage. I'll pick 4KB as my block size, as my cow DNA data is pretty large.
The content of a block can be anything; a block doesn't care what it stores. Each block also supports a metadata header, so later we can store things such as references to other blocks, the actual content size in use within a block, and so on.
We will be accessing Block Storage using the following interface:
public interface IBlockStorage
{
    int BlockContentSize { get; }
    int BlockHeaderSize { get; }
    int BlockSize { get; }

    IBlock Find (uint blockId);
    IBlock CreateNew ();
}
public interface IBlock : IDisposable
{
    uint Id { get; }

    long GetHeader (int field);
    void SetHeader (int field, long value);
    void Read (byte[] dst, int dstOffset, int srcOffset, int count);
    void Write (byte[] src, int srcOffset, int dstOffset, int count);
}
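To make the contract concrete, here is a minimal usage sketch. Only the interfaces are defined so far, so the BlockStorage constructor arguments shown here are my assumption:

using System.IO;

// Open (or create) the database file and treat it as 4KB blocks.
// Assumed constructor: (stream, blockSize).
IBlockStorage storage = new BlockStorage (
    new FileStream ("data.db", FileMode.OpenOrCreate), 4096);

using (IBlock block = storage.CreateNew ())
{
    // Write a few bytes into the block's content area...
    var payload = new byte[] { 1, 2, 3, 4 };
    block.Write (src: payload, srcOffset: 0, dstOffset: 0, count: payload.Length);

    // ...and note the used length in a header field (the field number is arbitrary here).
    block.SetHeader (2, payload.Length);
}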
Block Storage alone does not meet our requirement that data can be variable in length.
To support variable-length records, such as CowModel whose Breed, Name, and DnaData properties are unknown in length, we add another layer on top of Block Storage, called Record Storage.
Here, a record is made up of one or multiple blocks, in order. In other words, a Record is a linked list of Blocks.
A record may use a full block to store its data, or just part of a block, as defined in the length field of the block's metadata header. (We'll get to the implementation later.)
A record also has an ID, which is the ID of the first block it is made up from.
A block can be logically marked as deleted, using one of the block's metadata header fields. We will also use a special record, record #0 (the first record), to store a stack of deleted block IDs. Deleting a record is done by logically marking its blocks as deleted, then "pushing" the IDs of those blocks onto record #0. This allows the space of unused blocks to be reclaimed when creating a new record, by reading record #0 and "popping" out a free ID (if there is one) to reuse.
The use of Record Storage is as follows:
public interface IRecordStorage
{
    void Update (uint recordId, byte[] data);
    byte[] Find (uint recordId);
    uint Create ();
    uint Create (byte[] data);
    uint Create (Func<uint, byte[]> dataGenerator);
    void Delete (uint recordId);
}
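A quick usage sketch of the record layer (storage is the IBlockStorage from the previous section; the RecordStorage constructor signature is my assumption):

// Record Storage sits on top of Block Storage.
IRecordStorage records = new RecordStorage (storage);

// Create a record from an arbitrary byte array; the returned
// ID is the ID of the first block the record occupies.
uint id = records.Create (new byte[] { 1, 2, 3 });

// Read it back, grow it (possibly spanning more blocks), then delete it.
byte[] data = records.Find (id);
records.Update (id, new byte[8192]);
records.Delete (id);  // its blocks get pushed onto record #0 for reuse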
Indexing allows searching for records based on their properties (e.g., find a cow by Id, find cows by Breed); of course, you can iterate through all records to look for what you want, but that becomes very inefficient once you have more than a few hundred records.
B-Tree is the most popular data structure for building an index in a database. If you are unfamiliar with B-Trees or tree structures in general, you can visualize the benefits as a List<Tuple<K, V>> sorted by K, where K is the indexing key (e.g., a Cow's Id or Breed) and V is the ID of a record in the RecordStorage mentioned above. This enables you to do a BinarySearch on key K and instantly return the position of V, regardless of how big the list is.
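To make that mental model concrete, here is a sketch of an index as a sorted list with binary search. This is for illustration only (and is itself too slow for disk-based storage, since inserting into a list is O(n)); the real index is a B-Tree:

using System;
using System.Collections.Generic;

// Illustration only: an "index" as a List<Tuple<K, V>> kept sorted by key.
class NaiveIndex<K, V> where K : IComparable<K>
{
    static readonly IComparer<Tuple<K, V>> KeyComparer =
        Comparer<Tuple<K, V>>.Create ((a, b) => a.Item1.CompareTo (b.Item1));

    readonly List<Tuple<K, V>> entries = new List<Tuple<K, V>> ();

    public void Insert (K key, V value)
    {
        var entry = Tuple.Create (key, value);
        int pos = entries.BinarySearch (entry, KeyComparer);
        entries.Insert (pos < 0 ? ~pos : pos, entry);  // keep the list sorted by key
    }

    public V Get (K key)
    {
        // O(log n) lookup, no matter how many entries there are.
        int pos = entries.BinarySearch (Tuple.Create (key, default (V)), KeyComparer);
        if (pos < 0) throw new KeyNotFoundException ();
        return entries[pos].Item2;
    }
}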
For more information about B-Trees or tree structures in general, I recommend reading about Binary Trees, Binary Search Trees, Self-Balancing Binary Search Trees, Red-Black Trees, and B-Trees.
For now, all we need to know is that an index stores Key-Value pairs, where Key is one of our CowModel's properties and Value is the record ID of our data. It not only allows instant lookup of Value by Key (like a Dictionary), but also finds the position of an arbitrary key K and iterates from K through all other keys in either ascending or descending order.
A B-Tree is made up of individual nodes, and each node keeps references to other nodes. We store the index by serializing each node into a record, utilizing our RecordStorage designed above. Each index lives in its own file.
When we need to access a Node, we read its binary content and deserialize it into a Node, make changes to it, then serialize it back to a byte array and save it back into its record.
For best performance, a Node should fit perfectly into a 4KB block. We can tune the number of entries per Node to make it fit, as in the rough calculation below.
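For example, a back-of-the-envelope calculation for the primary index, a Tree<Guid, uint>. The header size and on-disk layout here are illustrative assumptions, not the exact format:

// Each entry is a 16-byte Guid key plus a 4-byte record ID, and a node
// with n entries also stores n + 1 child block IDs (4 bytes each).
const int blockContentSize = 4096 - 48;  // assume a 48-byte block header
const int entrySize = 16 + 4;
const int childPointerSize = 4;

// Solve: n * entrySize + (n + 1) * childPointerSize <= blockContentSize
int maxEntriesPerNode = (blockContentSize - childPointerSize)
                        / (entrySize + childPointerSize);  // ~168 entries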
The use of Index is as follows:
public interface IIndex<K, V>
{
    void Insert (K key, V value);
    Tuple<K, V> Get (K key);
    IEnumerable<Tuple<K, V>> LargerThanOrEqualTo (K key);
    IEnumerable<Tuple<K, V>> LargerThan (K key);
    IEnumerable<Tuple<K, V>> LessThanOrEqualTo (K key);
    IEnumerable<Tuple<K, V>> LessThan (K key);
    bool Delete (K key, V value, IComparer<V> valueComparer = null);
    bool Delete (K key);
}
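A usage sketch of the secondary index; index and recordId are assumed to come from the surrounding database code:

// index: an IIndex<Tuple<string, int>, uint> mapping (Breed, Age) -> record ID.
index.Insert (Tuple.Create ("Angus", 3), recordId);

// Exact lookup.
Tuple<Tuple<string, int>, uint> entry = index.Get (Tuple.Create ("Angus", 3));

// Range iteration: all Angus cows aged 3 or more, in ascending key order.
foreach (var pair in index.LargerThanOrEqualTo (Tuple.Create ("Angus", 3)))
{
    if (pair.Item1.Item1 != "Angus")
        break;  // keys are sorted, so we can stop once the breed changes
    uint id = pair.Item2;  // feed this into RecordStorage.Find()
}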
We will put IRecordStorage and one or more IIndex instances together to make up a complete database. Once these two interfaces are implemented, there is not much left to do:
- Inserting a record into the database is done by creating a record via IRecordStorage.Create(), which returns a uint ID; this ID is then used with IIndex for indexing.
- Updating a record also updates the relevant IIndex.
- Deleting a record also deletes its entry from IIndex.
- Searching for a record uses IIndex whenever possible. When none of the indexes can be used, we iterate through all primary index entries (a full index scan).
Implementation
The source code is attached. You can also find my implementation here on GitHub.
Let's call the database "FooDB". The main source code is located in FooCore, along with its unit tests in FooTest, and our Cow application that uses FooDB to build its CowDatabase is in FooApplication.
Interesting classes to look at:
- FooCore/BlockStorage.cs
- FooCore/Block.cs
As BlockStorage has no dependencies, the implementation is easy and straightforward.
Creating New Blocks
BlockStorage initializes with a Stream as its only dependency. BlockStorage maintains the length of the stream as a multiple of the blockSize specified when you initialize it. When a client asks to create a new block, it simply extends the length of the underlying Stream, as sketched below.
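A minimal sketch of block creation (simplified; the Block constructor shown is an assumption):

public IBlock CreateNew ()
{
    // The stream length is always a multiple of blockSize,
    // so the new block's ID is simply length / blockSize.
    var blockId = (uint)(this.stream.Length / this.blockSize);

    // Extend the stream by one block; the new region reads as zeroes.
    this.stream.SetLength ((long)(blockId + 1) * this.blockSize);
    this.stream.Flush ();

    return new Block (this, blockId, this.stream);
}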
Finding a Block
- Each block is identified by a unique ID, which is also its position within the Stream. For example, if you have a Stream of length 4096B and a blockSize of 1024B, then you have 4 blocks: 0, 1, 2, 3.
- You can easily jump to block #3 by setting the stream position to 3 * 1024B = 3072, as in the sketch below.
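Find() then just converts an ID into an offset and seeks there. A sketch (again, the Block constructor is an assumption):

public IBlock Find (uint blockId)
{
    // A block's ID doubles as its ordinal position in the stream.
    var offset = (long)blockId * this.blockSize;
    if (offset + this.blockSize > this.stream.Length)
        return null;  // no such block

    this.stream.Position = offset;
    return new Block (this, blockId, this.stream);
}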
Reading and Writing to a Block
- The Block implementation provides Read() and Write() methods to work with block contents. They are just simple wrappers around reading and writing byte arrays to/from the underlying Stream.
- The very first few bytes of each block are reserved for headers. A header is simply a long value (8 bytes), and it can be anything; a block does not care about its headers, except that it allows clients to read and write them (see the sketch after this list).
- When a client finishes using a block, it must call Dispose so that all cached data is flushed.
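The header accessors boil down to reading or writing 8 bytes at a fixed offset from the block's start. A simplified sketch (the real Block caches these values, as described in the notes below; BitConverter is used here for brevity and assumes a little-endian layout):

public long GetHeader (int field)
{
    // Header field N lives at byte offset N * 8 from the block's start.
    var buffer = new byte[8];
    this.stream.Position = ((long)this.Id * this.storage.BlockSize) + (field * 8);
    this.stream.Read (buffer, 0, 8);
    return BitConverter.ToInt64 (buffer, 0);
}

public void SetHeader (int field, long value)
{
    this.stream.Position = ((long)this.Id * this.storage.BlockSize) + (field * 8);
    this.stream.Write (BitConverter.GetBytes (value), 0, 8);
}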
NOTES
In this implementation, I cache the first 4KB of each block, which usually contains both the header and part of the body content, to optimize performance; as the file system typically reads and writes in 4KB blocks, this saves some write calls. However, it results in non-linear writes for blocks larger than 4KB (as the first 4KB of cached data always gets written last).
This should be fixed in write-sensitive programs, either by flushing the cache more intelligently or by caching the entire block in memory and writing it out in sequence when a Block is disposed.
Interesting classes to look at:
- FooCore/RecordStorage.cs
As you can see, RecordStorage initializes with an IBlockStorage as its dependency, as we designed RecordStorage to be built on top of BlockStorage.
Creating Records
- RecordStorage creates a record for an arbitrary data byte array by splitting the specified data into multiple blocks, like a linked list.
- It links one block to another using metadata headers (field #0 kNextBlockId and field #3 kPreviousBlockId) to identify the blocks before and after a particular block within a record.
- It uses a block header (field #2 kBlockContentLength) to mark the amount of actually usable data in a Block. This allows a record to have any size, instead of being a multiple of a particular block size.
- Allocating a new record is done by reusing deleted blocks whenever possible (i.e., "popping" a uint from record #0), or by asking the underlying BlockStorage to allocate new blocks, as in the sketch below.
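The allocation path then looks roughly like this (a sketch; TryPopFreeBlockId is a hypothetical helper that reads record #0, and the real code must also avoid recursing into record #0 itself):

IBlock AllocateBlock ()
{
    uint freeBlockId;
    if (TryPopFreeBlockId (out freeBlockId))
    {
        // Reuse a previously deleted block from record #0's stack...
        var block = this.blockStorage.Find (freeBlockId);
        block.SetHeader (kIsDeleted, 0L);  // clear the deleted flag (field #4)
        return block;
    }

    // ...or extend the file with a brand new block.
    return this.blockStorage.CreateNew ();
}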
Finding Records
- Finding a record is easy, as the ID of a record is also the ID of the first block it is made up from.
- If the first block has a next block, grab it, and continue until the last block is reached, as in the sketch below.
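In code, reading a record is a walk along its chain of blocks. A sketch, assuming a kNextBlockId of 0 marks the end of the chain:

public byte[] Find (uint recordId)
{
    using (var ms = new MemoryStream ())
    {
        var currentId = recordId;
        while (currentId != 0)
        {
            using (var block = this.blockStorage.Find (currentId))
            {
                // Only copy the bytes actually in use (field #2).
                var length = (int)block.GetHeader (kBlockContentLength);
                var buffer = new byte[length];
                block.Read (dst: buffer, dstOffset: 0, srcOffset: 0, count: length);
                ms.Write (buffer, 0, length);

                // Follow the link to the next block (field #0).
                currentId = (uint)block.GetHeader (kNextBlockId);
            }
        }
        return ms.ToArray ();
    }
}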
Deleting Records
- Deleting a record is done by marking the record and all its blocks as deleted (field #4 kIsDeleted).
- It uses a special record, record #0, to keep track of deleted blocks. This record acts like a stack of uint; when a record gets deleted, the uint IDs of its blocks are "pushed" onto the stack.
Updating Records
- If the new record data uses the same number of blocks as the old data, RecordStorage updates the content of the existing blocks.
- If the new record data uses more blocks than the old data, RecordStorage allocates (or reuses) more blocks as required.
- If the new record data uses fewer blocks than the old data, the unused blocks are marked as deleted.
Interesting classes to look at are as follows:
- FooCore/Tree.cs
- FooCore/TreeDiskNodeManager.cs
- FooCore/TreeDiskNodeSerializer.cs
- FooCore/TreeEnumerator.cs
- FooCore/TreeHelper.cs
- FooCore/TreeMemoryNodeManager.cs
- FooCore/TreeNode.cs
- FooCore/TreeTraverser.cs
The B-Tree algorithm is already well documented by other people; we just need to translate it into code. I followed these articles when implementing the B-Tree: this, this, and this. (It took me more than 18 hours anyway.)
The tricky part is that the tree needs to be accessible from the hard drive, not memory: as we access part of the tree, only that part gets loaded from the hard drive into memory, not the entire tree.
To do that, we just need to make sure of two things:
- TreeNode is serializable (can be converted to/constructed from a byte array). As a Tree is made up of multiple TreeNodes, a Tree is serializable if its nodes are serializable.
- Access to any TreeNode must be done through an instance of ITreeNodeManager, i.e., a TreeNode should not keep a direct in-memory reference to another TreeNode; instead, it keeps the IDs of other nodes, as in the sketch below.
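In code, the second rule means a node stores child IDs and resolves them through the node manager on demand. A sketch (the real TreeNode and ITreeNodeManager have more members; Find here is an assumed method name):

public class TreeNode<K, V>
{
    public uint Id;
    public List<K> Keys = new List<K> ();
    public List<V> Values = new List<V> ();
    public List<uint> ChildrenIds = new List<uint> ();  // IDs, never object references

    readonly ITreeNodeManager<K, V> nodeManager;

    public TreeNode (ITreeNodeManager<K, V> nodeManager)
    {
        this.nodeManager = nodeManager;
    }

    public TreeNode<K, V> GetChild (int index)
    {
        // The node manager loads (and caches) the child from disk on demand.
        return this.nodeManager.Find (ChildrenIds[index]);
    }
}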
For development and unit testing, I mocked up an implementation of ITreeNodeManager, the TreeMemoryNodeManager. Later on, I wrote TreeDiskNodeManager for production use, along with a TreeDiskNodeSerializer helper to serialize TreeNodes.
Creating the CowDatabase at this point is fairly easy and quick, as we have RecordStorage and the indexing built.
Please check out my implementation of the CowDatabase in the following files:
- FooApplication/FooDatabase.cs
- FooApplication/CowSerializer.cs
Initialization
FooDatabase initializes with a path to the database as its argument, and it uses a naming convention to locate the two indexes (a primary index on the cow Guid, and a secondary index on Breed and Age).
It then constructs the primary index, a Tree<Guid, uint> (mapping a cow's GUID to its record ID), and the secondary index, a Tree<Tuple<string, int>, uint> (mapping a cow's Breed and Age to its record ID).
Insertion, Deletion and Update of Cows
CowDatabase uses a RecordStorage instance to store the serialized data of cows, with the help of CowSerializer. Whenever a cow is inserted, removed, or updated, it calls Create(), Delete(), or Update() respectively on RecordStorage, as well as creating or updating the required index entries.
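Roughly, insertion is just the two layers glued together (a sketch with assumed field names; see FooApplication/FooDatabase.cs for the real version):

public void Insert (CowModel cow)
{
    // 1. Serialize the cow and store it, getting back its record ID.
    var data = CowSerializer.Serialize (cow);
    var recordId = this.cowRecords.Create (data);

    // 2. Register that record ID in both indexes.
    this.primaryIndex.Insert (cow.Id, recordId);
    this.secondaryIndex.Insert (Tuple.Create (cow.Breed, cow.Age), recordId);
}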
Searching for a Cow or Cows
Searching for a cow or cows is done by looking up either the primary or secondary index for the actual record IDs; those IDs are then used with RecordStorage to read the cow data.
Now, we can start storing and querying our cows as in FooApplication/Program.cs. Enjoy your custom-built database!
Points of Interest
I used Mono on Mac to write this demo. The performance is very satisfying; however, around 30-40% of CPU time was spent running C# code instead of doing actual IO (I was expecting it to be around 10-20%). Still, I think C# is better run on Windows.
I also faced an issue where my Mac caches all write calls in memory, and I have no way to flush it, even with FileStream.Flush(true). I think this is a Mono bug?
Anyway, this was a fun weekend project; I hope everyone enjoys it.