Introduction
This article summarizes the things that I learned when I wrote a new MySQL storage engine (“SE”). MySQL’s storage engines are plugins which are responsible for actually storing the data on disk, and providing access to the data. The SEs usually are implemented as key/value databases, but not necessarily have to. Oracle’s MySQL comes with a variety of SEs, and the whole SE landscape can be a bit overwhelming. The default is InnoDB, a key/value database library. Other SEs can read and write CSV files (“Tina”) or were written specifically for archiving the data (“archive”). MariaDB, a fork, is more progressive and includes SEs for accessing Cassandra NoSQL databases, Percona’s TokuDB key/value library and xtradb, an improved version of InnoDB.
In total, MySQL comes with 13 SEs, MariaDB has about 20. That’s clearly not enough. Let’s add another one!
Downloading Code and First Impressions
You can download the MySQL code from GitHub. I am working with version 5.7.12.
git clone https://github.com/mysql/mysql-server.git
cd mysql-server
git checkout mysql-5.7-12
Let’s have a brief look at the source code structure. The amount of files and directories is a bit intimidating, but most of them are not relevant for us. There are just a few directories which are worth a closer look:
include Stores global include files. Here you will find mysql.h,
which has many important macros and declarations.
sql Contains the actual sql related code: parser, query engine etc.
sql/field.h The Field class describes a MySQL column
sql/item.h The Item class is a node in the abstract syntax tree
which is generated by the parser
sql/sql_*.cc These files implement the actual SQL operations
sql/handler.h The base class for a storage engine
storage This directory has all the storage engines
storage/innobase The InnoDB storage engine
storage/example A storage engine example project
After browsing through the code, it becomes clear that parts of MySQL’s code base are relatively old. While most code is implemented as C++, it’s actually a “C with classes”. More advanced C++ features like templates, exceptions or the standard library are not used. Linked lists are implemented as structures wrapping a “void *
” object (my_list.h). MySQL also has its own string
implementation (string.c) instead of using std::string
. RAII is also not used. Instead, you will find many “goto
” instructions which are responsible for cleaning up allocated resources.
Functions can be extremely long (more than 1000 lines of code, with many levels of nested “if
” clauses - mysql_prepare_create_table), and some of the classes are also huge. Most methods are stored in the base class instead of derived classes, and in addition, the classes store a lot of state.
MySQL uses several memory allocators (memroot_allocator.h). They are not STL compatible. Many classes overwrite operator new
and operator delete
to use a custom allocator.
The code has no consistent coding style. Indention varies, structure names are sometimes capitalized (“struct THD
”), lower case with underscores (“st_mem_root
”) or start with a capital letter and continue with underscores (“Field_long
”). The use of spaces varies as well (foo= 3; foo = 3; foo=3
).
I have never spoken to someone working for Oracle or MySQL AB, but we can try to make a few guesses about their corporate culture based on the code we have seen. The lack of a coding style means that each engineer can pick the one he’s most familiar with. Coding styles vary a lot, even in the same file, which can be a sign that there is no “code ownership” and everyone can work on every part of the code. I have worked with similar code in the past, in a similar environment, and it was a great place to work.
The use of “C with classes”, the refusal to use the STL and the archaic long and complex functions will require lots of refactoring if MySQL wants to get rid of its technical debt and attract new developers.
Compile, Install, Run, Debug
Let’s continue with the installation. If you followed the steps above, then you already checked out the code into the “mysql-server” directory. In here, we will run cmake
, which will generate the Makefiles. We will create a Debug
build which is slower but can be used for debugging.
cd mysql-server
cmake -DCMAKE_BUILD_TYPE=Debug
make -j 5
sudo make install
Your compiled MySQL files are now installed in /usr/local/mysql. You can run cmake
in a subdirectory to have separate builds for Debug and Release, and you can choose different installation directories. Run “cmake --help
” for a list of options. For this article, I’ll just keep everything as simple as possible.
Now there are just a few more things to do till we can successfully start the MySQL daemon. We need to create a new data directory (which stores the tables) and initialize it (note that you will have to change the path names in the following commands):
mkdir /home/chris/tmp/mysql-data
cd /usr/local/mysql/bin
./mysqld --initialize --datadir=/home/chris/tmp/mysql-data
The last command will generate a root password. Write it down, you will need it in a moment. I’ve set up my development environment with an empty root password - it’s just more convenient while testing. You can reset the password with the following commands (replace <PASSWORD>
with the generated root password from above):
# start the server
./mysqld --datadir=/home/chris/tmp/mysql-data
# in a separate terminal we can now change the password
./mysqladmin password --user=root --password=<password>
Now we can create a new database:
./mysqladmin create test --user=root --password
The client can be started as well, and then you can finally create tables, insert data, etc.
./mysql --user=root test
Bootstrapping a New Storage Engine
A storage engine is implemented as a dynamic library (a “.so” file), and the sources are stored in the “mysql-server/storage” directory. If you just want to toy around then you can modify the existing “example” SE. I chose to create my own called “upscaledb
” with the following steps:
cd mysql-server/storage
cp -r ha_example ha_upscaledb
cd ha_upscaledb
mv ha_example.h ha_upscaledb.h
mv ha_example.cc ha_upscaledb.cc
As a last step, use your favourite IDE (I use vim) to replace “example
” with “upscaledb
” and “EXAMPLE
” with “UPSCALEDB
”. Don’t forget to change the CMakeLists.txt file as well. Then go to the root directory of mysql-server
and once more run “cmake
” and “make
”. The new upscaledb
storage engine is now built, the file name is mysql-server/storage/upscaledb/ha_upscaledb.so.
Now we have to inform MySQL about the new storage engine. First, we create a symbolic link from the installation directory to the new .so file. With this link, our server will always use the newest version of the .so file. Whenever we make changes, we simply compile the storage engine and restart the MySQL server.
cd /usr/local/mysql/lib/plugin
sudo ln -s ~/prj/mysql-server/storage/upscaledb/ha_upscaledb.so ha_upscaledb.so
The final step is to update MySQL’s internal system table. We can do this with the MySQL client. Make sure that the MySQL server is still running!
cd /usr/local/mysql/bin
./mysql --user=root test
mysql> INSTALL PLUGIN upscaledb SONAME ‘ha_upscaledb.so’;
You can now start using the new SE and try to create tables (CREATE TABLE test (value INTEGER) ENGINE=upscaledb;
). Since our SE is just a skeleton without any implementation, we will get an error. Before we start adding logic, I’ll show you how to debug the MySQL server. The following command starts the server in gdb
, and lets gdb
catch all signals (i.e., CTRL-C breaks into the debugger):
cd /usr/local/mysql/bin
gdb --args ./mysqld --gdb --datadir=/home/chris/tmp/mysql-data
Try setting a breakpoint in the “create()
” method of your SE, and run the CREATE TABLE
command from above once more!
Adding Functionality
It’s now time to fill in the various methods of our Handler
class. It would need too much time to describe every method in detail. Besides, some of them are still unknown to me. But I will give an overview, and provide links to actual implementations.
It is important to be aware that there can be multiple instances of a Handler
working on the same table. The actual table data therefore needs to be stored in a separate object (a “Share
”), and the Handler
s, then acquire the Share
. If your data is stored in a plain file, then the file handle would be a member of the Share
, not of your Handler
.
You can look at the ExampleShare
here and my UpscaledbShare
here. You see that the UpscaledbShare
stores the actual database handles, and additional metadata about the databases.
Creating Tables
ExampleHandler::create and UpscaledbHandler::create are called whenever a table is created. Immediately afterwards, MySQL will call the open()
method on that table. Your create()
method therefore can prepare the table, but does not actually have to open it.
The implementation of UpscaledbHandler::create()
is straight forward. It creates an upscaledb
Environment, then creates a database for each index. If the user created a table without a primary key, then an index is generated. The database configuration depends on the actual type of the column and a few other parameters (i.e., whether it is unique or not).
Opening Tables
ExampleHandler::open and UpscaledbHandler::open are called whenever a table is opened. As mentioned above, this can (and will) happen many times for the same table. You therefore have to store the actual table data in your “Share
”.
After retrieving a pointer to the Share
object, the UpscaledbHandler::open()
method checks if the upscaledb
environment was already opened. If yes, it returns immediately. If not, then it proceeds with opening the file, and stores the environment’s handle in the “Share
”.
Closing Tables
The ExampleHandler::close and UpscaledbHandler::close methods are supposed to close the table. If your table data is stored in the Share
, then you might use reference-counting to figure out when to destroy the Share
object. In my UpscaledbHandler
, I never destroy the share; after all, the Share
will be required again sooner or later.
INSERTing Rows
Whenever you invoke the INSERT SQL
statement, the write_row()
method of your handler is invoked. Its only parameter is the new row, serialized in a byte array. This array stores the actual columns in different order than the one you specified when calling the CREATE TABLE
statement. The primary key is always at the beginning, followed by all other indexed columns, and finally followed by the non-indexed column.
This byte array usually starts with an (optional) bitmap describing the null
values in the current column. It is followed by either fixed length columns, or columns with variable length (a TINYTEXT
, MEDIUMTEXT
, TEXT
, LONGTEXT
or one of the corresponding BLOB
columns). The variable length columns start with one or two bytes storing the size of the column, followed by the data (which can be stored in a separate memory chunk; in this case the byte array contains an encoded pointer to the actual data). If you persist a row in a file, it makes sense to “condense
” variable length rows in a more compact format, in order to reduce space.
My UpscaledbHandler
caches the indexed Field
objects to quickly extract the indexed columns. The following code can then be used to extract the key for an indexed column (“index
” is the numerical ID of the index; the primary index is always 0
).
static inline ups_key_t
key_from_row(TABLE *table, const uchar *buf, int index)
{
KEY_PART_INFO *key_part = table->key_info[index].key_part;
uint16_t key_size = (uint16_t)key_part->length;
uint32_t offset = key_part->offset;
if (key_part->type == HA_KEYTYPE_VARTEXT1
|| key_part->type == HA_KEYTYPE_VARBINARY1) {
key_size = buf[offset];
offset += 1;
}
else if (key_part->type == HA_KEYTYPE_VARTEXT2
|| key_part->type == HA_KEYTYPE_VARBINARY2) {
key_size = *(uint16_t *)&buf[offset];
offset += 2;
}
ups_key_t key = ups_make_key((uchar *)buf + offset, key_size);
return key;
}
Having extracted the index keys, you can then store the row’s data in your file. You will have to deal with three cases:
- The user did not specify ANY index or primary key
- The user specified a primary key but no other indexes
- The user specified a primary key and additional indexes
If that’s not complex enough, then also keep in mind that an index can be “virtual
”, i.e., it combines multiple columns (as in the following statement).
CREATE TABLE test (
id INT NOT NULL,
last_name CHAR(30) NOT NULL,
first_name CHAR(30) NOT NULL,
PRIMARY KEY (id),
INDEX name (last_name, first_name)
);
Here is the implementation of the UpscaledbHandler::write_row method.
DELETEing Rows
The DELETE SQL
command ends up calling the delete_row()
method of your handler. You have to make sure that ALL indices are deleted, not just the primary one. However, deleting the primary key is actually quite simple because MySQL will use a database cursor to locate it. Here is the UpscaledbHandler::delete_row implementation.
UPDATEing Rows
This is the most complex one - at least if you want to make it fast. The update_row() method receives two parameters; the old row’s value and the new row’s value. The naive solution is to call delete_row()
for the old row, then write_row()
for the new row. This works, but is really slow because you end up updating all columns, even if just one was updated. It is much faster to only update those columns that were modified.
Cursors
For a lot of tasks, the MySQL core will just create a database cursor, locate a key (either on the primary or on a secondary index) and then move forward while working with the actual data. There are a few methods that you have to implement to support the cursors.
- index_init(): creates a cursor for a secondary index
- index_end(): can close the cursor
- index_read_map(): positions the cursor on a row
- index_next(): moves cursor to the next key, retrieves the row
- index_prev(): moves cursor to the previous key, retrieves the row
- index_last(): moves cursor to the last key, retrieves the row
- index_first(): moves cursor to the first key, retrieves the row
- index_next_same(): moves cursor to the next duplicate of the current key
- rnd_init(): creates a cursor for the primary index
- rnd_end(): closes the cursor
- rnd_next(): moves cursor to the next key, retrieves the row
- rnd_pos(): moves cursor to a specified row
Other Methods
A few other methods which are important or interesting enough to be mentioned.
- rename_table(): renames a table (and all its files). This is used whenever your schema is changed, i.e., because you add a column. MySQL copies all the data to a temporary table, then uses the
rename_table()
method to rename the temporary table into your original table. - delete_table(): deletes a table (and all its files).
- table_flags(): returns a set of flags which describe the capabilities of your handler. A lot of these are not well documented, and many are geared towards InnoDB. My guess is that only InnoDB implements all capabilities.
- index_flags(): returns a set of flags which describe the capabilities of your handler - i.e., whether your
index_prev()
and index_next()
methods are provided, etc.
Conclusion
Writing your own MySQL Storage Engine sounds like a complex endeavor, but actually it is not. You do not have to implement all the methods that I described above. For some methods, MySQL will deal with a missing implementation and provide its own. For others, it will simply abort some SQL commands with an error. The ExampleHandler
is basically empty and does not provide any functionality. But nevertheless you can load it, place breakpoints in a debugger and start adding functionality piece by piece. If you get stuck, then you can have a look at existing SEs of MySQL and MariaDB. And the developers on the mysql-internals mailing list are also very helpful.
Ideas for interesting SEs coming to my mind:
- An append-only database which persists its data in HDFS; users can then use Spark or Hadoop’s Map/Reduce jobs to further process the data
- In-memory tables based on
std::map
or std::multi_map
- SEs which are backed by XML files
What other ideas do you have?
History
- 20th June, 2016: Initial version