Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / MySQL

GUIDs as fast primary keys under multiple databases

4.92/5 (89 votes)
15 Mar 2013CPOL20 min read 282.2K  
Using the right approach can make GUIDs nearly as fast as integer primary keys on almost any database system.

Introduction 

This article outlines an approach for using GUID values as primary keys/clustered indexes that avoids most of the normal disadvantages, adapting the COMB model for sequential GUIDs developed by Jimmy Nilsson in his article The Cost of GUIDs as Primary Keys.  While that basic model has been used by a variety of libraries and frameworks (including NHibernate), most implementations seem to be specific to Microsoft SQL Server.  This article attempts to adapt the approach into a flexible system that's can be used with other common database systems such as Oracle, PostgreSQL, and MySQL, and also addresses some of the eccentricities of the .NET Framework in particular. 

Background    

Historically, a very common model for database design has used sequential integers to identify a row of data, usually generated by the server itself when the new row is inserted.  This is a simple, clean approach that's suitable for many applications.  

However, there are also some situations where it's not ideal.  With the increasing use of Object-Relational Mapping (ORM) frameworks such as NHibernate and the ADO.NET Entity Framework, relying on the server to generate key values adds a lot of complication that most people would prefer to avoid.  Likewise, replication scenarios also make it problematic to rely on a single authoritative source for key value creation -- the entire point is to minimize the role of a single authority. 

One tempting alternative is to use GUIDs as key values.  A GUID (globally unique identifier), also known as a UUID, is a 128-bit value that carries a reasonable guarantee of being unique across all of space and time.  Standards for creating GUIDs are described in RFC 4122, but most GUID-creation algorithms in common use today are either essentially a very long random number, or else combine a random-appearing component with some kind of identifying information for the local system, such as a network MAC address.   

GUIDs have the advantage of allowing developers to create new key values on the fly without having to check in with the server, and without having to worry that the value might already be used by someone else.  At first glance, they seem to provide a good answer to the problem.

So what's the issue?  Well, performance.  To get the best performance, most databases store rows in what's known as a clustered index, meaning that the rows in a table are actually stored on disk in a sorted order, usually based on a primary key value.  This makes finding a single row as simple as doing a quick lookup in the index, but it can make adding new rows to the table very slow if their primary key doesn't fall at the end of the list.  For example, consider the following data:

ID  Name 
Holmes, S. 
Watson, J. 
Moriarty, J. 

Pretty simple so far: the rows are stored in order according to the value of the ID column.  If we add a new row with an ID of 8, it's no problem: the row just gets tacked on to the end. 

ID Name 
Holmes, S. 
Watson, J. 
Moriarty, J. 
8 Lestrade, I. 

But now suppose we want to insert a row with an ID of 5:

ID Name 
1  Holmes, S. 
Watson, J. 
5Hudson, Mrs.
Moriarty, J. 
Lestrade, I. 

Rows 7 and 8 have to be moved down to make room.  Not such a big deal here, but when you're talking about inserting something into the middle of a table with millions of rows, it starts becoming an issue.  And when you want to do it a hundred times a second, it can really, really add up.  

And that's the problem with GUIDs: they may or may not be truly random, but most of them look random, in the sense that they're not usually generated to have any particular kind of order.  For that reason, it's generally considered a very bad practice to use a GUID value as part of a primary key in a database of any significant size.  Inserts can be very slow and involve a huge amount of unnecessary disk activity. 

Sequential GUIDs   

So, what's the solution?  Well, the main problem with GUIDs is their lack of sequence.  So, let's add a sequence.  The COMB approach (which stands for COMBined GUID/timestamp) replaces a portion of the GUID with a value which is guaranteed to increase, or at least not decrease, with each new value generated.  As the name implies, it does this by using a value generated from the current date and time.  

To illustrate, consider this list of typical GUID values:  

fda437b5-6edd-42dc-9bbd-c09d10460ad0
2cb56c59-ef3d-4d24-90e7-835ed5968cdc
6bce82f3-5bd2-4efc-8832-986227592f26
42af7078-4b9c-4664-ba01-0d492ba3bd83 

Note that the values aren't in any particular order and appear essentially random.  Inserting a million rows with this type of value as the primary key could be quite slow. 

Now consider this hypothetical list of special GUID values:   

00000001-a411-491d-969a-77bf40f55175
00000002-d97d-4bb9-a493-cad277999363
00000003-916c-4986-a363-0a9b9c95ca52
00000004-f827-452b-a3be-b77a3a4c95aa 

The first block of digits has been replaced with an increasing sequence -- say, the number of milliseconds since the program started.  Inserting a million rows of these values wouldn't be so bad, since each row would simply be appended to the end of the list and not require any reshuffling of existing data. 

Now that we have our basic concept, we need to get into some of the details of how GUIDs are constructed and how they're handled by different database systems. 

128-bit GUIDs are composed of four main blocks, called Data1, Data2, Data3, and Data4, which you can see in the example below: 

11111111-2222-3333-4444-444444444444

Data1 is four bytes, Data2 is two bytes, Data3 is two bytes, and Data4 is eight bytes (a few bits of Data3 and the first part of Data4 are reserved for version information, but that's more or less the structure).  

Most GUID algorithms in use today, and especially those used by the .NET Framework, are pretty much just fancy random number generators (Microsoft used to include the local machine's MAC address as part of the GUID, but discontinued that practice several years ago due to privacy concerns).  This is good news for us, because it means that playing around with different parts of the value is unlikely to damage the value's uniqueness all that much.  

But unfortunately for us, different databases handle GUIDs in different ways.  Some systems (Microsoft SQL Server, PostgreSQL) have a built-in GUID type which can store and manipulate GUIDs directly.  Databases without native GUID support have different conventions on how they can be emulated.  MySQL, for example, most commonly stores GUIDs by writing their string representation to a char(36) column.  Oracle usually stores the raw bytes of a GUID value in a raw(16) column. 

It gets even more complicated, because one eccentricity of Microsoft SQL Server is that it orders GUID values according to the least significant six bytes (i.e. the last six bytes of the Data4 block).  So, if we want to create a sequential GUID for use with SQL Server, we have to put the sequential portion at the end.  Most other database systems will want it at the beginning. 

The Algorithm  

Looking at the different ways databases handle GUIDs, it's clear that there can be no one-size-fits-all algorithm for sequential GUIDs; we'll have to customize it for our particular application.  After doing some experimentation, I've identified three main approaches that cover most pretty much all use cases:

  • Creating a GUID which is sequential when stored as a string 
  • Creating a GUID which is sequential when stored as binary data 
  • Creating a GUID which is sequential on SQL Server, with the sequential portion at the end  

(Why aren't GUIDs stored as strings the same as GUIDs stored as bytes?  Because of the way .NET handles GUIDs, the string representation may not be what you expect on little-endian systems, which is most machines likely to be running .NET.  More on that later.)    

I've represented these choices in code as an enumeration:

C#
public enum SequentialGuidType
{
  SequentialAsString,
  SequentialAsBinary,
  SequentialAtEnd
}   

Now we can define a method to generate our GUID which accepts one of those enumeration values, and tailor the result accordingly:

C#
public Guid NewSequentialGuid(SequentialGuidType guidType)
{
  ... 
} 

But how exactly do we create a sequential GUID?  Exactly which part of it do we keep "random," and which part do we replace with a timestamp?  Well, the original COMB specification, tailored for SQL Server, replaced the last six bytes of Data4 with a timestamp value.  This was partially out of convenience, since those six bytes are what SQL Server uses to order GUID values, but six bytes for a timestamp is a decent enough balance.  That leaves ten bytes for the random component.   

What makes the most sense to me is to start with a fresh random GUID.  Like I just said, we need ten random bytes:    

C#
var rng = new System.Security.Cryptography.RNGCryptoServiceProvider();
byte[] randomBytes = new byte[10];
rng.GetBytes(randomBytes);  

We use RNGCryptoServiceProvider to generate our random component because System.Random has some deficiencies that make it unsuitable for this purpose (the numbers it generates follow some identifiable patterns, for example, and will cycle after no more than 232 iterations).  Since we're relying on randomness to give us as much of a guarantee of uniqueness as we can realistically have, it's in our interests to make sure our initial state is as strongly random as it can be, and  RNGCryptoServiceProvider provides cryptographically strong random data. 

(It's also relatively slow, however, and so if performance is critical you might want to consider another method -- simply initializing a byte array with data from Guid.NewGuid(), for example.  I avoided this approach because Guid.NewGuid() itself makes no guarantees of randomness; that's just how the current implementation appears to work.  So, I choose to err on the side of caution and stick with a method I know will function reliably.) 

Okay, we now have the random portion of our new value, and all that remains is to replace part of it with our timestamp.  We decided on a six-byte timestamp, but what should it be based on?  One obvious choice would be to use DateTime.Now (or, as Rich Andersen points out, DateTime.UtcNow for better performance) and convert it to a six-byte integer value somehow.  The Ticks property is tempting: it returns the number of 100-nanosecond intervals that have elapsed since January 1, 0001 A.D.  However, there are a couple hitches. 

First, since Ticks returns a 64-bit integer and we only have 48 bits to play with, we'd have to chop off two bytes, and the remaining 48 bits' worth of 100-nanosecond intervals gives us less than a year before it overflows and cycles.  This would ruin the sequential ordering we're trying to set up, and destroy the performance gains we're hoping for, and since many applications will be in service longer than a year, we have to use a less precise measure of time. 

The other difficulty is that DateTime.UtcNow has a limited resolution.  According to the docs, the value might only update every 10 milliseconds.  (It seems to update more frequently on some systems, but we can't rely on that.)    

The good news is, those two hitches sort of cancel each other out: the limited resolution means there's no point in using the entire Ticks value.  So, instead of using ticks directly, we'll divide by 10,000 to give us the number of milliseconds that have elapsed since January 1, 0001, and then the least significant 48 bits of that will become our timestamp.  I use milliseconds because, even though DateTime.UtcNow is currently limited to 10-millisecond resolution on some systems, it may improve in the future, and I'd like to leave room for that.  Reducing the resolution of our timestamp to milliseconds also gives us until about 5800 A.D. before it overflows and cycles; hopefully this will be sufficient for most applications. 

Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential.  This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated.  However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window.  This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well.  With that in mind, I went with the method outlined here, since it's far simpler.  

Here's the code:

C#
long timestamp = DateTime.UtcNow.Ticks / 10000L;
byte[] timestampBytes = BitConverter.GetBytes(timestamp);       

Now we have our timestamp.  However, since we obtained the bytes from a numeric value using BitConverter, we have to account for byte order. 

if (BitConverter.IsLittleEndian)
{
  Array.Reverse(timestampBytes); 
}   

We have the bytes for the random portion of our GUID, and we have the bytes for the timestamp, so all that remains is to combine them.  At this point we have to tailor the format according to to the SequentialGuidType value passed in to our method.  For SequentialAsBinary and SequentialAsString types, we copy the timestamp first, followed by the random component.  For SequentialAtEnd types, the opposite. 

byte[] guidBytes = new byte[16];

switch (guidType)
{
  case SequentialGuidType.SequentialAsString:
  case SequentialGuidType.SequentialAsBinary:
    Buffer.BlockCopy(timestampBytes, 2, guidBytes, 0, 6);
    Buffer.BlockCopy(randomBytes, 0, guidBytes, 6, 10);
    break; 

  case SequentialGuidType.SequentialAtEnd:
    Buffer.BlockCopy(randomBytes, 0, guidBytes, 0, 10);
    Buffer.BlockCopy(timestampBytes, 2, guidBytes, 10, 6);
    break;
}    

So far, so good.  But now we get to one of the eccentricities of the .NET Framework: it doesn't just treat GUIDs as a sequence of bytes.   For some reason, it regards a GUID as a struct containing a 32-bit integer, two 16-bit integers, and eight individual bytes.  In other words, it regards the Data1 block as an Int32, the Data2 and Data3 blocks as two Int16s, and the Data4 block as a Byte[8].  

What does this mean for us?  Well, the main issue has to do with byte ordering again.  Since .NET thinks it's dealing with numeric values, we have to compensate on little-endian systems -- BUT! -- only for applications that will be converting the GUID value to a string, and have the timestamp portion at the beginning of the GUID (the ones with the timestamp portion at the end don't have anything important in the "numeric" parts of the GUID, so we don't have to do anything with them).  

This is the reason I mentioned above for distinguishing between GUIDs that will be stored as strings and GUIDs that will be stored as binary data.  For databases that store them as strings, ORM frameworks and applications will probably want to use the ToString() method to generate SQL INSERT statements, meaning we have to correct for the endianness issue.  For databases that store them as binary data, they'll probably use Guid.ToByteArray() to generate the string for INSERTs, meaning no correction is necessary.  So, we have one last thing to add: 

C#
if (guidType == SequentialGuidType.SequentialAsString && 
    BitConverter.IsLittleEndian)
{
  Array.Reverse(guidBytes, 0, 4);
  Array.Reverse(guidBytes, 4, 2);
} 

Now we're done, and we can use our byte array to construct and return a GUID:

C#
return new Guid(guidBytes);

Using the code    

To use our method, we first have to determine which type of GUID is best for our database and any ORM framework we're using.  Here's a quick rule of thumb for some common database types, although these might vary depending on the details of your application:

Database GUID Column SequentialGuidType Value 
Microsoft SQL Server uniqueidentifier SequentialAtEnd
MySQL char(36) SequentialAsString 
Oracle raw(16) SequentialAsBinary 
PostgreSQL uuid SequentialAsString 
SQLite  varies  varies 

(For SQLite databases, there is no native GUID column type, but there are extensions that mimic the functionality.  However, GUID values may be stored internally as either 16 bytes of binary data, or 36 characters of text, depending on the value of the BinaryGUID parameter passed in the connection string, so there's no "one-size-fits-all" answer.)

Here are a few examples generated by our new method. 

First, NewSequentialGuid(SequentialGuidType.SequentialAsString):   

39babcb4-e446-4ed5-4012-2e27653a9d13
39babcb4-e447-ae68-4a32-19eb8d91765d
39babcb4-e44a-6c41-0fb4-21edd4697f43
39babcb4-e44d-51d2-c4b0-7d8489691c70

As you can see, the first six bytes (the first two blocks) are in sequential order, and the remainder is random.  Inserting these values into a database that stores GUIDs as strings (such as MySQL) should provide a performance gain over non-sequential values.

Next, NewSequentialGuid(SequentialGuidType.SequentialAtEnd):   

a47ec5e3-8d62-4cc1-e132-39babcb4e47a
939aa853-5dc9-4542-0064-39babcb4e47c
7c06fdf6-dca2-4a1a-c3d7-39babcb4e47d
c21a4d6f-407e-48cf-656c-39babcb4e480 

As we'd expect, the last six bytes are sequential, and the rest is random.  I have no idea why SQL Server orders uniqueidentifier indexes this way, but it does, and this should work well.

And finally, NewSequentialGuid(SequentialGuidType.SequentialAsBinary):

b4bcba39-58eb-47ce-8890-71e7867d67a5
b4bcba39-5aeb-42a0-0b11-db83dd3c635b
b4bcba39-6aeb-4129-a9a5-a500aac0c5cd
b4bcba39-6ceb-494d-a978-c29cef95d37f 

When viewed here in the format ToString() would output, we can see that something looks wrong.  The first two blocks are "jumbled" due to having all their bytes reversed (this is due to the endianness issue discussed earlier).  If we were to insert these values into a text field (like they would be under MySQL), the performance would not be ideal. 

However, this problem is appearing because the four values in that list were generated using the ToString() method.  Suppose instead that the same four GUIDs were converted into a hex string using the array returned by Guid.ToByteArray()

39babcb4eb5847ce889071e7867d67a5
39babcb4eb5a42a00b11db83dd3c635b
39babcb4eb6a4129a9a5a500aac0c5cd
39babcb4eb6c494da978c29cef95d37f 

This is how an ORM framework would most likely generate INSERT statements for an Oracle database, for example, and you can see that, when formatted this way, the sequence is again visible.  

So, to recap, we now have a method that can generate sequential GUID values for any of three different database types: those that store as strings (MySQL, sometimes SQLite), those that store as binary data (Oracle, PostgreSQL), and Microsoft SQL Server, which has its own bizarre storage scheme. 

We could customize our method further, having it auto-detect the database type based on a value in the application settings, or we could create an overload that would accept the DbConnection we're using and determine the correct type from that, but that would depend on the details of the application and any ORM framework being used.  Call it homework! 

Benchmarks 

For testing, I focused on four common database systems: Microsoft SQL Server 2008, MySQL 5.5, Oracle XE 11.2, and PostgreSQL 9.1, all running under Windows 7 on my desktop.  (If someone would like to run tests on more database types or under other operating systems, I'd be happy to help any way I can!) 

Tests were performed by using each database system's command-line tool to insert 2 millions rows into a table with a GUID primary key and a 100-character text field.  One test was performed using each of the three methods described above, with a fourth test using the Guid.NewGuid() method as a control.   For comparison, I also ran a fifth test inserting 2 million rows into a similar table with an integer primary key.  The time (in seconds) to complete the inserts was recorded after the first million rows, and then again after the second million rows.  Here are the results: 

 

For SQL Server, we would expect the SequentialAtEnd method to work best (since it was added especially for SQL Server), and things are looking good: GUID inserts using that method are only 8.4% slower than an integer primary key -- definitely acceptable.  This represents a 75% improvement over the performance of a random GUID.  You can also see that the SequentialAsBinary and SequentialAsString methods provide only a small benefit over a random GUID, also as we would expect.  Another important indicator is that, for random GUIDs, the second millions inserts took longer than the first million, which is consistent with a lot of page-shuffling to maintain the clustered index as more rows get added in the middle, whereas for the SequentialAtEnd method, the second million took nearly the same amount of time as the first, indicating that new rows were simply being appended to the end of the table.  So far, so good. 

 

As you can see, MySQL had very poor performance with non-sequential GUIDs -- so poor that I had to cut off the top of the chart to make the other bars readable (the second million rows took over half again as long as the first million).  However, performance with the SequentialAsString method was almost identical to an integer primary key, which is what we'd expect since GUIDs are typically stored as char(36) fields in MySQL.  Performance with the SequentialAsBinary method was also similar, probably due to the fact that, even with incorrect byte order, the values are "sort of" sequential, as a whole. 

 

  

Oracle is harder to get a handle on.  Storing GUIDs as raw(16) columns, we would expect the SequentialAsBinary method to be the fastest, and it is, but even random GUIDs weren't too much slower than integers.  Moreover, sequential GUID inserts were faster than integer inserts, which is hard to accept.  While sequential GUIDs did produce measurable improvement in these benchmarks, I have to wonder if the weirdness here is due to my inexperience with writing good batch inserts for Oracle.  If anyone else would like to take a stab at it, please let me know! 

 

And finally, PostgreSQL.  Like Oracle, performance wasn't horrible even with random GUIDs, but the difference for sequential GUIDs was somewhat more pronounced.  As expected, the SequentialAsString method was fastest, taking only 7.8% longer than an integer primary key, and nearly twice as fast as a random GUID. 

Additional Notes  

There are a few other things to take into consideration.  A lot of emphasis has been placed on the performance of inserting sequential GUIDs, but what about the performance of creating them?  How long does it take to generate a sequential GUID, compared to Guid.NewGuid()?  Well, it's definitely slower: on my system, I could generate a million random GUIDs in 140 milliseconds, but sequential GUIDs took 2800 milliseconds -- twenty times slower. 

Some quick tests showed that the lion's share of that slowness is due to the use of RNGCryptoServiceProvider to generate our random data; switching to System.Random brought the result down to about 400 milliseconds.  I still don't recommend doing this, however, since System.Random remains problematic for these purposes.  However, it may be possible to use an alternate algorithm that's both faster and acceptably strong -- I don't know much about random number generators, frankly. 

Is the slower creation a concern?  Personally, I find it acceptable.  Unless your application involves very frequent inserts (in which case a GUID key may not be ideal for other reasons), the cost of occasional GUID creation will pale in comparison to the benefits of faster database operations. 

Another concern: replacing six bytes of the GUID with a timestamp means only ten bytes are left for random data.  Does this jeopardize uniqueness?  Well, it depends on the circumstances.  Including a timestamp means that any two GUIDs created more than a few milliseconds apart are guaranteed to be unique -- a promise that a completely random GUID (such as those returned by Guid.NewGuid()) can't make.  But what about GUIDs created very close together?  Well, ten bytes of cryptographically strong randomness means 280, or 1,208,925,819,614,629,174,706,176 possible combinations.  The probability of two GUIDs generated within a handful of milliseconds having the same random component is probably insignificant compared to, say, the odds of the database server and all its backups being destroyed by simultaneous wild pig attacks. 

One last issue is that the GUIDs generated here aren't technically compliant with the formats specified in RFC 4122 -- they lack the version number that usually occupies bits 48 through 51, for example.  I don't personally think it's a big deal; I don't know of any databases that actually care about the internal structure of a GUID, and omitting the version block gives us an extra four bits of randomness.  However, we could easily add it back if desired.

Final Code  

Here's the complete version of the method.  Some small changes have been made to the code given above (such as abstracting the random number generator out to a static instance, and refactoring the switch() block a bit):

C#
using System;
using System.Security.Cryptography;

public enum SequentialGuidType
{
  SequentialAsString,
  SequentialAsBinary,
  SequentialAtEnd
} 

public static class SequentialGuidGenerator
{
  private static readonly RNGCryptoServiceProvider _rng = new RNGCryptoServiceProvider();

  public static Guid NewSequentialGuid(SequentialGuidType guidType)
  {
    byte[] randomBytes = new byte[10];
    _rng.GetBytes(randomBytes);

    long timestamp = DateTime.UtcNow.Ticks / 10000L;
    byte[] timestampBytes = BitConverter.GetBytes(timestamp);

    if (BitConverter.IsLittleEndian)
    {
      Array.Reverse(timestampBytes);
    }

    byte[] guidBytes = new byte[16];

    switch (guidType)
    {
      case SequentialGuidType.SequentialAsString:
      case SequentialGuidType.SequentialAsBinary:
        Buffer.BlockCopy(timestampBytes, 2, guidBytes, 0, 6);
        Buffer.BlockCopy(randomBytes, 0, guidBytes, 6, 10);

        // If formatting as a string, we have to reverse the order
        // of the Data1 and Data2 blocks on little-endian systems.
        if (guidType == SequentialGuidType.SequentialAsString && BitConverter.IsLittleEndian)
        {
          Array.Reverse(guidBytes, 0, 4);
          Array.Reverse(guidBytes, 4, 2);
        }
        break;

      case SequentialGuidType.SequentialAtEnd:
        Buffer.BlockCopy(randomBytes, 0, guidBytes, 0, 10);
        Buffer.BlockCopy(timestampBytes, 2, guidBytes, 10, 6);
        break;
    }

    return new Guid(guidBytes);
  }
} 

Final code and a demo project can be found at https://github.com/jhtodd/SequentialGuid[^]

Conclusion  

As I mentioned at the beginning, the general COMB approach has been used fairly heavily by various frameworks, and the general concept isn't particularly new, and certainly not original to me.  My goal here was to illustrate the ways in which the approach has to be adapted to fit different database types, as well as to provide benchmark information underscoring the need for a tailored approach. 

With a little effort and a moderate amount of testing, it's possible to implement a consistent way of generating sequential GUIDs that can easily be used as high-performance primary keys under pretty much any database system.

History 

v1 - Wrote the thing!

v2 - Adopted Rich Andersen's suggestion of using DateTime.UtcNow instead of DateTime.Now, for improved performance and updated code formatting.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)