Introduction
Throughout my travels in the IT industry, I have seen many database designs, and it never fails to amaze me how many systems simply ignore the fundamentals of good design. I am not talking about poor designs from an academic point of view; these sub-standard designs don't work because they are a performance, programming, scalability, and maintenance nightmare. One of the problems I regularly see is an improper key design in a table. In this article, I will discuss each of the crimes perpetrated against proper key design and each law that was broken. Names have been changed to protect the innocent.
The First Law: The primary key is a unique identifier for each row in a table.
The Crime: Guilty by Absence
The worst key crime is not adding a key to a table at all. What is amazing is that sometimes this actually works, but I can predict with almost 100% certainty that it will fail at some point. Although it may not be efficient, it is possible to fetch the specific record you are looking for without a key defined. The same cannot be said for updating a record. In the old days of OLEDB and ADO, when the database driver could not get any key information concerning the Recordset, it would try to create its own key. How did it do this? A WHERE clause was generated with a condition against every field in the Recordset, on the assumption that there were no duplicate records in the table. Under this condition, you had no idea what data was actually affected. Perhaps there were valid duplicate records, so a single-row update could change more than one record. Or perhaps a value had changed since the last read, so the WHERE clause filtered out every record and nothing was updated at all. This was the most inefficient, error-prone, bug-ridden method you could devise. Fortunately, this “feature” was removed in ADO.NET: you cannot update data without a key defined.
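To make this concrete, here is a sketch of the kind of statement those old drivers effectively generated (the table and column names are illustrative, not from any real system):

    -- Without a key: the driver matches every field, so duplicate rows
    -- or a value changed by another user silently break the update.
    UPDATE Customer
    SET    Phone = '555-0199'
    WHERE  FirstName = 'Jane'
      AND  LastName  = 'Smith'
      AND  Phone     = '555-0100'
      AND  City      = 'Springfield';

    -- With a primary key: exactly one row is targeted.
    UPDATE Customer
    SET    Phone = '555-0199'
    WHERE  CustomerID = 42;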
The Second Law: The primary key should be immutable.
The Crime: Guilty by Change
Here is one quick fix to the unique primary key problem that is not recommended. If you want to read or write a specific row, that row needs a unique identifier that guarantees you are pointing at the record you want. SQL Server has a handy datatype that seems to suit this purpose quite well: the TIMESTAMP datatype provides a unique, sequential number. Despite its name, it does not really represent a date or time; it is simply an 8-byte number that gets larger as time goes on. Why not stick a TIMESTAMP field at the end of the table, define it as the primary key, and be done with it? There are several serious problems with this approach.
The TIMESTAMP datatype was designed to control concurrency on data reads and writes. Every time any field in the row is updated, this column is bumped to the current server timestamp value. If a table has a TIMESTAMP field, the ADO.NET driver can check this value to see whether the underlying data has changed since it was last read, and raise an error if a conflicting update is attempted; you can also use it yourself to detect stale data without having to fetch all the record data. Unfortunately, this behavior is the last thing you want a primary key to have.
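As a rough sketch of that intended use (table and column names are illustrative; rowversion is the newer name for the same datatype):

    -- Read the row and remember its TIMESTAMP value.
    DECLARE @OriginalVersion BINARY(8);
    SELECT @OriginalVersion = RowVer
    FROM   Customer
    WHERE  CustomerID = 42;

    -- Later, update only if nobody else has touched the row.
    UPDATE Customer
    SET    Phone = '555-0199'
    WHERE  CustomerID = 42
      AND  RowVer = @OriginalVersion;

    -- @@ROWCOUNT = 0 means the row changed since it was read.
    IF @@ROWCOUNT = 0
        RAISERROR('Row was modified by another user.', 16, 1);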
If you are using SQL Server as a relational database (and I hope you are), then you will have relations defined with primary and foreign keys. You would probably also have cascading updates set against the key value of the parent table. In larger systems, it is not uncommon to have dozens of relationships defined. If a TIMESTAMP were used as a primary key, every update to any field would change the TIMESTAMP, which would cause every related table to cascade the key update. The server may have to update many different tables and indexes to retain data integrity. This is a huge amount of extra work that is perfectly meaningless and unnecessary. It is not difficult to see how this approach can bring a server to its knees under moderate traffic.
Primary keys should not change once they are created; they should be immutable. This eliminates the cascading update problem and also prevents a couple of others. By specifying a clustered index on the primary key, the records are physically stored in the order of the index, making retrieval and updates much faster. If the key changes, the table becomes fragmented, slowing down performance. Using immutable keys is a huge consideration for high-performance databases.
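As a minimal sketch (names are illustrative), an immutable surrogate key looks like this: the identity value is assigned once by the server and never updated, and the clustered primary key keeps the rows physically ordered by it, so there is nothing to cascade.

    CREATE TABLE CustomerOrder
    (
        OrderID    INT IDENTITY(1, 1) NOT NULL,
        CustomerID INT NOT NULL,
        OrderDate  DATETIME NOT NULL,
        -- Clustered on the immutable surrogate key.
        CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (OrderID),
        -- No ON UPDATE CASCADE needed: the parent key never changes.
        CONSTRAINT FK_CustomerOrder_Customer FOREIGN KEY (CustomerID)
            REFERENCES Customer (CustomerID)
    );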
The Third Law: Keys are for the benefit of the computer.
The Crime: Guilty by Association
I understand that in the bad old days, bytes were expensive, and people tried to cram as much information as possible into as little space as possible (hence the year 2000 problem). Sometimes old habits are hard to break. We all know that all tables require keys, so why not use some of the data stored in the table as the key? People try to use a natural key based on data elements unique to the data. I have never seen this work properly.
Determining a unique key is difficult. At some point in time it may appear you have a good candidate for a key; later, as the system grows, you might realize that data changes cause your key to no longer be unique. Since the key is based on real data, it will most likely change over time, causing the previously mentioned problems because the key is not immutable. Finding a single field value that is unique can also be a problem, so multi-column keys are often used. These are a real pain to work with, since you must supply every key value just to retrieve a record, and every join to a related table has to repeat all of the key columns.
The argument I hear for using natural keys is that they are easier to read and have some meaning to them. This is precisely the reason natural keys shouldn't be used. I have seen all sorts of wacky schemes for generating keys. A typical scheme might generate a key value something like this:
    023409808
where 0234 is the branch number, 098 is a sequential transaction number, and 08 is the year. The problems with this are obvious. What if the branch or year changes? What if the company grows and there are more than 999 branches? What if two transactions are inserted at the same time and they both generate the same transaction number? Is this stored as a numeric value or a character datatype? This type of scheme is just asking for trouble. If you are ever tempted to use such a method, smack yourself in the head and quickly forget it. If you need some kind of unique, system-generated reference number, put it in a different column, but don't use it as your key. You will save yourself huge amounts of grief now and in the future.
Keys are not meant to be readable by humans; they are for the benefit of the computer and the developer. If you want to provide human-readable data, create views showing the information you want, but don't store the data that way.
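As an illustration (the table, view, and reference format here are my own invented example), store the facts as ordinary columns, keep an arbitrary identity key, and let a view compose the readable reference number:

    CREATE TABLE BranchSale
    (
        SaleID   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        BranchNo INT NOT NULL,
        SaleYear SMALLINT NOT NULL
    );
    GO

    -- The readable reference exists only in the presentation layer.
    CREATE VIEW BranchSaleDisplay
    AS
    SELECT SaleID,
           RIGHT('0000' + CAST(BranchNo AS VARCHAR(4)), 4) + '-'
           + CAST(SaleYear AS VARCHAR(4)) + '-'
           + CAST(SaleID AS VARCHAR(10)) AS ReferenceNumber
    FROM   BranchSale;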
An interesting side note: this scheme can support a maximum of 9,989,001 transactions per year for 99 years, for a total of 988,911,099 transactions. An arbitrary int identity key can support 2,147,483,647 transactions, and you can double this with a clever trick: start with an identity seed value of -2,147,483,648 and increment by one each time. The computer doesn't care if the key is negative; it's all just numbers. Assuming you would have stored the natural key scheme in an int field, using an arbitrary key allows over four times the transactions in the same storage space.
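A quick sketch of that seed trick (the table is illustrative):

    -- Start at the smallest int value to get the full
    -- 4,294,967,296-value range of the int datatype.
    CREATE TABLE TransactionLog
    (
        TransactionID INT IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,
        Amount        MONEY NOT NULL
    );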
The Fourth Law: Create efficient numeric keys.
The Crime: Guilty by Excess
Understand your data and make your keys the correct size. If the goofy example above were used as a key stored in a character datatype (because you want the leading zero to show up), it would consume 18 bytes in an nvarchar field. If you were half careful, it could be stored in a 9-byte varchar field. An int datatype could store it in 4 bytes quite easily. The int datatype is the best-performing type in SQL Server, so it is an excellent choice for a key column. The integer variants bigint and tinyint are also good choices; choose the size that best fits your data. Since we are no longer using a coding scheme or actual data to derive a key value, let the computer do all the work: declare the key as an identity column that automatically increments to create a unique, sequential value. SQL Server worries about ensuring unique values so you don't have to. Avoid the GUID (uniqueidentifier) datatype, since it is difficult to work with and is only necessary under certain replication scenarios.
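Here is a sizing sketch under assumed row counts (all names are illustrative):

    -- tinyint (1 byte, 0 to 255): tiny lookup tables.
    CREATE TABLE Region
    (
        RegionID   TINYINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        RegionName VARCHAR(50) NOT NULL
    );

    -- int (4 bytes, roughly 2.1 billion positive values): the usual choice.
    CREATE TABLE Invoice
    (
        InvoiceID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        RegionID  TINYINT NOT NULL REFERENCES Region (RegionID)
    );

    -- bigint (8 bytes): only when int could realistically run out.
    CREATE TABLE ClickEvent
    (
        ClickEventID BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY
    );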
The Laws of Key Design Summary
- Always add a primary key to all tables: A table without a key is not a table.
- Always use immutable keys: Set it once, don't change it.
- Let SQL Server do the work: Use identity columns.
- Size the int key datatype to the data: Know your data, size it to fit.
History
- 13th March, 2008: Initial post