Introduction
In regards to database application design authority there are three basic situations:
- You own both application and database.
- You are developing for an application with negotiable database changes.
- You are developing for an application with a non-negotiable database.
This article is geared towards scenario #1. I know that there's a lot of overlap between all three scenarios, but there are definitely differences. You could be working with a "DBA" who declares unique constraints are the same thing as primary keys therefore breaking your ORM tool in turn causing you to have to make drastic adjustments to code (yes that has happened to me).
When desiging a database application the database and model design should go hand in hand. It's not that these two, seperate pieces should be coupled but if you design them with consideration for each other the overall quality of design and ease of development on both sides improves.
The SOLID programmer in you might be questioning this, so to further this point think about the converse... Think about what can happen when the model and database are extremely disjointed... Yeah, things can go bad quick...
The article is going to be covering several of the common design considerations of entities and their respective models. The goal is to present common approaches to solve some of those fundamental considerations.
Background
If you're not familiar with Martin Fowler go research his work a bit... Martin Fowler is one of the largest contributers to modern architecture, particularly the MV* patterns, as well as modern software development practices in general. As a specific example of his contributions, Martin Fowler was one of the original signators of the Agile Manifesto.
Martin Fowler also wrote Practices of Enterprise Application Architecture (PEAA) back in 2002. It's largely based on DDD which he's not the orginator of, but the patterns and practices he covers in the book on how to implement DDD contributely directly to many of the latest industry patterns. For example, DDD covers the Repository object type and Martin Fowler's implementations helped evolve the modern Repository Pattern.
If you're not familiar with DDD it would be good to spend some time spinning up on it before you read the article too. I'm going to cover a bit of it in the article, but understanding the parts I don't cover will aid in understanding the material.
The design recommendations in the article are founded in the recommendations of PEAA and DDD, but they're geared towards MVVM and specific database implementation details. Both PEAA and DDD were written well before the MVVM and IOC boom. That being said it does cause some slight divergence from PEAA and DDD in the article's recommendations.
To SQL, or to NoSQL?
When PEAA was written enterprise application databases weren't limited to, but pretty much between, MySQL, Oracle, SQL Server which are all relational databases. Most of the literature is going to be based on horizontally oriented database concepts; however, databases have expanded greatly since with both big data and NoSQL.
NOTE: I consider Informix and DB2 to be red-headed stepchildren.
I can't really speak to big data considerations because that's a realm I'm not involved in, but SQL or NoSQL still have the same essential design considerations when coordinating with models. Schema or not, in both cases you're still deciding what goes into an entity. Schema or not, in both cases you're still going to have to map the database model to/from the business model. Schema or not, how you coordinate them can influence the ease or difficulty of your work.
Schema-less doesn't apply to just NoSQL either. Even in relational databases you can create a vertical, a.k.a. Entity-Attribute-Value, database. It creates similar levels of flexibility in what an entity contains without a discrete column to represent a field in that entity.
Schema-less databases are typically harder to work with from the MVVM perspective. Basically the more that your database model diverges from the business model the more work there is to do mapping them together. Additionally those mappers will have to handle all the various conditions that could arise from that flexibility.
If you're going to pick a schema-less database just make sure that the payoff is there... If you don't have a really solid reason not to use a relational database then just default to a relational database.
Primary Keys
One of the first decisions designing a table is to pick the primary key. For some database 101 book knowledge there are three classifications of candidate primary keys:
- Natural
- Composite
- Surrogate
The default recommendation for the modern application is going to be surrogate keys. A surrogate key is optimal for ORM tools and database security. While the majority of tables don't need security around the primary key, it's also a choice that leads to consistency throughout design - a surrogate key should always be able to be created. It also shouldn't need to change or be modified once it's assigned which could happen using natural data.
Whatever you do, don't use composite keys. On top of just being more difficult to work with on both sides, some ORM tools handle composite keys very poorly.
So now that we're settled on surrogate primary keys being the best, the next decision is going to be the data type of the surrogate key. There's two basic options:
Unless there are extreme performance considerations that need to be met, you should default to a Guid
because:
- A
Guid
naturally supports concurrent inserts. - A
Guid
naturally supports uniquely identifying any entity.
For those dealing with concurrency issues this natural support is huge. Coordinating seed/key values for numerics is a royal pain in the ass at times and potentially causes additional delays and contention on the application side. Use a Guid
- BAM done.
If you're working on a really small application without any concurrency issues then a Guid
can be a little overkill; however, in that case you probably don't have to worry about performance from an otherwise well designed database. Still default to a Guid
.
Regarding unique identification, if I want to check if my model object is equal to another model object I just have to compare the primary key if it's a Guid
. For applications that may have collections of different models this is really helpful. If you're using a numeric value you've got to write additional code because you can obviously have duplicate keys between entities. Not the biggest of issues to overcome there, but I prefer not having to do that.
For very large tables the Guid
can reduce performance. It's going to reduce the amount of records a single page can hold therefore reducing the performance of the index. In cases of a table with millions of records under high contention a seeded long
could definitely be the way to go.
The exception on when to default to a Guid
is when you're creating a small reference table. In that case definitely use a numeric primary key.
Whenever using a numeric value for a primary key just make sure that you pick a size that is respective to the amount of records you'll be creating in the table. Use a smallint
for enum
equivalents, etc. Doing so will help above mentioned page sizes and performance.
Models
In DDD there are three types of models:
- Entity
- Aggregate
- Value Object
Entity objects are often referred to as domain level objects or top level objects. Aggregates are collections of object that have a common root Entity and don't really make a whole lot of sense outside of that Entity. Value Objects are immutable objects such as enumeration values.
NOTE: The terminology can be a little confusing because entity is also a general term for class and/or database table. I'll try to make this distinguishable by using a capitalized Entity when referring to the DDD Entity.
When designing a model an Entity should never contain another Entity. The Entity should contain only references to the other Entity. If a mutable Entity was stored in multiple models then each model containing the other Entity would need to update/maintain that Entity. That could require one to thousands of objects to synchronize their updates!
An Entity can and should contain Aggregate objects. Aggreates are a collection of objects with a common root. On top of Aggreates not making a lot of sense outside of the Entity, it helps coerce database optimization through deep reads. If the Entity had to individually retrieve it's Aggregate objects it could easily, and unnecessarily, inflate database round trips.
Below are examples of Entities and Aggregates:
public class Recipe
{
public Guid RecipeId { get; set; }
public string Name { get; set; }
public List<RecipeStep> RecipeSteps { get; set; }
}
public class RecipeStep
{
public Guid RecipeStepId { get; set; }
public Guid RecipeId { get; set; }
public Guid UtensilId { get; set; }
public int Order { get; set; }
public string Name { get; set; }
public string Instructions { get; set; }
}
public class Utensil
{
public Guid UtensilId { get; set; }
public string Name { get; set; }
}
In the example of Aggregates, what business value does having an individual RecipeStep
provide? None really... You're never going to execute a RecipeStep
outside of the context of that Recipe
.
In the case of a Utensil
though, you might have a Kitchen
object that stores all the Utensil
objects you have available. The application could allow a user to manage what Utensil
s they have in their Kitchen
. That way the application could filter the Recipe
s that you see based on what you have or highlight the Utensil
s you need to get or substitute. So in the case of the Utensil
, it could be referenced in multiple objects outside the scope of the RecipeStep
.
I want to take a moment to distinguish between an Aggregate and an Entity in a collection. While a Kitchen
could have a collection of Utensil
s, a Utensil
is not an Aggregate. It's very important in this case to model the Kitchen
as such:
public class Kitchen
{
public Guid KitchenId { get; set; }
public List<Guid> Utensils { get; set; }
}
The big deal about this distinction is that since both the RecipeStep
and Kitchen
have a reference to the Utensil
they would both be responsible for maintaining that object and that's not good. It might not seem like a big deal at first, but after an application continues to grow and two references becomes ten it can severely hinder development.
Entities can also contain Value Objects, but as mentioned Value Objects aren't mutable so they don't need to be updated or maintained which is the big problem with Entities containing other Entities. Value Objects are most closely related to reference tables in the database and there are two basic ways to implement them in the model as shown below:
NOTE: Hopefully it goes without saying (although I'm saying it now) that in the real world you would pick one implementation - I'm well aware that the enum
and class
have a conflicting name.
public class Utensil
{
public Guid UtensilId { get; set; }
public UtensilType UtensilType { get; set; }
public int UtensilTypeId { get; set; }
public string Name { get; set; }
}
public enum UtensilType
{
General = 0,
Mixer = 1,
Turner = 2,
Cutter = 3
}
public class UtensilType
{
public int UtensilTypeId { get; set; }
public string Name { get; set; }
}
Typically speaking the enum
implementation is going to be optimal for simple, relatively static enumerations. In the case of a Utensil
it should be relatively static, but not necessarily simple. How do you describe a spork? How about a garlic press? A better example of a simple and static enumeration is car body types. The car industry occassional comes up with new body types (i.e. crossover) but it's a simple classification that doesn't change often.
Typically speaking the class
implementation is going to be optimal fo complex or changing enumerations. The class
allows you to associate metadata with the enumeration value. You can also dynamically pull the values from a repository which won't require code changes if the enumerations change. But how do you declare a default? How do you restrict values in the application?
In the case of a UtensilType
I don't think there's a definitive best way, but each approach most certainly has clear pros and cons. Pick the best approach per Value Object.
In both cases the important distinction between Value Objects and Entities is they are immutable. A mixer will never become a cutter, a hatchback will never become a convertible. Now you could make a shank from a KitchenAid mixing bowl I bet, but that's changing the mixer not the type. Just like with a car you can saw off the top and make it a convertible, but a hatchback is still a hatchback and a convertible is still a convertible.
In all these approaches to designing models regardless of what DDD type, the models are basically DTOs. It's recommended models be at least close to POCOs (or POJOs) because:
- Models are shared in common assemblies and should not be technology dependent.
- Models are often transferred needing to be serializable and smaller in size.
If we look at all the model examples there aren't any objects from an ORM, there aren't any attached events, there aren't any dependencies on a container, etc. That makes the model as widely distributable as possible.
The wide distribution and implementation guideline requires the database model and business model be different classes. It's good overall for many reasons including seperation of concerns and optimization, but if the database model was used as the business model then consumers of those models would be forced to include the database/ORM libraries as well making sharing models technology dependent.
What if you're using an different version of the ORM? What happens if you're working on a platform the model technologies didn't support? These issues could render a model library incompatible.
Events and Services in DDD are the objects that will control the behavior of the models and handle interactions between models. As an important note, Services in DDD are not necessarily external services such as a RESTful API. They include internal objects acting on models in the business layer as well. By leaving these functions out of the model we can reduce and/or eliminate technology dependencies.
NOTE: There is also the Factory object type but I'm purposely leaving it out because the modern application favors IOC and composition.
Auditing
A very common requirement for enterprise applications is to be able to produce audit trails of changes. Many developers will put audits/history in the same table/model as the original source; however, I highly encourage that historical records are kept in seperate models and seperate tables.
Benefits of this approach:
- You can add audit/history as needed without impacting the function of the current table.
- It simplifies application logic.
- Database performance improves for working with current data.
- You can minimize database memory by recording only the fields that can be edited.
Let's dig into these a little bit...
If I have a table designed as a current table then later decide to add history, if my history is completely seperated then no big deal - my current application logic stays the same. If the decision was made to include history I would have to refactor the existing logic to disregard the history when looking for the current entry.
Particularly for live and volatile data applications, current entries are usually few to the many compared to history. In those cases having millions of records compared to as little as hundreds can put a large, unnecessary strain on the database.
There are actually two levels of audits/history - editing the current record and editing a historical record. For any given model you have to decide what you need obviously, but following suite with the above recommendation each level is it's own table/model.
I will refer to the first level as historical and the second level as audit. Reason being is that historical records may need to be maintained just from system changes alone - not necessarily requiring a full audit. Auditing implies the historical record can be manipulated and therefore you need to track that as well. In my head anyway, auditing is about a user changing historical record and not normal data flow from the system.
With that being said, in order for this approach to work you have to create a historical record every time you add or update a current record. Similarly you have to create an audit record every time a historical record is updated if you're auditing the historical record.
Below is what this looks like in the Model:
public abstract class EditableEntity
{
public DateTime CreatedDateTime { get; set; }
public DateTime? UpdatedDateTime { get; set; }
public Guid CreatedBy { get; set; }
public Guid? UpdatedBy { get; set; }
public string Notes { get; set; }
public EditableEntity()
{
CreatedBy = User.SystemUser.UserId;
CreatedDateTime = DateTime.UtcNow;
}
}
public abstract class HistoryEntity : EditableEntity
{
public HistoryEntity(EditableEntity editableEntity, Guid userId)
: this()
{
CreatedDateTime = editableEntity.CreatedDateTime;
CreatedBy = editableEntity.CreatedBy;
UpdatedDateTime = editableEntity.UpdatedDateTime;
UpdatedBy = editableEntity.UpdatedBy;
Notes = editableEntity.Notes;
}
}
public abstract class HistoryAuditEntity
{
public DateTime UpdatedDateTime { get; set; }
public Guid UpdatedBy { get; set; }
public string Notes { get; set; }
public HistoryAuditEntity(HistoryEntity historyEntity)
: this()
{
UpdatedBy = historyEntity.UpdatedBy.Value;
UpdatedDateTime = historyEntity.UpdatedDateTime.Value;
Notes = historyEntity.Notes;
}
}
Quick comments about this model structure:
- The database would reflect the same structure.
- Whether or not update is nullable or set to the created info could be argued either way.
- I include
Notes
because in general when people change something you want to allow them to provide details about the change.
When you're auditing you need to be able to associate a User
to a change. You may have noticed that the primary key is once again a Guid
and not a SID or string
... Because there can be multiple and/or changing sources for users I highly recommend that your application store users in it's database. You can put whatever external references you want in the entity, but Guid
s in your own database guarentees uniqueness and application consistency.
In the code example the SystemUser
is a static class property with a default, randomly assigned Guid
. I highly discourage using Guid
.Empty as a default or the SystemUser
.UserId because you can be erroneously assigning the SystemUser
when it was actually a missed user assignment/bug.
We're about to get into the Repository Pattern, but I've got to briefly mention it here first... How history and audit records are controlled should be contained within the Repository implementation. Think about it... If it's not contained within the Repository then anywhere calling the Repository then must:
- Be aware if a model has/requires a historical and/or audit entry.
- Be aware how to coordinate their update (may include transactions, etc).
- Create it's own code to create and insert the historical and/or audit entry if needed.
It turns into a big violation of GRASP if the Repository is not handling it...
Repository Pattern
The repository pattern starts by implementing an interface
with the desired operations (e.g. CRUD) returning the common model objects. The interface
should be declared at the same level as the models so it can be shared as well.
NOTE: It's common for repository interface
s and models to be declared in the same assembly.
Absolutely do not put your implemenation of the Repository in the same assembly. Again it might seem like "common code" but it's going to create a technology dependency. Additionally there are numerous cases where custom repository work is required so it should be seperated.
From a modular approach my preferences is to have the Repository implementation in it's own library. That way if there is a need for multiple implementations the consumer can just pick which one it needs without having to worry about pulling everything out or refactoring.
NOTE: Putting the interface and implementation in the same library defeats much of the purpose of IOC - don't do it.
Outside of CRUD, the biggest responsibility of the Repository is mapping business models to/from the database models. There are plenty of tools you can use to do automatic mapping (i.e. AutoMapper); however, I'm going to say right now I strongly discourage using such tools for the Repository.
Pros of a mapper:
Cons of mapper:
- Less efficient.
- Potentially many issues needing custom mapping overrides.
I used to default to using mappers, but there is a lot of pain when a database design change causes an unseen/undetected conflict. As such I have started defaulting to writing my own mapping code. It's tedious yes, but after the initial investment it's tends to be safer. That being said I still use mapping tools for the rest of the application.
As mentioned another responsibility of the Repository is going to be handling historical and auditable entries. In those cases the implementation logic should follow the steps below:
- Query for existing record
- If record exists, check created/updated timestamp.
- If it's a new or updated value add/update current record.
- Create a historical record.
NOTE: If you're using a transactional database these should obviously be covered in the same transaction.
An important part to discuss is step #2. I know it doesn't cover all enterprise applications, but the enterprise applications I have dealt with have a lot of live data from a lot of external sources. That means there's a potential to receive stale data.
Take for example an application reporting stock prices... The value is could be reported via a UDP transport with potentially millisecond frequency changes. It's possible the application can recieve an old value reported after the latest value is recieved. If my application wants to record the historical value (yes I know this information is recorded in many available sources but bear with me for the example) I don't want to automatically assume whatever I receive is actually the latest and update the current record. I need to check it and update it as described.
The big picture here is that logic to coordinate between entities in the database should reside within the Repository implementation. I know this isn't too far of a stretch from what most people implement in their data layer already, just want to make sure that's clear.
Storing Computational Data
When storing computational data there are a couple rules that make life a lot easier:
- For any give unit type (i.e. percentage, temperature, pressure, weight) store the same unit accross the entire database.
- When storing a
DateTime
always store it in UTC.
Depending on what data sources your work is dealing with, saving data to the database from the model can require a good amount of work converting; however, the alternative can be chaos. This weight in this field is in kilograms, this weight in this field is in pounds, okay before I add two weight fields I need to convert them appropriate... NO! Just... NO!
You might think that the units your company is working with are distinct enough (let's say kg and mg) where you'll "never have an issue", but trust me... It can really bite you in the ass when years down the road you realize you have outliers and there's no good way to tell them apart. You have no idea if it caused any issues and you have no way to tell if it's impacting anyone right now. Sounds fun right?
It's a similar train of thought by always storing UTC. Your model and queries will never need to try to figure out what it is... Query logic and business logic can work seamlessly together...
But why not store everything in local time even if all your data is viewed in local time?
First and foremost local time that applies daylight savings time (DST) isn't unique. The potential for conflict exists and once again like with units, once you're far enough removed from when the data was created it may be impossible to find out.
The other big issue is that local time is in respect to wherever your servers and/or database resides. What happens if they move timezones? What happens if they span different timezones? What happens if you get users in multiple timezones?
I'll tell you what happens... It becomes a mess, then everyone is trying to figure out how to migrate the data and refactor to use UTC.
If all computational data is consistent then database queries and business logic can freely work with data from all sources. There may be standard units that the user expects the data to be in, but it's really easy to convert when you're producing a report or displaying in the UI.
The problem with this approach is that there is nothing you can do on the database side or the model side to prevent data from being stored in alternative units. It's got to be established and agreed on both sides the chosen unit will be maintained, but again a key point on the importance of coordinating the database and model.
You might jump in at this point and make the esteemed suggestion of storing a units field. I do agree this is an alternative, but let's discuss what needs to be done to make it work...
The premise of a Unit
table is that any record of data can store an additional field to designate what Unit
the data is in. The immediate issue is it can double the amount of fields you have to work with for a record of data. Due to that consideration the approach can decrease productivity.
All work will potentially have to run conversions between records and particularly with high volume or high frequency data conversion costs start to really add up. Additionally it's increasing the size of any data sent over the wire too. Due to these two considerations it doesn't perform as well.
If you think this approach is a fail safe against accidently storing the wrong unit in a field... Well... Not that either... First off this is trusting the data is accurately represented by the unit. Secondly this is trusting that work updating the data is updating in the same unit or synchronizing the unit with the new value.
Weighing each approach in my opinion there's no real added value of stroring a Unit
compared to having a default unit. I'm not going to say that storing a Unit
is an anti-pattern, but it definitely tends to be a lot more work then using defaults. The latter might be you're only option though if there are multiple teams that refuse to coordinate with each other on the issue.
In closing I want to say this is an area that I have seen wreck applications and productivity - treat it with high priority. If the only thing preventing common units is a meeting, or taking ownership of an area, then just get it done.
Summary
As a fun note I wrote this twice because I accidently discarded my draft... HA!
As such I'm going to through this article out there at present time without a sufficient summary.
If there's any major discussion points or requests I'll update the article accordingly.
Please share your thoughts, comments, and insults!
History
- 2017-05-25 Original release.
- 2017-05-25 Adding unit topic (can't believe I forgot that).