This is the third part in my series on programming for performance. The first two parts are available here:
What Do You Mean, Normalization is Bad?
Database normalization is a good thing, right? The more normalized a database is, the more flexibility you have to extend it without changing the schema, and data integrity is easier to maintain. Or at least those are the usual assumptions. I'm by no means an expert in normalization, but I do know that, in general, normalization is good. I also know that you can have too much of a good thing. I used to work for a company that employed a couple of really great theoretical programmers. They were experts at designing the way a system should work; the problem was, they really had no idea how programs need to function in the real world. An even bigger problem was that they were the sons of the company's owner, so there was no telling them they were wrong, and everything had to be done their way, or else. This led to some very “interesting” situations.
We had a very large data warehousing application, used by some of the largest retail chains in the country (and actually, in the world). All of their product data was maintained in our system. The system was a basic Entity-Attribute-Value (EAV) system, with both a Windows and a web interface available for maintaining the data. Each product had any number of attributes associated with it, with values that accurately described the product. Problems started to arise when it was decided that users should be maintained in an EAV system as well. Each user account record had just three fields – userID, username and password – and everything else was stored as attribute-value pairs. Even that was a compromise: some of us had to fight to get the username and password added directly to the user record at all. Oh, and I forgot to mention that this user schema was in place not just for the users of the application itself, but also for the public-facing retailer websites driven by this data. That meant hundreds of thousands of users, not just the thousands who used the application. Hundreds of thousands of users, all trying to log in to the websites, and all requiring two lookups against an attribute-value table just to find out whether the username and password they entered were valid. We actually had to write a script to insert hundreds of thousands of users and then run several performance tests just to prove to these developers that their perfectly normalized database simply would not perform when it came to the simple task of logging in.
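To make the difference concrete, here is a rough sketch of the two approaches. All table, column, and parameter names are hypothetical (the real schema was different), but the shape of the queries is the point: the fully normalized design needs two joins against the attribute-value table just to validate a login, while putting the credentials on the user record makes it a single indexed lookup.

```sql
-- Hypothetical names throughout; a sketch, not the actual schema.
-- Fully "normalized" EAV version: two lookups against the attribute-value
-- table just to check a login.
SELECT u.UserID
FROM Users AS u
JOIN UserAttributeValues AS username_row
  ON username_row.UserID = u.UserID
 AND username_row.AttributeID = @UsernameAttributeID
 AND username_row.Value = @Username
JOIN UserAttributeValues AS password_row
  ON password_row.UserID = u.UserID
 AND password_row.AttributeID = @PasswordAttributeID
 AND password_row.Value = @Password;

-- Username and password stored on the user record itself: one indexed lookup.
SELECT UserID
FROM Users
WHERE Username = @Username
  AND Password = @Password;
```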
Joining to 10 Tables Just to Get the Correct Product Name is Not a Good Thing
At one point, the decision was made to internationalize the system, so that it could both store multi-lingual data and run the application itself in multiple languages. Storing the multi-lingual data really wasn't too hard – you just add a LanguageID column to the various attribute-value tables, indicating which language each particular value is for. Internationalizing the application itself, however, was another beast. There were literally thousands of potential attributes that a product could use to describe itself, and every one of them needed to be describable in multiple languages. The solution was to create an attribute-attribute-value table: each attribute had its own attribute-value collection, all with multi-lingual support. It was definitely an interesting solution, and in theory it should have worked great. The problem came down to the use of this table in the applications and in the retailer websites. Adding yet another layer of joins to an already bloated query for retrieving product data became a big problem.
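To give a feel for that extra layer, here is a hedged sketch, with hypothetical table and column names, of the kind of join chain needed just to show one product's attributes with localized labels. The real queries joined far more tables than this; the sketch only shows where the attribute-attribute-value table slots in.

```sql
-- All names here are hypothetical; the point is the extra join layer.
SELECT p.ProductID,
       label.Value AS AttributeLabel,   -- localized display name of the attribute itself
       pav.Value   AS AttributeValue    -- the product's value for that attribute
FROM Products AS p
JOIN ProductAttributeValues AS pav
  ON pav.ProductID = p.ProductID
 AND pav.LanguageID = @LanguageID        -- language-aware product data
JOIN Attributes AS a
  ON a.AttributeID = pav.AttributeID
JOIN AttributeAttributeValues AS label   -- the attribute-attribute-value table
  ON label.AttributeID = a.AttributeID
 AND label.MetaAttributeID = @DisplayNameAttributeID
 AND label.LanguageID = @LanguageID
WHERE p.ProductID = @ProductID;
```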
For the Windows application, it wasn't great, but it was manageable: the program would sometimes take several minutes to load, but once it did, the data was kept in memory and things ran fine after that. The web application and the retailer websites had exactly the problems you would expect from trying to load far too much data for a single page – they were both incredibly slow. For the web application, we ended up loading fewer products at a time and just dealing with it that way.
For the retailer websites, that still wasn't going to be good enough. If you want to keep a customer on your site, it needs to be fast – much faster than you can get away with in a custom application that your employees are forced to use. The key observation was that even though there were thousands of attributes available to describe the various products, only a fraction of those were ever used when displaying a product to a customer. I believe there were at most a couple hundred attributes the websites would ever display, and about a dozen that were used regularly. The solution we finally came up with was to denormalize the product data: the dozen common attributes became individual columns in a new table used just for the websites. The rest of the attributes remained normalized, but we created a copy of the attribute-value table containing only the attributes the websites could display, to keep the table small. Miraculously, the websites went from tortoise-slow to cheetah-fast (yes, very technical terms, I know).
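Here is a rough, T-SQL-flavored sketch of what that web-only copy looked like. The table names, column choices, and types are all hypothetical; the idea is simply that the regularly displayed attributes become real columns, and the rest live in a much smaller attribute-value table.

```sql
-- Hypothetical sketch of the denormalized, web-only tables.
-- The dozen or so attributes the sites display constantly become real columns:
CREATE TABLE WebProducts (
    ProductID        INT NOT NULL,
    LanguageID       INT NOT NULL,
    ProductName      NVARCHAR(200),
    Brand            NVARCHAR(100),
    ShortDescription NVARCHAR(500),
    Color            NVARCHAR(50),
    Size             NVARCHAR(50),
    -- ...and so on for the rest of the commonly displayed attributes
    PRIMARY KEY (ProductID, LanguageID)
);

-- The couple hundred attributes the sites might ever show stay in EAV form,
-- but in a trimmed copy of the attribute-value table:
CREATE TABLE WebProductAttributeValues (
    ProductID   INT NOT NULL,
    AttributeID INT NOT NULL,
    LanguageID  INT NOT NULL,
    Value       NVARCHAR(4000),
    PRIMARY KEY (ProductID, AttributeID, LanguageID)
);

-- Rendering a product page becomes a narrow, indexed lookup instead of a pile of joins:
SELECT ProductName, Brand, ShortDescription, Color, Size
FROM WebProducts
WHERE ProductID = @ProductID
  AND LanguageID = @LanguageID;
```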
Of course, this didn't sit well with our brilliant theoretical programmers. We couldn't denormalize data tables – that destroys the integrity of the data – even for a read-only purpose like displaying products on a retailer website. It didn't matter that we were copying everything to a separate database just for this purpose and had a nightly batch process to update the data as needed; it just wasn't allowed. We did manage to keep our scaled-down attribute-value tables, however, so at least the sites ran at an acceptable speed (probably rabbit-fast, to continue the analogy).
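For what it's worth, the nightly refresh was conceptually simple. A sketch, again with hypothetical names and assuming the tables from the previous sketch: rebuild the read-only web copy from the normalized source by pivoting the common attributes into columns.

```sql
-- Hypothetical sketch of a nightly rebuild of the read-only web copy.
TRUNCATE TABLE WebProducts;

INSERT INTO WebProducts (ProductID, LanguageID, ProductName, Brand, ShortDescription)
SELECT pav.ProductID,
       pav.LanguageID,
       MAX(CASE WHEN pav.AttributeID = @NameAttributeID  THEN pav.Value END),
       MAX(CASE WHEN pav.AttributeID = @BrandAttributeID THEN pav.Value END),
       MAX(CASE WHEN pav.AttributeID = @DescAttributeID  THEN pav.Value END)
FROM ProductAttributeValues AS pav
GROUP BY pav.ProductID, pav.LanguageID;
```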
And what of these systems now? I'm not completely sure. I do know, however, that the retailer websites using this system have all been replaced in the years since I left the company. One of the reasons I heard from someone involved in the process on the retailer end: the sites just did not perform well enough.
Conclusion
I know that it is really cool (?) to have a 6th Normal Form database, but you really need to ask yourself: is this what is best for my users? Or would they be better served if you sacrificed a little bit of your data integrity for a large increase in application performance? There are certainly other ways to maintain data integrity within an application, but if a query of your data already takes 10 seconds to run, you are severely limiting what you can do to provide a high-performance application. I know that everyone wants to have the absolute best code out there, but sometimes you really do need to sacrifice internal code quality to an extent to meet a need. Is your code's integrity really worth losing potentially huge clients? For me, the answer is almost invariably no.