Data problems are often hard to notice when operating with SQL. With other languages like C, C#, Java, etc., information is typically investigated and modified item-by-item. In SQL, all the operations are set based, meaning that a single statement often deals with several rows. For example, a simple SELECT
statement investigates all the rows in involved tables, one way or another, and returns the desired results, hopefully...
Syntactical problems are easy to catch, you get an error. But what if your conditions are false, you don’t get an error but the results aren’t what were expected. When creating a program or modifying it, part of the testing is to ensure that the returned results are correct so this way we can deal with the initial situation. However, things get more complicated when the world around changes. What if the database structure is changed in the future in a way that would have an undesired effect on your existing SQL statements?
Let’s take an example. The following script creates an OrderItem
table. It’s not a gem what comes to database design but sufficient to pinpoint the problem we’re discussing of:
CREATE TABLE OrderItem (
OrderNumber int,
Product varchar(100),
RowType int CONSTRAINT ck_orderitem_rowtype
CHECK (RowType IN (1,2)),
Amount int,
ListPrice decimal
);
Product
is the name of the item to be sold, Amount
defines the quantity of items to be purchased and ListPrice
is the price of a single item. RowType
defines if this item is included in the order (value is one) or was it selected but then removed from the order (value of two).
Now, let’s add some data.
INSERT INTO OrderItem
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Squared tyre', 1, 4, 100.0);
INSERT INTO OrderItem
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Oil scent, 100 oz', 2, 1, 50.0);
INSERT INTO OrderItem
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Haircut', 1, 1, 20.0);
Now, let’s imagine that the application fetches a list of items in the order. In order to show the prices correctly, the SELECT
statement needs to take the RowType
into account, something like the following:
SELECT CASE oi.RowType
WHEN 1 THEN 'Included into order'
WHEN 2 THEN 'Excluded from order'
END As Status,
oi.Product,
oi.Amount,
CASE oi.RowType
WHEN 1 THEN oi.Amount * oi.ListPrice
WHEN 2 THEN 0
END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
oi.Product;
The results would be:
Status Product Amount Price
------------------- ----------------- ------ -----
Included into order Haircut 1 20
Included into order Squared tyre 4 400
Excluded from order Oil scent, 100 oz 1 0
Also, the application could fetch the items to be delivered, with the following query:
SELECT oi.Product,
oi.Amount,
oi.Amount * oi.ListPrice AS Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
AND oi.RowType = 1
ORDER BY oi.RowType,
oi.Product;
And the results would be:
Product Amount Price
------------ ------ -----
Haircut 1 20
Squared tyre 4 400
So everything is perfect, until…
One day, an extra row type is defined and added into the database. It was decided that customers buying lot of items were rewarded with a gift included in the purchase, naturally free of charge. The gift is a bunch of flowers which is also sold separately. Because of this, a new RowType 3
was introduced to distinguish free gift items from other items.
The following change was made to the database:
ALTER TABLE OrderItem
DROP CONSTRAINT ck_orderitem_rowtype;
ALTER TABLE OrderItem
ADD CONSTRAINT ck_orderitem_rowtype
CHECK (RowType IN (1,2,3));
The order 1
we used as an example had such a big value that it would get a gift so let’s add it:
INSERT INTO OrderItem
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Bunch of flowers', 3, 1, 5.0);
Now, let’s see what happens with our original queries. The first one would return:
Status Product Amount Price
------------------- ----------------- ------ -----
Included into order Haircut 1 20
Included into order Squared tyre 4 400
Excluded from order Oil scent, 100 oz 1 0
NULL Bunch of flowers 1 NULL
And the second one:
Product Amount Price
------------ ------ -----
Haircut 1 20
Squared tyre 4 400
The first one is more obvious you get NULL
values in the result set in places where they are not expected so if someone looks carefully, this may be noticed.
The second one is more dangerous, since the WHERE
clause restricted the rows only to known RowTypes
the gift isn’t included in the returned set of rows. Just by looking at the data, nothing looks suspicious so you need to know the data behind in order to spot the problem.
If you think about other languages, they have different kinds of mechanisms to handle known and unknown values. For example, a switch
structure has a default block that is fired for all values not listed in case
statements. This same idea can be applied to some extent to queries. However, in order to ensure that false data doesn’t pass through, we won’t do any default actions but deliberately generate errors.
Consider the following statement. It has a small modification in the CASE
structure. RowTypes
one and two are handled correctly but if an unknown RowType
is spotted, the ELSE
branch is executed and a division by zero error is thrown.
SELECT CASE oi.RowType
WHEN 1 THEN 'Included into order'
WHEN 2 THEN 'Excluded from order'
END As Status,
oi.Product,
oi.Amount,
CASE oi.RowType
WHEN 1 THEN oi.Amount * oi.ListPrice
WHEN 2 THEN 0
ELSE 0/0
END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
oi.Product;
When the statement is run, you get a message like the following:
Msg 8134, Level 16, State 1, Line 86
Divide by zero error encountered.
That would prevent a situation where an unknown RowType
could cause false data to be returned. But what about the second one, it didn’t have any CASE
structure in the SELECT
. And even if a similar CASE
would be added to the statement, it wouldn’t work since the WHERE
clause takes care that no other values but 1
is returned.
Well, why not add the CASE
into the WHERE
clause:
SELECT oi.Product,
oi.Amount,
oi.Amount * oi.ListPrice AS Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
AND 1 = CASE oi.RowType
WHEN 1 THEN 1
WHEN 2 THEN 0
ELSE 0/0
END
ORDER BY oi.RowType,
oi.Product;
When the statement above is run, again, a division by zero error is returned.
So, after adding the new RowType
and fixing the statements, the final versions could look something like these:
SELECT CASE oi.RowType
WHEN 1 THEN 'Included into order'
WHEN 2 THEN 'Excluded from order'
WHEN 3 THEN 'Gift'
END As Status,
oi.Product,
oi.Amount,
CASE oi.RowType
WHEN 1 THEN oi.Amount * oi.ListPrice
WHEN 2 THEN 0
WHEN 3 THEN 0
ELSE 0/0
END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
oi.Product;
And:
SELECT oi.Product,
oi.Amount,
CASE oi.RowType
WHEN 1 THEN oi.Amount * oi.ListPrice
WHEN 3 THEN 0
ELSE 0/0
END AS Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
AND 1 = CASE oi.RowType
WHEN 1 THEN 1
WHEN 2 THEN 0
WHEN 3 THEN 1
ELSE 0/0
END
ORDER BY oi.RowType,
oi.Product;
Next time a new RowType is added, both of the statements will react by causing an error and will hopefully get caught already in testing.
So is this the way to always write SELECT
statements? No, it isn’t. The modification in the SELECT
clause is quite safe and can be used more widely but the second one, a CASE
in the WHERE
clause is something to be careful with. Complex structures in conditions may lead into a situation where perfectly valid indexes aren’t used because of an expression. So it’s important to always check how the statement is behaving by investigating the execution plan.
Also, it’s not feasible to use this kind of structure in every place. But in critical statements, it may be justified to add deliberate errors in case of unknown values.