How to Format Repeating Values in SQL Result Set

Wendelius

5.00/5 (8 votes)

13 Oct 2016CPOL2 min read

14.1K

A little tip on how to format repeating values in SQL output using LAG function

Download source code - 514 B

Introduction

Overall, formatting the output data from the database server should never be done in SQL. There are a lot of better tools and languages to do the formatting. SQL is primarily designed for data retrieval and manipulation, not for creating fancy output.

Having that said, a question that often arises is, "How can I remove repeating values for a column in my result set?". An easy answer lies in the LAG function: LAG (Transact-SQL). The LAG function retrieves the value from the previous row in the result set using a defined order. By using this function, we can compare the value from the previous row to the value on the current row. Let's take an example.

The Code

First, we need a test table and some data. In order to have those, run the following statements:

SQL

CREATE Table MyReportData (
   Item     varchar(100) NOT NULL,
   SubItem1 varchar(100) NOT NULL,
   SubItem2 varchar(100) NULL
);
GO

INSERT INTO MyReportData (Item, SubItem1, SubItem2) VALUES
('First item',  'Sub item 1', NULL),
('First item',  'Sub item 2', 'Extra data'),
('First item',  'Sub item 2', 'Additional data'),
('First item',  'Sub item 3', NULL),
('Second item', 'Sub item 1', 'Info'),
('Second item', 'Sub item 2', 'Data'),
('Second item', 'Sub item 3', NULL),
('Third item',  'Sub item 1', 'Nonsense'),
('Third item',  'Sub item 1', 'Even more nonsense'),
('Second item', 'Sub item 2', NULL);
GO

Now if we just fetch the data using query like this:

SQL

SELECT Item,
       SubItem1,
       SubItem2
FROM   MyReportData
ORDER BY Item,
         SubItem1;

We get the following result:

SQL

Item         SubItem1     SubItem2
----------   --------     --------
First item   Sub item 1  
First item   Sub item 2   Extra data
First item   Sub item 2   Additional data
First item   Sub item 3  
Second item  Sub item 1   Info
Second item  Sub item 2   Data
Second item  Sub item 2  
Second item  Sub item 3  
Third item   Sub item 1   Nonsense
Third item   Sub item 1   Even more nonsense

If the goal was to remove the repeating values in the first two columns, we can do it by running the following query:

SQL

SELECT CASE
          WHEN Item = LAG(Item) OVER (ORDER BY Item, SubItem1) THEN NULL
          ELSE Item
       END AS Item,
       CASE
          WHEN SubItem1 = LAG(SubItem1) OVER (ORDER BY Item, SubItem1) THEN NULL
          ELSE SubItem1
       END AS SubItem1,
       SubItem2
FROM   MyReportData
ORDER BY Item,
         SubItem1;

The CASE structure compares the value from the previous row to the current value. If the value is the same, then return NULL but if the value differs, then return the value itself. The previous row is defined by ordering the data:

First by Item column
Then by SubItem1 column

Now the result would be:

SQL

Item         SubItem1     SubItem2
----------   --------     --------
                          Additional data
                         
                          Even more nonsense
             Sub item 2   Data
             Sub item 2   Extra data
             Sub item 3  
             Sub item 3  
First item   Sub item 1  
Second item  Sub item 1   Info
Third item   Sub item 1   Nonsense

While the data is correct, we no longer have repeating values, it isn't quite what we expected. The rows are obviously in the wrong order.

So why is this happening? The reason is that I gave an alias name for the first two columns and I used the exact same name as an alias that is the name of the original column. So now the ORDER BY clause is actually referring to the aliases, not to the original columns. When the statement is executed, the sorting is done with the modified values from the CASE expressions, not with the original values.

Now when we understand the reason, it's easy to fix this problem. Let's just use some other alias name for columns and ensure that the ordering of the result set still takes place based on the original values. Like this:

SQL

SELECT CASE
          WHEN Item = LAG(Item) OVER (ORDER BY Item, SubItem1, SubItem2) THEN NULL
          ELSE Item
       END AS FormattedItem,
       CASE
          WHEN SubItem1 = LAG(SubItem1) OVER (ORDER BY Item, SubItem1, SubItem2) THEN NULL
          ELSE SubItem1
       END AS FormattedSubItem1,
       SubItem2
FROM   MyReportData
ORDER BY Item,
         SubItem1;

And now the result is:

FormattedItem   FormattedSubItem1   SubItem2
-------------   -----------------   --------
First item      Sub item 1          
                Sub item 2          Extra data
                                    Additional data
                Sub item 3          
Second item     Sub item 1          Info
                Sub item 2          Data
                                    
                Sub item 3          
Third item      Sub item 1          Nonsense
                                    Even more nonsense

To be honest, this isn't a foolproof solution yet. The ordering is done only by first two columns and they may contain the same values. Because of this, the ordering of individual rows having the same data in two first columns may change depending on how the data is fetched and handled by SQL Server. In the real situation, it would be a good idea to add a third, uniquely valued column to the ORDER BY clauses to define an exact order.

References

History

13^th October, 2016: Created

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)