String Aggregation in the World of SQL Server

Sergii Syrovatchenko

4.56/5 (14 votes)

4 Dec 2013CPOL2 min read

132.1K

This article includes examples of T-SQL capabilities that allow to perform string concatenation flexibly and effectively through the use of different constructs.

Introduction

This articles includes examples of T-SQL capabilities that allow to perform string concatenation more flexibly and effectively through the use of other constructs.

Background

In practice, tasks within the scope of concatenating strings into one come across quite often. It is a pity that the T-SQL standard does not provide for the possibility of using string data inside the aggregating SUM function:

Msg 8117, Level 16, State 1, Line 1 
Operand data type char is invalid for sum operator.

As a solution for such tasks, there was added a GROUP_CONCAT function for MySQL, and LISTAGG – for Oracle. SQL Server, in turn, has no such built-in functionality.

However, this shouldn't be considered as a disadvantage, since T-SQL capabilities allow to perform string concatenation more flexibly and effectively through the use of other constructs that will be revealed below.

Using the code

Suppose we need to concatenate several strings separated by comma in one, using data from the following table:

SQL

IF OBJECT_ID('dbo.Chars', 'U') IS NOT NULL
    DROP TABLE dbo.Chars
GO
CREATE TABLE dbo.Chars ([Char] CHAR(1) PRIMARY KEY)
INSERT INTO dbo.Chars ([Char])
VALUES ('A'), ('B'), ('C'), ('F'), ('D')

The most obvious solution for this task is using of a cursor:

SQL

DECLARE 
      @Chars VARCHAR(100)
    , @Char CHAR(1)
DECLARE cur CURSOR LOCAL READ_ONLY FAST_FORWARD FOR
    SELECT [Char]
    FROM dbo.Chars
OPEN cur
FETCH NEXT FROM cur INTO @Char
WHILE @@FETCH_STATUS = 0 BEGIN
    SET @Chars = ISNULL(@Chars + ', ' + @Char, @Char) 
    FETCH NEXT FROM cur INTO @Char
END
CLOSE cur
DEALLOCATE cur
SELECT @Chars

Using of a cursor itself causes reduced efficiency of query execution and, at least, doesn't look very elegant.

To avoid the use of a cursor, the strings can be concatenated via variables assigning:

SQL

DECLARE @Chars VARCHAR(100)
SELECT @Chars = ISNULL(@Chars + ', ' +[Char], [Char])   
FROM dbo.Chars
SELECT @Chars

On the one hand, the construct is rather simple; on the other hand, its performance on a large fetch leaves much to be desired.

Thus, this method will be insolvent in case when concatenation inside a SELECT block is required.

In such cases, XML features are resorted to, using the following construct:

SQL

SELECT Chars = STUFF((
	SELECT ', ' + [Char]
	FROM dbo.Chars
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'), 1, 2, '')

If we look at the execution plan retrieved in dbForge Studio for SQL Server, – we can notice high query cost for the value method:

To avoid this operation, the query can be rewritten using XQuery properties:

SQL

SELECT Chars = STUFF(CAST((
	SELECT [text()] = ', ' + [Char]
	FROM dbo.Chars
	FOR XML PATH(''), TYPE) AS
VARCHAR(100)), 1, 2, '')

As a result, we will get a very simple and fast execution plan:

In principle, string concatenation by a column does not cause special difficulties.

A more interesting situation is when concatenation of several columns is needed. For example, we have the following table:

SQL

IF OBJECT_ID('dbo.EntityValues', 'U') IS NOT NULL
    DROP TABLE dbo.EntityValues
GO
CREATE TABLE dbo.EntityValues (
      EntityID INT
    , Value1 CHAR(1)
    , Value2 CHAR(1) 
)
CREATE NONCLUSTERED INDEX IX_WorkOut_EntityID
	ON dbo.EntityValues (EntityID)
GO 
INSERT INTO dbo.EntityValues (EntityID, Value1, Value2)
VALUES 
    (1, 'A', 'X'),
    (2, 'B', 'Y'),
    (2, 'C', 'Z'),
    (2, 'F', 'H'),
    (1, 'D', 'R')

in which data must be grouped in the following way:

Alternatively, we can copy XML calls, but then we get re-reading, which may significantly affect the query execution efficiency:

SQL

SELECT 
      ev.EntityID
    , Values1 = STUFF(CAST((
        SELECT [text()] = ', ' + ev2.Value1
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH(''), TYPE) AS VARCHAR(100)), 1, 2, '')
    , Values2 = STUFF(CAST((
        SELECT [text()] = ', ' + ev2.Value2
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH(''), TYPE) AS VARCHAR(100)), 1, 2, '')             
FROM ( 
    SELECT DISTINCT EntityID
    FROM dbo.EntityValues
) ev

This can be easily convinced, if to look at the execution plan:

To reduce re-reading, a small XML hack can be used:

SQL

SELECT
      ev.EntityID
    , Values1 = STUFF(
        REPLACE(
            CAST([XML].query('for $a in /a return xs:string($a)') 
        AS VARCHAR(100)), ' ,', ','), 1, 1, '') 
    , Values2 = STUFF(
        REPLACE(
            CAST([XML].query('for $b in /b return xs:string($b)') 
        AS VARCHAR(100)), ' ,', ','), 1, 1, '') 
FROM (
    SELECT DISTINCT EntityID
    FROM dbo.EntityValues
) ev
CROSS APPLY (
    SELECT [XML] = CAST((
        SELECT 
              [a] = ', ' + ev2.Value1
            , [b] = ', ' + ev2.Value2
        FROM dbo.EntityValues ev2
        WHERE ev2.EntityID = ev.EntityID
        FOR XML PATH('')
    ) AS XML)
) t

But this query won't be optimal as well due to multiple calls of the query method.

Alternative way – using a cursor:

SQL

IF
OBJECT_ID('tempdb.dbo.#EntityValues') IS NOT NULL
	DROP TABLE #EntityValues
GO
SELECT DISTINCT
	  EntityID
	, Values1 = CAST(NULL AS VARCHAR(100))
	, Values2 = CAST(NULL AS VARCHAR(100))
INTO #EntityValues
FROM dbo.EntityValues
DECLARE
	  @EntityID INT
	, @Value1 CHAR(1)
	, @Value2 CHAR(1)
DECLARE cur CURSOR LOCAL READ_ONLY FAST_FORWARD FOR
    SELECT
	      EntityID
	    , Value1
	    , Value2
    FROM dbo.EntityValues
OPEN cur
FETCH NEXT FROM cur INTO
	  @EntityID
	, @Value1
	, @Value2
WHILE @@FETCH_STATUS = 0 BEGIN
UPDATE #EntityValues
    SET 
          Values1 = ISNULL(Values1 + ', ' + @Value1, @Value1) 
        , Values2 = ISNULL(Values2 + ', ' + @Value2, @Value2)
    WHERE EntityID = @EntityID
FETCH NEXT FROM cur INTO
          @EntityID
        , @Value1
        , @Value2
END
CLOSE cur
DEALLOCATE cur
SELECT *
FROM #EntityValues

However, as practice has shown, when working with large ETL packages, the most efficient solution is the ability to assign variables in the UPDATE statement:

SQL

IF
OBJECT_ID('tempdb.dbo.#EntityValues') IS NOT NULL
	DROP TABLE #EntityValues
GO
DECLARE
      @Values1 VARCHAR(100)
    , @Values2 VARCHAR(100)
SELECT
      EntityID
    , Value1
    , Value2
    , RowNum = ROW_NUMBER() OVER (PARTITION BY EntityID ORDER BY 1/0)
    , Values1 = CAST(NULL AS VARCHAR(100))
    , Values2 = CAST(NULL AS VARCHAR(100))
INTO #EntityValues
FROM dbo.EntityValues
UPDATE #EntityValues
SET 
      @Values1 = Values1 =
        CASE WHEN RowNum = 1 
            THEN Value1
            ELSE @Values1 + ', ' + Value1 
        END
    , @Values2 = Values2 = 
        CASE WHEN RowNum = 1 
            THEN Value2
            ELSE @Values2 + ', ' + Value2 
        END
SELECT
      EntityID
    , Values1 = MAX(Values1) 
    , Values2 = MAX(Values2)
FROM #EntityValues
GROUP BY EntityID

SQL Server has no built-in alternative to GROUP_CONCAT and LISTAGG functions. Nevertheless, this doesn't prevent performing string concatenation tasks efficiently. The objective of this article is to show it clearly.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)