Custom String Split Function with Few Enhancements

Wendelius

5.00/5 (4 votes)

26 Dec 2018CPOL7 min read

9.4K

114

This article shows a custom SQL Server function to split a string delimited by a character or a set of characters. The function also contains a few handy enhancements.

Introduction
CustomStringSplit function
Some tests
CustomStringTableSplit function
Testing CustomStringTableSplit
Ensuring that all rows contain the same number of columns

Introduction

First of all, if you’re looking for a function just to split a string and you’re using SQL Server 2016 or above, consider using STRING_SPLIT (Transact-SQL).

When answering a question recently, I wrote a small function to split a string based on spaces. In order to keep the answer simple, I decided to use recursive CTE (Common Table Expression) for the separation process. With a few enhancements to the answer, the function could be used in several situations like finding individual words from a sentence or splitting a row of CSV data into multiple rows.

The functions presented here contain the following functionality:

Split a string to a set of rows based on a delimiter
Provide ordinal value for each item found
Ability to remove consecutive delimiters if no data is present between them
Remove all occurrences of non-desired characters
Functionality to remove duplicates
If needed, recalculate the ordinals for remaining values
Perform case insensitive comparison for duplicates

CustomStringSplit Function

The complete function looks like this:

SQL

----------------------------------
-- CustomStringSplit
----------------------------------
CREATE FUNCTION CustomStringSplit(@value nvarchar(max),
                                  @delimiter nvarchar(max),
                                  @removeDuplicateDelimiters bit = 0,
                                  @removeChars nvarchar(100) = NULL,
                                  @suppressDuplicates bit = 0,
                                  @recalculateOrdinals bit = 0,
                                  @lowerCaseComparison bit = 0)
RETURNS @items TABLE (
   Ordinal    int,
   Part     nvarchar(max)
) AS
BEGIN
   DECLARE @charCounter int;
   
   -- Remove duplicate delimiters
   IF @removeDuplicateDelimiters = 1 BEGIN
      WHILE CHARINDEX(REPLICATE(@delimiter, 2), @value) > 0 BEGIN
         SET @value = REPLACE(@value, REPLICATE(@delimiter, 2), @delimiter);
      END;
   END;

   -- Remove unwanted characters
   IF @removeChars IS NOT NULL BEGIN
      SET @charCounter = 1;
      WHILE @charCounter <= LEN(@removeChars) BEGIN
         SET @value = REPLACE(@value, SUBSTRING(@removeChars, @charCounter, 1), '');
         SET @charCounter = @charCounter + 1;
      END;
   END;

   -- Split the data into separate rows using a recursive CTE
   WITH Items (Ordinal, Part, Remainder) AS (
      SELECT 1 AS Ordinal,
             CASE
                WHEN CHARINDEX(@delimiter, @value) > 0
                   THEN LEFT(@value, CHARINDEX(@delimiter, @value) - 1)
                ELSE @value
          END AS Part,
          CASE
             WHEN CHARINDEX(@delimiter, @value) > 0
                THEN SUBSTRING(@value, 
                               CHARINDEX(@delimiter, @value) + DATALENGTH(@delimiter) / 2,
                               99999999)
             ELSE NULL
          END AS Remainder
      UNION ALL
      SELECT Items.Ordinal + 1 AS Ordinal,
             CASE
                WHEN CHARINDEX(@delimiter, Remainder) > 0
                   THEN LEFT(Remainder, CHARINDEX(@delimiter, Remainder) - 1)
                ELSE Remainder
             END AS Part,
             CASE
                WHEN CHARINDEX(@delimiter, Remainder) > 0
                   THEN SUBSTRING(Remainder, 
                                  CHARINDEX(@delimiter, Remainder) + DATALENGTH(@delimiter) / 2, 
                                  99999999)
                ELSE NULL
             END AS Remainder
       FROM Items
       WHERE Items.Remainder IS NOT NULL
   )
   INSERT INTO @items (Ordinal, Part)
   SELECT Items.Ordinal, Items.Part
   FROM Items
   OPTION (MAXRECURSION 32767);
   
   -- Remove duplicate values
   IF @suppressDuplicates = 1 BEGIN
      DELETE i FROM @items AS i
      WHERE EXISTS (SELECT 1
                    FROM @items i2
                      WHERE CASE @lowerCaseComparison
                              WHEN 1 THEN LOWER(i2.Part)
                             ELSE i2.Part
                         END = CASE @lowerCaseComparison
                                  WHEN 1 THEN LOWER(i.Part)
                                  ELSE i.Part
                               END
                   AND i2.Ordinal < i.Ordinal);

      -- Recalculate ordinals
      IF @recalculateOrdinals = 1 BEGIN
         UPDATE i
         SET Ordinal = (SELECT COUNT(*)
                        FROM @items i2
                        WHERE i2.Ordinal <= i.Ordinal)
         FROM @items i;
      END;
   END;

   RETURN;
END;

First of all, it returns a table so individual items found from the string will be returned as a result set, a table. This kind of function is a Table Valued Function (TVF). The parameters are:

@value, the string value to split
@delimiter, delimiter to use when splitting the data. Delimiter can be a single character or, for example, a word
@removeDuplicateDelimiters, should consecutive delimiters be removed if no data is present between them?
Value can be either 0 (false) or 1 (true), the default is 0 (false).
@removeChars, what characters should be removed from the result. If the string contains multiple characters, each of them is removed from the string individually.
The default in NULL.
@suppressDuplicates, should duplicate values be removed from the result set?
Value can be either 0 (false) or 1 (true), the default is 0 (false).
@recalculateOrdinals, if duplicate values are removed, should the original ordinal of the first occurrence be retained or should the ordinal numbering be recalculated?
Value can be either 0 (false) or 1 (true), the default is 0 (false).
@lowerCaseComparison, when removing the duplicates, should the comparison be done using lower case (case-insensitivity) or as-is. This is meaningful only if the comparison in the database is case sensitive by default.
Value can be either 0 (false) or 1 (true), the default is 0 (false).

Some Explanations for the Code

SQL

-- Remove duplicate delimiters
IF @removeDuplicateDelimiters = 1 BEGIN
   WHILE CHARINDEX(REPLICATE(@delimiter, 2), @value) > 0 BEGIN
      SET @value = REPLACE(@value, REPLICATE(@delimiter, 2), @delimiter);
   END;
END;

The code above tries to find duplicate delimiters. For example, if the delimiter is ',' the code repeats the delimiter twice resulting into ',,' and replaces all occurrences with a single delimiter. The replacement is done inside a loop since the CHARINDEX function does only a single, forwarding scan on the string so multiple passes are required in case the delimiter is repeated more than twice in the original string.

SQL

-- Remove unwanted characters
IF @removeChars IS NOT NULL BEGIN
   SET @charCounter = 1;
   WHILE @charCounter <= LEN(@removeChars) BEGIN
      SET @value = REPLACE(@value, SUBSTRING(@removeChars, @charCounter, 1), '');
      SET @charCounter = @charCounter + 1;
   END;
END;

Removing the unwanted characters is done by looping through the individual characters and replacing all occurrences with an empty value. In other words, removing them.

SQL

-- Split the data into separate rows using a recursive CTE
WITH Items (Ordinal, Part, Remainder) AS (
   SELECT 1 AS Ordinal,
          CASE
             WHEN CHARINDEX(@delimiter, @value) > 0
                THEN LEFT(@value, CHARINDEX(@delimiter, @value) - 1)
             ELSE @value
       END AS Part,
       CASE
          WHEN CHARINDEX(@delimiter, @value) > 0
             THEN SUBSTRING(@value,
                            CHARINDEX(@delimiter, @value) + DATALENGTH(@delimiter) / 2,
                            99999999)
          ELSE NULL
       END AS Remainder
   UNION ALL
   SELECT Items.Ordinal + 1 AS Ordinal,
          CASE
             WHEN CHARINDEX(@delimiter, Remainder) > 0
                THEN LEFT(Remainder, CHARINDEX(@delimiter, Remainder) - 1)
             ELSE Remainder
          END AS Part,
          CASE
             WHEN CHARINDEX(@delimiter, Remainder) > 0
                THEN SUBSTRING(Remainder,
                               CHARINDEX(@delimiter, Remainder) + DATALENGTH(@delimiter) / 2,
                               99999999)
             ELSE NULL
          END AS Remainder
    FROM Items
    WHERE Items.Remainder IS NOT NULL
)
INSERT INTO @items (Ordinal, Part)
SELECT Items.Ordinal, Items.Part
FROM Items
OPTION (MAXRECURSION 32767);

This is the main thing. The common table expression uses three columns:

Ordinal, each pass increases the ordinal number by one resulting into an incremental row numbering
Part, all the data before the next delimiter in the text
Remainder, remaining text after the next delimiter in the text

For example, if the text would be 'This is a sentence.' and the delimiter is a single space, the query would return the following rows.

Ordinal   Part        Remainder
-------   ---------   ----------------
1         This        is a sentence.
2         is          a sentence.
3         a           sentence.
4         sentence.   NULL

The recursion continues until the value in remainder is NULL and after that, the result is returned to the calling statement. In this case, the statement is an INSERT statement adding the data into the table variable which is going to be returned to the caller in the end.

When creating this, one catch was that I tried to use LEN function. However, LEN does automatically trim trailing blanks so it's not usable in this scenario. When changing LEN to DATALENGTH, the return value must be divided by 2 since the function returns the number of bytes used. For nvarchar, each character uses two bytes.

SQL

DELETE i FROM @items AS i
WHERE EXISTS (SELECT 1
              FROM @items i2
                WHERE CASE @lowerCaseComparison
                        WHEN 1 THEN LOWER(i2.Part)
                       ELSE i2.Part
                   END = CASE @lowerCaseComparison
                            WHEN 1 THEN LOWER(i.Part)
                            ELSE i.Part
                         END
             AND i2.Ordinal < i.Ordinal);

This statement removes the duplicates, either using the default comparison in the database or comparing the lower-case values in order to do a case insensitive comparison, if required.

Few things may look a bit peculiar:

Table alias for the table variable. Since the row from the table variable is referenced in the correlated EXISTS -condition an alias is needed for the table. The syntax to define the alias is
DELETE <table alias> FROM <table variable> AS <table alias>
Comparison in EXISTS. A WHERE clause can contain CASE statements. Normally, these should be avoided because of performance reasons, but in this case, I felt that in majority of the cases, the amount of rows would be so small that the performance penalty is not significant- However, if this causes a problem, this single statement could easily be separated to two different statements.

SQL

-- Recalculate ordinals
      IF @recalculateOrdinals = 1 BEGIN
         UPDATE i
         SET Ordinal = (SELECT COUNT(*)
                        FROM @items i2
                        WHERE i2.Ordinal <= i.Ordinal)
         FROM @items i;
      END;

And the last part recalculates the ordinal numbers of remaining items. If duplicate values are eliminated and recalculation is not done, then the result set contains the ordinal number of the first occurrence for each item.

Some Tests

Separating Words

So let’s run some tests using the function. First, we'll use a small sentence to demonstrate the functionality. Consider the following code:

DECLARE @repeating varchar(max);
SET @repeating = 'You   cannot   end a sentence with because, because because is a conjunction.';
SELECT * FROM CustomStringSplit(@repeating, ' ', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);

The result is:

Ordinal   Part
-------   ----------------
1         You
2         
3         
4         cannot
5         
6         
7         end
8         a
9         sentence
10        with
11        because,
12        because
13        because
14        is
15        a
16        conjunction.

So the whole string is split into individual items delimited by a single space. Now, if you want to remove duplicate delimiters with no data in-between and remove points and commas, the call would look like:

SQL

SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', DEFAULT, DEFAULT, DEFAULT);

And the result would be:

Ordinal   Part
-------   ----------------
1         You
2         cannot
3         end
4         a
5         sentence
6         with
7         because
8         because
9         because
10        is
11        a
12        conjunction

Now we have all words from the sentence. Now if you would like the result to contain each word only once, the call would be for example:

SQL

SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', 1, 0, 0);

Now the result is:

Ordinal   Part
-------   -------------------
1         You
2         cannot
3         end
4         a
5         sentence
6         with
7         because
10        is
12        conjunction

Notice that while only nine rows are returned, the ordinal numbering goes up to 12, showing the original position of each word. If you would like the ordinal numbering recalculated, the call would be:

SQL

SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', 1, 1, 0);

And now, the result is:

Ordinal   Part
-------   -------------------
1         You
2         cannot
3         end
4         a
5         sentence
6         with
7         because
8         is
9         conjunction

Using a Word as a Delimiter

For a different type of test, let’s use a single word as a delimiter. Consider the following case:

SQL

DECLARE @repeating varchar(max);
SET @repeating = 'This is first part and then the second and  and '
               + 'after that comes the third and finally the fourth.';
SELECT * 
FROM CustomStringSplit(@repeating, ' and ', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);

The call above would split the sentence into five parts. So the result is:

Ordinal   Part
-------   -------------------
1         This is first part
2         then the second
3         
4         after that comes the third
5         finally the fourth.

Splitting a Single Row of CSV Data

Split functions are often used for splitting CSV (Comma Separated Values) data into individual items. You can try with the following:

SQL

DECLARE @singleCsvRow varchar(max);
SET @singleCsvRow = '1,2,3,''A B'',''C D'',1';
SELECT * 
FROM CustomStringSplit(@singleCsvRow, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);

The call above would result in:

Ordinal   Part
-------   -------------------
1         1
2         2
3         3
4         'A B'
5         'C D'
6         1

If you’d like to remove the apostrophes from the text data, the call would be:

SQL

SELECT * FROM CustomStringSplit(@singleCsvRow, ',', 1 , '''', DEFAULT, DEFAULT, DEFAULT);

The result is almost the same, only without apostrophes:

Ordinal   Part
-------   -------------------
1         1
2         2
3         3
4         A B
5         C D
6         1

CustomStringTableSplit Function

To take the CSV handling a bit further, the function presented in the beginning can split an individual CSV row. However, typically the requirement is to split all CSV rows for example from a file. In order to do that, let’s create a small utility function to handle multiple rows.

First, we need a type that we can use for a parameter definition:

SQL

----------------------------------
-- StringItems type
----------------------------------
CREATE TYPE StringItems AS TABLE (
  StringItem nvarchar(max)
);

Then the function:

SQL

----------------------------------
-- CustomStringTableSplit
----------------------------------
CREATE FUNCTION CustomStringTableSplit(@values StringItems READONLY,
                                       @delimiter nvarchar(max),
                                       @removeDuplicateDelimiters bit = 0,
                                       @removeChars nvarchar(100) = NULL,
                                       @suppressDuplicates bit = 0,
                                       @recalculateOrdinals bit = 0,
                                       @lowerCaseComparison bit = 0)
RETURNS @items TABLE (
   RowNumber int,
   Ordinal     int,
   Part      nvarchar(max)
) AS
BEGIN
   DECLARE @rowNumber int;
   DECLARE @value     nvarchar(max);
   DECLARE valueCursor CURSOR FOR SELECT StringItem
                                  FROM @values;
   
   OPEN valueCursor;  
   FETCH NEXT FROM valueCursor INTO @value;  
   SET @rowNumber = 1;
   WHILE @@FETCH_STATUS = 0  
      BEGIN  
         INSERT INTO @items
             SELECT @rowNumber,
                    i.Ordinal,
                    i.Part
             FROM CustomStringSplit(@value,
                                    @delimiter,
                                    @removeDuplicateDelimiters,
                                    @removeChars,
                                    @suppressDuplicates,
                                    @recalculateOrdinals,
                                    @lowerCaseComparison) i;

         FETCH NEXT FROM valueCursor INTO @value;  
         SET @rowNumber = @rowNumber + 1;
      END;  
   CLOSE valueCursor;  
   DEALLOCATE valueCursor;

   RETURN;
END;

The function above receives basically all the same parameters as the function before. However, this time, the value isn’t a single row but a set of rows using the type defined earlier.

The code creates a cursor for the table passed as a parameter and when looping through the data, all the rows returned from CustomStringSplit are appended to the result set. The result set for CustomStringTableSplit contains all same columns as the previous function but also an additional column RowNumber, which defines the number of the row in the original data from which the returned rows were extracted.

Testing CustomStringTableSplit

Now let’s run a few tests with CSV data:

SQL

DECLARE  @items StringItems
INSERT INTO @items VALUES
('1,2,3,4,5'),
('''A'',''B'',''C'',,''E'''),
('4,5'),
('a,B,c');

SELECT * FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);

The result from the call above would be:

RowNumber   Ordinal   Part
---------   -------   -----------
1           1         1
1           2         2
1           3         3
1           4         4
1           5         5
2           1         'A'
2           2         'B'
2           3         'C'
2           4         
2           5         'E'
3           1         4
3           2         5
4           1         a
4           2         B
4           3         c

Again, if we would like to remove the apostrophes, it would look like:

SQL

SELECT * FROM CustomStringTableSplit(@items, ',', 0, '''', 0, 0, 0);

And now the result is:

RowNumber   Ordinal   Part 
---------   -------   -----------
1           1         1
1           2         2
1           3         3
1           4         4
1           5         5
2           1         A
2           2         B
2           3         C
2           4         
2           5         E
3           1         4
3           2         5
4           1         a
4           2         B
4           3         c

So this kind of function would be easy to use to parse a complete CSV file into a result set containing originating row and column information.

Ensuring That All Rows Contain the Same Number of Columns

One more example is to check if the CSV data contains the same number of columns for all rows. Looking at the test data in the previous example, the rows contain columns as follows:

Row 1, five columns
Row 2, five columns (one empty)
Row 3, two columns
Row 4, three columns

We can use a small CTE to get the number of columns for each row. Consider the following:

SQL

DECLARE  @items StringItems

INSERT INTO @items VALUES
('1,2,3,4,5'),
('''A'',''B'',''C'',,''E'''),
('4,5'),
('a,B,c');

WITH RowColumns (RowNumber, RowMax) AS (
   SELECT a.RowNumber, MAX(a.Ordinal)
   FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) a
   GROUP BY a.RowNumber
)
SELECT * FROM RowColumns;

The result is:

RowNumber   RowMax
---------   ------
1           5
2           5
3           2
4           3

As you can see, there are different number of columns. So if we just want to know if all rows contain the same number of rows, the query could be:

SQL

WITH RowColumns (RowNumber, RowMax) AS (
   SELECT a.RowNumber, MAX(a.Ordinal)
   FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) a
   GROUP BY a.RowNumber
)
SELECT CASE (SELECT COUNT(DISTINCT RowMax) FROM RowColumns)
          WHEN 1 THEN 'All of the rows contain same amount of columns'
          ELSE 'The amount of columns is different between rows'
       END;

Now the result is:

The amount of columns is different between rows

History

26^thDecember, 2018: Created

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Custom String Split Function with Few Enhancements

Table of Contents

History

License