Table of Contents
First of all, if you’re looking for a function just to split a string and you’re using SQL Server 2016 or above, consider using STRING_SPLIT (Transact-SQL).
When answering a question recently, I wrote a small function to split a string based on spaces. In order to keep the answer simple, I decided to use recursive CTE (Common Table Expression) for the separation process. With a few enhancements to the answer, the function could be used in several situations like finding individual words from a sentence or splitting a row of CSV data into multiple rows.
The functions presented here contain the following functionality:
- Split a string to a set of rows based on a delimiter
- Provide ordinal value for each item found
- Ability to remove consecutive delimiters if no data is present between them
- Remove all occurrences of non-desired characters
- Functionality to remove duplicates
- If needed, recalculate the ordinals for remaining values
- Perform case insensitive comparison for duplicates
The complete function looks like this:
CREATE FUNCTION CustomStringSplit(@value nvarchar(max),
@delimiter nvarchar(max),
@removeDuplicateDelimiters bit = 0,
@removeChars nvarchar(100) = NULL,
@suppressDuplicates bit = 0,
@recalculateOrdinals bit = 0,
@lowerCaseComparison bit = 0)
RETURNS @items TABLE (
Ordinal int,
Part nvarchar(max)
) AS
BEGIN
DECLARE @charCounter int;
IF @removeDuplicateDelimiters = 1 BEGIN
WHILE CHARINDEX(REPLICATE(@delimiter, 2), @value) > 0 BEGIN
SET @value = REPLACE(@value, REPLICATE(@delimiter, 2), @delimiter);
END;
END;
IF @removeChars IS NOT NULL BEGIN
SET @charCounter = 1;
WHILE @charCounter <= LEN(@removeChars) BEGIN
SET @value = REPLACE(@value, SUBSTRING(@removeChars, @charCounter, 1), '');
SET @charCounter = @charCounter + 1;
END;
END;
WITH Items (Ordinal, Part, Remainder) AS (
SELECT 1 AS Ordinal,
CASE
WHEN CHARINDEX(@delimiter, @value) > 0
THEN LEFT(@value, CHARINDEX(@delimiter, @value) - 1)
ELSE @value
END AS Part,
CASE
WHEN CHARINDEX(@delimiter, @value) > 0
THEN SUBSTRING(@value,
CHARINDEX(@delimiter, @value) + DATALENGTH(@delimiter) / 2,
99999999)
ELSE NULL
END AS Remainder
UNION ALL
SELECT Items.Ordinal + 1 AS Ordinal,
CASE
WHEN CHARINDEX(@delimiter, Remainder) > 0
THEN LEFT(Remainder, CHARINDEX(@delimiter, Remainder) - 1)
ELSE Remainder
END AS Part,
CASE
WHEN CHARINDEX(@delimiter, Remainder) > 0
THEN SUBSTRING(Remainder,
CHARINDEX(@delimiter, Remainder) + DATALENGTH(@delimiter) / 2,
99999999)
ELSE NULL
END AS Remainder
FROM Items
WHERE Items.Remainder IS NOT NULL
)
INSERT INTO @items (Ordinal, Part)
SELECT Items.Ordinal, Items.Part
FROM Items
OPTION (MAXRECURSION 32767);
IF @suppressDuplicates = 1 BEGIN
DELETE i FROM @items AS i
WHERE EXISTS (SELECT 1
FROM @items i2
WHERE CASE @lowerCaseComparison
WHEN 1 THEN LOWER(i2.Part)
ELSE i2.Part
END = CASE @lowerCaseComparison
WHEN 1 THEN LOWER(i.Part)
ELSE i.Part
END
AND i2.Ordinal < i.Ordinal);
IF @recalculateOrdinals = 1 BEGIN
UPDATE i
SET Ordinal = (SELECT COUNT(*)
FROM @items i2
WHERE i2.Ordinal <= i.Ordinal)
FROM @items i;
END;
END;
RETURN;
END;
First of all, it returns a table so individual items found from the string will be returned as a result set, a table. This kind of function is a Table Valued Function (TVF). The parameters are:
@value
, the string value to split @delimiter
, delimiter to use when splitting the data. Delimiter can be a single character or, for example, a word @removeDuplicateDelimiters
, should consecutive delimiters be removed if no data is present between them?
Value can be either 0
(false) or 1
(true), the default is 0
(false). @removeChars
, what characters should be removed from the result. If the string contains multiple characters, each of them is removed from the string individually.
The default in NULL
. @suppressDuplicates
, should duplicate values be removed from the result set?
Value can be either 0
(false) or 1
(true), the default is 0
(false). @recalculateOrdinals
, if duplicate values are removed, should the original ordinal of the first occurrence be retained or should the ordinal numbering be recalculated?
Value can be either 0
(false) or 1
(true), the default is 0
(false). @lowerCaseComparison
, when removing the duplicates, should the comparison be done using lower case (case-insensitivity) or as-is. This is meaningful only if the comparison in the database is case sensitive by default.
Value can be either 0
(false) or 1
(true), the default is 0
(false).
IF @removeDuplicateDelimiters = 1 BEGIN
WHILE CHARINDEX(REPLICATE(@delimiter, 2), @value) > 0 BEGIN
SET @value = REPLACE(@value, REPLICATE(@delimiter, 2), @delimiter);
END;
END;
The code above tries to find duplicate delimiters. For example, if the delimiter is ','
the code repeats the delimiter twice resulting into ',,'
and replaces all occurrences with a single delimiter. The replacement is done inside a loop since the CHARINDEX
function does only a single, forwarding scan on the string so multiple passes are required in case the delimiter is repeated more than twice in the original string.
IF @removeChars IS NOT NULL BEGIN
SET @charCounter = 1;
WHILE @charCounter <= LEN(@removeChars) BEGIN
SET @value = REPLACE(@value, SUBSTRING(@removeChars, @charCounter, 1), '');
SET @charCounter = @charCounter + 1;
END;
END;
Removing the unwanted characters is done by looping through the individual characters and replacing all occurrences with an empty value. In other words, removing them.
WITH Items (Ordinal, Part, Remainder) AS (
SELECT 1 AS Ordinal,
CASE
WHEN CHARINDEX(@delimiter, @value) > 0
THEN LEFT(@value, CHARINDEX(@delimiter, @value) - 1)
ELSE @value
END AS Part,
CASE
WHEN CHARINDEX(@delimiter, @value) > 0
THEN SUBSTRING(@value,
CHARINDEX(@delimiter, @value) + DATALENGTH(@delimiter) / 2,
99999999)
ELSE NULL
END AS Remainder
UNION ALL
SELECT Items.Ordinal + 1 AS Ordinal,
CASE
WHEN CHARINDEX(@delimiter, Remainder) > 0
THEN LEFT(Remainder, CHARINDEX(@delimiter, Remainder) - 1)
ELSE Remainder
END AS Part,
CASE
WHEN CHARINDEX(@delimiter, Remainder) > 0
THEN SUBSTRING(Remainder,
CHARINDEX(@delimiter, Remainder) + DATALENGTH(@delimiter) / 2,
99999999)
ELSE NULL
END AS Remainder
FROM Items
WHERE Items.Remainder IS NOT NULL
)
INSERT INTO @items (Ordinal, Part)
SELECT Items.Ordinal, Items.Part
FROM Items
OPTION (MAXRECURSION 32767);
This is the main thing. The common table expression uses three columns:
Ordinal
, each pass increases the ordinal number by one resulting into an incremental row numbering Part
, all the data before the next delimiter in the text Remainder
, remaining text after the next delimiter in the text
For example, if the text would be 'This is a sentence.'
and the delimiter is a single space, the query would return the following rows.
Ordinal Part Remainder
------- --------- ----------------
1 This is a sentence.
2 is a sentence.
3 a sentence.
4 sentence. NULL
The recursion continues until the value in remainder is NULL
and after that, the result is returned to the calling statement. In this case, the statement is an INSERT
statement adding the data into the table variable which is going to be returned to the caller in the end.
When creating this, one catch was that I tried to use LEN
function. However, LEN
does automatically trim trailing blanks so it's not usable in this scenario. When changing LEN
to DATALENGTH
, the return value must be divided by 2
since the function returns the number of bytes used. For nvarchar
, each character uses two bytes.
DELETE i FROM @items AS i
WHERE EXISTS (SELECT 1
FROM @items i2
WHERE CASE @lowerCaseComparison
WHEN 1 THEN LOWER(i2.Part)
ELSE i2.Part
END = CASE @lowerCaseComparison
WHEN 1 THEN LOWER(i.Part)
ELSE i.Part
END
AND i2.Ordinal < i.Ordinal);
This statement removes the duplicates, either using the default comparison in the database or comparing the lower-case values in order to do a case insensitive comparison, if required.
Few things may look a bit peculiar:
- Table alias for the table variable. Since the row from the table variable is referenced in the correlated
EXISTS
-condition an alias is needed for the table. The syntax to define the alias is
DELETE <table alias> FROM <table variable> AS <table alias>
- Comparison in
EXISTS
. A WHERE
clause can contain CASE
statements. Normally, these should be avoided because of performance reasons, but in this case, I felt that in majority of the cases, the amount of rows would be so small that the performance penalty is not significant- However, if this causes a problem, this single statement could easily be separated to two different statements.
IF @recalculateOrdinals = 1 BEGIN
UPDATE i
SET Ordinal = (SELECT COUNT(*)
FROM @items i2
WHERE i2.Ordinal <= i.Ordinal)
FROM @items i;
END;
And the last part recalculates the ordinal numbers of remaining items. If duplicate values are eliminated and recalculation is not done, then the result set contains the ordinal number of the first occurrence for each item.
So let’s run some tests using the function. First, we'll use a small sentence to demonstrate the functionality. Consider the following code:
DECLARE @repeating varchar(max);
SET @repeating = 'You cannot end a sentence with because, because because is a conjunction.';
SELECT * FROM CustomStringSplit(@repeating, ' ', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);
The result is:
Ordinal Part
------- ----------------
1 You
2
3
4 cannot
5
6
7 end
8 a
9 sentence
10 with
11 because,
12 because
13 because
14 is
15 a
16 conjunction.
So the whole string is split into individual items delimited by a single space. Now, if you want to remove duplicate delimiters with no data in-between and remove points and commas, the call would look like:
SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', DEFAULT, DEFAULT, DEFAULT);
And the result would be:
Ordinal Part
------- ----------------
1 You
2 cannot
3 end
4 a
5 sentence
6 with
7 because
8 because
9 because
10 is
11 a
12 conjunction
Now we have all words from the sentence. Now if you would like the result to contain each word only once, the call would be for example:
SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', 1, 0, 0);
Now the result is:
Ordinal Part
------- -------------------
1 You
2 cannot
3 end
4 a
5 sentence
6 with
7 because
10 is
12 conjunction
Notice that while only nine rows are returned, the ordinal numbering goes up to 12
, showing the original position of each word. If you would like the ordinal numbering recalculated, the call would be:
SELECT * FROM CustomStringSplit(@repeating, ' ', 1 , '.,', 1, 1, 0);
And now, the result is:
Ordinal Part
------- -------------------
1 You
2 cannot
3 end
4 a
5 sentence
6 with
7 because
8 is
9 conjunction
For a different type of test, let’s use a single word as a delimiter. Consider the following case:
DECLARE @repeating varchar(max);
SET @repeating = 'This is first part and then the second and and '
+ 'after that comes the third and finally the fourth.';
SELECT *
FROM CustomStringSplit(@repeating, ' and ', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);
The call above would split the sentence into five parts. So the result is:
Ordinal Part
------- -------------------
1 This is first part
2 then the second
3
4 after that comes the third
5 finally the fourth.
Split functions are often used for splitting CSV (Comma Separated Values) data into individual items. You can try with the following:
DECLARE @singleCsvRow varchar(max);
SET @singleCsvRow = '1,2,3,''A B'',''C D'',1';
SELECT *
FROM CustomStringSplit(@singleCsvRow, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);
The call above would result in:
Ordinal Part
------- -------------------
1 1
2 2
3 3
4 'A B'
5 'C D'
6 1
If you’d like to remove the apostrophes from the text data, the call would be:
SELECT * FROM CustomStringSplit(@singleCsvRow, ',', 1 , '''', DEFAULT, DEFAULT, DEFAULT);
The result is almost the same, only without apostrophes:
Ordinal Part
------- -------------------
1 1
2 2
3 3
4 A B
5 C D
6 1
To take the CSV handling a bit further, the function presented in the beginning can split an individual CSV row. However, typically the requirement is to split all CSV rows for example from a file. In order to do that, let’s create a small utility function to handle multiple rows.
First, we need a type that we can use for a parameter definition:
CREATE TYPE StringItems AS TABLE (
StringItem nvarchar(max)
);
Then the function:
CREATE FUNCTION CustomStringTableSplit(@values StringItems READONLY,
@delimiter nvarchar(max),
@removeDuplicateDelimiters bit = 0,
@removeChars nvarchar(100) = NULL,
@suppressDuplicates bit = 0,
@recalculateOrdinals bit = 0,
@lowerCaseComparison bit = 0)
RETURNS @items TABLE (
RowNumber int,
Ordinal int,
Part nvarchar(max)
) AS
BEGIN
DECLARE @rowNumber int;
DECLARE @value nvarchar(max);
DECLARE valueCursor CURSOR FOR SELECT StringItem
FROM @values;
OPEN valueCursor;
FETCH NEXT FROM valueCursor INTO @value;
SET @rowNumber = 1;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO @items
SELECT @rowNumber,
i.Ordinal,
i.Part
FROM CustomStringSplit(@value,
@delimiter,
@removeDuplicateDelimiters,
@removeChars,
@suppressDuplicates,
@recalculateOrdinals,
@lowerCaseComparison) i;
FETCH NEXT FROM valueCursor INTO @value;
SET @rowNumber = @rowNumber + 1;
END;
CLOSE valueCursor;
DEALLOCATE valueCursor;
RETURN;
END;
The function above receives basically all the same parameters as the function before. However, this time, the value isn’t a single row but a set of rows using the type defined earlier.
The code creates a cursor for the table passed as a parameter and when looping through the data, all the rows returned from CustomStringSplit
are appended to the result set. The result set for CustomStringTableSplit
contains all same columns as the previous function but also an additional column RowNumber
, which defines the number of the row in the original data from which the returned rows were extracted.
Now let’s run a few tests with CSV data:
DECLARE @items StringItems
INSERT INTO @items VALUES
('1,2,3,4,5'),
('''A'',''B'',''C'',,''E'''),
('4,5'),
('a,B,c');
SELECT * FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT);
The result from the call above would be:
RowNumber Ordinal Part
--------- ------- -----------
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
2 1 'A'
2 2 'B'
2 3 'C'
2 4
2 5 'E'
3 1 4
3 2 5
4 1 a
4 2 B
4 3 c
Again, if we would like to remove the apostrophes, it would look like:
SELECT * FROM CustomStringTableSplit(@items, ',', 0, '''', 0, 0, 0);
And now the result is:
RowNumber Ordinal Part
--------- ------- -----------
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
2 1 A
2 2 B
2 3 C
2 4
2 5 E
3 1 4
3 2 5
4 1 a
4 2 B
4 3 c
So this kind of function would be easy to use to parse a complete CSV file into a result set containing originating row and column information.
One more example is to check if the CSV data contains the same number of columns for all rows. Looking at the test data in the previous example, the rows contain columns as follows:
- Row 1, five columns
- Row 2, five columns (one empty)
- Row 3, two columns
- Row 4, three columns
We can use a small CTE to get the number of columns for each row. Consider the following:
DECLARE @items StringItems
INSERT INTO @items VALUES
('1,2,3,4,5'),
('''A'',''B'',''C'',,''E'''),
('4,5'),
('a,B,c');
WITH RowColumns (RowNumber, RowMax) AS (
SELECT a.RowNumber, MAX(a.Ordinal)
FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) a
GROUP BY a.RowNumber
)
SELECT * FROM RowColumns;
The result is:
RowNumber RowMax
--------- ------
1 5
2 5
3 2
4 3
As you can see, there are different number of columns. So if we just want to know if all rows contain the same number of rows, the query could be:
WITH RowColumns (RowNumber, RowMax) AS (
SELECT a.RowNumber, MAX(a.Ordinal)
FROM CustomStringTableSplit(@items, ',', DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT) a
GROUP BY a.RowNumber
)
SELECT CASE (SELECT COUNT(DISTINCT RowMax) FROM RowColumns)
WHEN 1 THEN 'All of the rows contain same amount of columns'
ELSE 'The amount of columns is different between rows'
END;
Now the result is:
The amount of columns is different between rows
History
- 26th December, 2018: Created