Introduction
With the advent of CLR integration into SQL Server 2005, it has become incredibly easy to extend the power of the T-SQL programming language. Two of the areas that can be improved upon by way of CLR integration are string
matching and string
manipulation.
Background
T-SQL has a handful of basic string
matching functions (e.g. CHARINDEX
, PATINDEX
, SOUNDEX
) and string
matching operators (e.g. =
, <>
, <
, >
, LIKE
). These are insufficient for tasks such as creating a check constraint on a column of email addresses where you want to ensure only valid email addresses exist. The most common way to validate string
s like this is by using regular expressions. This is where CLR integration comes into play.
SQL Server 2005 now allows you to create user defined functions (among other things) using your .NET language of choice. This article will demonstrate how to develop a set of general purpose, user-defined, regular expression functions for consumption in T-SQL using C#.
Using the Code
General Approach
My objective here is to wrap some of the more commonly used static
methods of the RegEx
class in the .NET Framework into something useable in a T-SQL environment. I felt the best approach was to develop this wrapper as a set of user defined functions that closely mirror the inputs and outputs of these methods in the RegEx
class.
Interface
All four of the functions listed in this article share the same first two parameters:
@Input NVARCHAR(MAX)
This is the string
to be analyzed. You may pass either a literal string
, or a column name. This is the string
on which the regular expression will be executed.
@Pattern NVARCHAR(MAX)
This is the regular expression which will be executed against the @Input
parameter.
In addition, all four functions share the same last parameter.
@IgnoreCase BIT
This is a boolean parameter which when set to "1
" will instruct the regular expression engine to ignore cases when executing the regular expression against the input string
. If this parameter is set to "0
", a case-sensitive analysis will be performed.
Functions
ufn_RegExIsMatch
The purpose of this function is to mirror the functionality of the Regex.IsMatch
method. In short, if the pattern specified in the @Pattern
parameter is found within the string
specified by @Input
, the return value will be "1
". Otherwise, the return value will be "0
".
ufn_RegExMatches
While ufn_RegExIsMatch
will tell you if there is a match of @Pattern
within @Input
, this function will tell you what the matches are, where they are located in the string
and how long each match is. This function wraps the functionality of the Regex.Matches
method in the .NET Framework. This function returns a table. The columns of this table are as follows:
Match NVARCHAR(MAX)
MatchIndex INT
MatchLength INT
ufn_RegExReplace
This function mirrors the functionality of Regex.Replace
in the .NET Framework and closely resembles the functionality of the REPLACE
function in T-SQL. The primary difference between ufn_RegExReplace
and REPLACE
is that the matching in REPLACE
is done solely on literal string
match comparison while ufn_RegExReplace
matches based on the regular expression specified in the @Pattern
parameter.
ufn_RegExReplace
also takes an additional parameter that none of the other functions in this library contain which is the @Replacement
parameter. This is another NVARCHAR(MAX)
parameter which specifies the literal string
used to replace matches identified in @Input
when @Pattern
is executed.
This function returns an NVARCHAR(MAX)
which represents the input string
with the specified replacements in place.
ufn_RegExSplit
Splitting a delimited string
into its elements is a common task among T-SQL developers. The ufn_RegExSplit
function does exactly that. This implementation uses the Regex.Split
method from the .NET Framework. ufn_RegExSplit
, much like the ufn_RegExMatches
function returns a table as its output. This table however has only one column which is of the data type NVARCHAR(MAX)
. The column name is Match
and it contains the elements of the string
split out by the delimiter specified in @Pattern
.
Note: I realize that for most implementations, simply delimiter splitting is sufficient and for such a situation, I would recommend a similar function that uses the String.Split
method rather than regular expressions. However, since the focus of this article is regular expressions in T-SQL, I decided to stick with the regular expression implementation.
SQL Server Implementation
Now that the assembly has been created, we need to implement it into SQL Server. The steps below outline the process.
Determine the Database where the Assembly will Reside
Assemblies are database level objects – not server level objects. You will need to select the database that will contain the assembly:
use AdventureWorks
GO
Enable CLR Integration
CLR integration is disabled by default. To enable it, you must run sp_configure
and set the "clr enabled
" property to 1
. You must then issue the "RECONFIGURE
" command for the setting to take effect. Otherwise, the setting will not take effect until SQL Server is restarted.
exec sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
Create the Assembly
Once you've enabled CLR integration, the next step is to "create the assembly" by importing it from the actual compiled DLL file. SQL Server stores the assembly as a stream of bytes inside of the database. Once the assembly has been imported using the "CREATE ASSEMBLY
" command, you no longer need the actual DLL file.
CREATE ASSEMBLY [SqlRegEx] FROM 'C:\SqlRegEx.dll' WITH PERMISSION_SET = SAFE
Create the Functions
Once the assembly has been created, you can create the actual functions. The syntax to do this is very similar to creating a T-SQL function. The primary difference is instead of writing out the function body, you simply specify "AS EXTERNAL NAME
" followed by [AssemblyName].[Namespace.Class].[Method] where [AssemblyName] is the name of the assembly in SQL Server and Namespace, Class and Method all refer to the namespace, class and method inside of the assembly.
CREATE FUNCTION [dbo].[ufn_RegExIsMatch]
(@Input NVARCHAR(MAX), @Pattern NVARCHAR(MAX), @IgnoreCase BIT)
RETURNS BIT
AS EXTERNAL NAME SqlRegEx.[SqlClrTools.SqlRegEx].RegExIsMatch
Call the Functions
At this point, the functions are loaded and ready to go. To call them, you simply use the same syntax that you would to call any other T-SQL function.
SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 1)
SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 0)
Points of Interest
PERMISSION_SET
When creating an assembly in SQL Server, you must specify the requested permission set for the assembly.
In most instances, you will select "SAFE
" as this article does. This means that you are not directly accessing system resources such as the disk, network, etc. This is the most restrictive and safest set of permissions.
If you require access to the disk or network or other resources you may need to use the "EXTERNAL_ACCESS
" permission set.
In the most rare of circumstances, you may need to use "UNSAFE
". This permission set grants everything that EXTERNAL_ACCESS
grants and also allows access to unmanaged code. Use this setting with great caution.
Unicode and NVARCHAR
All .NET string
s are handled in Unicode. As such, all string
data into and out of these functions is of the NVARCHAR
data type. I chose to use the "MAX
" length as we have no idea how large the string
s are that are being passed to these functions. Of course, you may wish to limit this in your own implementation.
Attributes
You may notice that each of the methods included in the SqlRegEx
class are decorated with certain attributes. The list below will outline the purpose of each.
DataAccess
The DataAccess
attribute signals to SQL Server whether this method will access any user data on the current instance of SQL Server. In the case of these functions, none of them do so directly, so the value for this attribute is false
for each.
IsDeterministic
To quote the SQL Server Books Online "Deterministic functions always return the same result any time they are called with a specific set of input values and given the same state of the database. Non-deterministic functions may return different results each time they are called with a specific set of input values even if the database state that they access remains the same." All of the functions listed in this article are deterministic and as such this property has been marked true
.
IsPrecise
The IsPrecise
attribute indicates if the result is precise or not. For example, a function calculating values using the FLOAT
data type would not be precise as the FLOAT
data type can lose information during calculations. The functions in this library are precise and as such, this attribute is marked as true
for each function.
Name
The Name
attribute contains a string
indicating what this function should be called when registered in SQL Server. This field is not required and has no practical value in SQL Server, but I have set it for each of these functions for my own personal documentation.
SystemDataAccess
The SystemDataAccess
attribute is much the same as the DataAccess
attribute with the exception that it refers to system data rather than user data.
FillRowMethodName
The FillRowMethodName
is only set for the functions that return tables. You will notice that for each function that returns a table, the actual return type in the .NET code is IEnumerable
. That means that whatever you return from this function must implement the IEnumerable
interface. The IEnumerable
interface allows the .NET Framework to loop through your results and for each result in your result set, it will call the function listed in the FillRowMethodName
attribute. If you look at the functions that are listed in the FillRowMethodName
attribute, you will see that they each take one object input parameter which represents the current element in the above mentioned loop. The rest of the parameters are output parameters and they will represent the columns of table that is output to SQL Server.
One thing that bothers me about this particular setup is that it does not appear that you can use System.Collections.Generic.IEnumerable<T>
for a return type from your function. Instead you must use System.Collections.IEnumerable
. This means that you must accept an object for the first parameter in your fill row method instead of being able to specify the actual type. This of course means boxing and un-boxing operations each time you create a new row. Since generics are available in .NET 2.0, I think it would be appropriate to allow them here.
Deterministic vs. Non-Deterministic
It is worth noting that if you plan to use either of the scalar functions listed in this article in a computed column and you wish to persist that computed column, the function must be marked as "Deterministic". You cannot persist a computed column that contains anything that is not deterministic. If you cannot persist your column, you will not be able to apply any indexes to it and it will be re-computed every time it is called. If your function is truly deterministic, you would want to make sure that it is not recomputed every time since by definition deterministic functions always return the same information, you would waste CPU cycles.
Summary
The CLR is a powerful new tool available to SQL Server developers. It opens up nearly the entire .NET Framework to SQL Server and if used appropriately, can bring tremendous power and value to any SQL Server application.