Introduction
Regular Expressions are probably the best tool for text parsing. Using Regular Expressions GREATLY simplifies anything for which you would normally use CHARINDEX
, PATINDEX
, SUBSTRING
, STUFF
, REPLACE
, etc.
xp_regex
is an Extended Stored Procedure for SQL Server 2000 written in Managed C++ that lets you use Regular Expressions from T-SQL. In case you�re skeptical about the performance when mixing non-.NET code (in this case SQL Server) and .NET code in the same process, don't be. I've written a non-.NET version and the performance is about the same (although I really haven't done any exhaustive testing). If you have other problems that are preventing you from using the .NET Framework on your SQL Server, you can get the non-.NET version, "http://www.codeproject.com/database/xp_pcre.asp">xp_pcre
, which is also posted on Code Project.
Note to users of .NET 1.0: This DLL will not work on SQL Servers that are running with the /3GB switch in the boot.ini file. You'll probably get an error like:
Cannot load the DLL xp_regex.dll, or one of the DLLs it references.
Reason: 1114(A dynamic link library (DLL) initialization routine failed.).
This limitation has been removed in .NET 1.1.
Also note: There are some issues associated with running managed code in-process with SQL Server. Check out Clemens Vasters' blog entry here. Also, check out this MSKB article which discusses extended stored procedures written using .NET.
I can tell you that I have been running this on several servers since I wrote it and have not had any problems. The servers I run it on are used more for reporting than OLTP-style transactions. I'm not pushing either the CPU or RAM to the limit and am not using fibers. This may account for the fact that I have not encountered any issues.
Overview
There are four Extended Stored Procedures in the DLL:
xp_regex_match
xp_regex_format
xp_regex_split
xp_regex_replace
XP_REGEX_MATCH
Syntax:
xp_regex_match @input, @regex, @result OUTPUT
@input
is the text to match against.
@regex
is the regular expression to match with.
@result
is an output parameter that will hold either 'Matched', 'Failed' or NULL
in the case of an invalid regex.
All parameters are either (N)VARCHAR
or (N)CHAR
of any ength. ((N)TEXT
might work too, haven't tried it.)
xp_regex_match
is used to check an input string against a regular expression. It will return either 'Matched' or 'Failed'. If the regular expression was not valid, it will return NULL
.
XP_REGEX_FORMAT
Syntax:
xp_regex_format @input, @regex, @format_string, @result OUTPUT
@input
is the text to parse.
@regex
is the regular expression to match.
@format_string
is used to format the results.
@result
is an output parameter that will hold the formatted results.
xp_regex_format
is used to parse an input string and format the results. Probably the best example to demonstrate is by doing some telephone number parsing. I�ll assume you know the Regular Expression syntax for this documentation.
The regex [^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})
will parse just about any phone-number-like string you throw at it. For instance, this code:
DECLARE @out VARCHAR(50)
EXEC xp_regex_format '(310)555-1212', '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
'($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format '310.555.1212', '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
'($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' 310!555 hey! 1212',
'[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
'($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' hello, ( 310 ) 555.1212 is my phone number. Thank you.',
'[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})', '($1) $2-$3', @out OUTPUT
PRINT @out
prints out:
(310) 555-1212
(310) 555-1212
(310) 555-1212
(310) 555-1212
XP_REGEX_SPLIT
Syntax:
xp_regex_split @input, @regex, @column_number, @result OUTPUT
@input
is the text to parse.
@regex
is a regular expression that matches the delimiter.
@column_number
: We�re basically doing a "text-to-columns" here, so
@column_number
lets you specify which of the resulting columns should be passed back in the
@result
parameter.
@result
is an output parameter that will hold the formatted results
@input
, @regex
and @result
are either (N)VARCHAR
or (N)CHAR
of any length.
@column_number
is an INT
. Columns are numbered starting at 1.
This function splits text data on some sort of delimiter (comma, pipe, whatever). The cool thing about a split using regular expressions is that the delimiter does not have to be as consistent as you would normally expect.
For example, take this line as your source data:
one ,two|three : four
In this case, our delimiter is either a comma, pipe or colon with any number of spaces either before or after (or both). In regex form, that is written: \s*[,|:]\s*
.
For example:
DECLARE @out VARCHAR(8000)
EXEC xp_regex_split 'one ,two|three : four', '\s*[,|:]\s*', 1, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', '\s*[,|:]\s*', 2, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', '\s*[,|:]\s*', 3, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', '\s*[,|:]\s*', 4, @out OUTPUT
PRINT @out
prints out:
one
two
three
four
Please note that in the case above, there is no performance penalty for running the same split more than once. xp_regex_split
caches both the input string and the regular expression, so calling it multiple times while changing only the @column_number
parameter is perfectly fine. The actual split is only done the first time and the rest is pulled from cache.
XP_REGEX_REPLACE
Syntax:
xp_regex_replace @input, @regex, @replacement, @result OUTPUT
@input
is the text to parse.
@regex
is a regular expression.
@replacement
will replace every piece of text that matches the regex
@result
is an output parameter that will hold the results.
For example, this is how you would remove all whitespace from an input string:
DECLARE @out VARCHAR(8000)
EXEC xp_regex_replace ' one two three four ', '\s*', '', @out OUTPUT
PRINT '[' + @out + ']'
prints out:
[onetwothreefour]
To replace all numbers (regardless of length) with "###":
DECLARE @out VARCHAR(8000)
EXEC xp_regex_replace '12345 is less than 99999, but not 1, 12, or 123',
'\d+', '###', @out OUTPUT
PRINT @out
prints out:
### is less than ###, but not ###, ###, or ###
FN_XP_REGEX_MATCH, FN_XP_REGEX_FORMAT, FN_XP_REGEX_SPLIT and FN_XP_REGEX_REPLACE
These are user-defined functions that wrap the stored procedures. This way you can use the function as part of a SELECT
list or a WHERE
clause:
USE pubs
GO
SELECT dbo.fn_regex_format(
phone_number,
'[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
'$1 $2 $3'
)
FROM
authors
This would format every phone number in the "authors" table.
Please note that FN_XP_REGEX_MATCH
, returns a 1 or 0 instead of 'Matched' or 'Failed' (like XP_REGEX_MATCH
does). You can certainly change this back in the INSTALL.SQL code. Also note, you'll need to create the UDFs in each database that you call them from.
Installation
- Copy xp_regex.dll to your Program Files\Microsoft SQL Server\MSSQL\bin folder.
- Run the SQL script INSTALL.SQL. This will register the procedures and create the user-defined functions.
Performance Counters
I've created a set of Performance Counters (the Performance Object is called "xp_regex
"). You can use the Windows 2000 System Monitor to see various statistics about xp_regex
while it is running.
If you want to avoid the small overhead of the performance counters, just comment out this line in PerfCounters.h and rebuild:
#define XP_REGEX_USING_PERF_COUNTERS
Misc
Comments/corrections/additions are welcome. Please let me know if you find this useful! Thanks!
History
- 06 Oct 03 - Included a version of xp_regex.dll that was compiled without the performance counters. Added the warning about running managed code in-process with SQL Server.
- 19 Jul 03 - Complete refactoring to a more object-oriented design. Also added support for Unicode parameters. Changed the OUTPUT parameter to Unicode. Greatly improved parameter checking and error handling.
- 07 May 03 - Rebuilt using Visual Studio .NET 2003 and .NET Framework 1.1
- 29 Mar 03 - Updated download