(untagged)

xp_regex: Regular Expressions in SQL Server 2000

Dan Farino

0.00/5 (No votes)

6 Oct 2003

This is a SQL Server 2000 Extended Stored Procedure writted in Managed C++. It allows you to use regular expressions in T-SQL.

Download source - 109 Kb

Introduction

Regular Expressions are probably the best tool for text parsing. Using Regular Expressions GREATLY simplifies anything for which you would normally use CHARINDEX, PATINDEX, SUBSTRING, STUFF, REPLACE, etc.

xp_regex is an Extended Stored Procedure for SQL Server 2000 written in Managed C++ that lets you use Regular Expressions from T-SQL. In case you�re skeptical about the performance when mixing non-.NET code (in this case SQL Server) and .NET code in the same process, don't be. I've written a non-.NET version and the performance is about the same (although I really haven't done any exhaustive testing). If you have other problems that are preventing you from using the .NET Framework on your SQL Server, you can get the non-.NET version, "http://www.codeproject.com/database/xp_pcre.asp">xp_pcre, which is also posted on Code Project.

Note to users of .NET 1.0: This DLL will not work on SQL Servers that are running with the /3GB switch in the boot.ini file. You'll probably get an error like:

Cannot load the DLL xp_regex.dll, or one of the DLLs it references.
Reason: 1114(A dynamic link library (DLL) initialization routine failed.).

This limitation has been removed in .NET 1.1.

Also note: There are some issues associated with running managed code in-process with SQL Server. Check out Clemens Vasters' blog entry here. Also, check out this MSKB article which discusses extended stored procedures written using .NET.

I can tell you that I have been running this on several servers since I wrote it and have not had any problems. The servers I run it on are used more for reporting than OLTP-style transactions. I'm not pushing either the CPU or RAM to the limit and am not using fibers. This may account for the fact that I have not encountered any issues.

Overview

There are four Extended Stored Procedures in the DLL:

xp_regex_match
xp_regex_format
xp_regex_split
xp_regex_replace

XP_REGEX_MATCH

Syntax:

xp_regex_match @input, @regex, @result OUTPUT

@input is the text to match against.
@regex is the regular expression to match with.
@result is an output parameter that will hold either 'Matched', 'Failed' or NULL in the case of an invalid regex.

All parameters are either (N)VARCHAR or (N)CHAR of any ength. ((N)TEXT might work too, haven't tried it.)

xp_regex_match is used to check an input string against a regular expression. It will return either 'Matched' or 'Failed'. If the regular expression was not valid, it will return NULL.

XP_REGEX_FORMAT

Syntax:

xp_regex_format @input, @regex, @format_string, @result OUTPUT

@input is the text to parse.
@regex is the regular expression to match.
@format_string is used to format the results.
@result is an output parameter that will hold the formatted results.

xp_regex_format is used to parse an input string and format the results. Probably the best example to demonstrate is by doing some telephone number parsing. I�ll assume you know the Regular Expression syntax for this documentation.

The regex [^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}) will parse just about any phone-number-like string you throw at it. For instance, this code:

DECLARE @out VARCHAR(50)

EXEC xp_regex_format '(310)555-1212', '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
    '($1) $2-$3', @out OUTPUT
PRINT @out

EXEC xp_regex_format '310.555.1212', '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
    '($1) $2-$3', @out OUTPUT
PRINT @out

EXEC xp_regex_format ' 310!555 hey! 1212',
    '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
    '($1) $2-$3', @out OUTPUT
PRINT @out

EXEC xp_regex_format ' hello, ( 310 ) 555.1212 is my phone number. Thank you.',
    '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})', '($1) $2-$3', @out OUTPUT
PRINT @out

prints out:

(310) 555-1212
(310) 555-1212
(310) 555-1212
(310) 555-1212

XP_REGEX_SPLIT

Syntax:

xp_regex_split @input, @regex, @column_number, @result OUTPUT

@input is the text to parse.
@regex is a regular expression that matches the delimiter.
@column_number: We�re basically doing a "text-to-columns" here, so @column_number lets you specify which of the resulting columns should be passed back in the @result parameter.
@result is an output parameter that will hold the formatted results

@input, @regex and @result are either (N)VARCHAR or (N)CHAR of any length.

@column_number is an INT. Columns are numbered starting at 1.

This function splits text data on some sort of delimiter (comma, pipe, whatever). The cool thing about a split using regular expressions is that the delimiter does not have to be as consistent as you would normally expect.

For example, take this line as your source data:

one ,two|three : four

In this case, our delimiter is either a comma, pipe or colon with any number of spaces either before or after (or both). In regex form, that is written: \s*[,|:]\s*.

For example:

DECLARE @out VARCHAR(8000)

EXEC xp_regex_split 'one  ,two|three  : four', '\s*[,|:]\s*', 1, @out OUTPUT
PRINT @out

EXEC xp_regex_split 'one  ,two|three  : four', '\s*[,|:]\s*', 2, @out OUTPUT
PRINT @out

EXEC xp_regex_split 'one  ,two|three  : four', '\s*[,|:]\s*', 3, @out OUTPUT
PRINT @out

EXEC xp_regex_split 'one  ,two|three  : four', '\s*[,|:]\s*', 4, @out OUTPUT
PRINT @out

prints out:

one  
two
three  
four

Please note that in the case above, there is no performance penalty for running the same split more than once. xp_regex_split caches both the input string and the regular expression, so calling it multiple times while changing only the @column_number parameter is perfectly fine. The actual split is only done the first time and the rest is pulled from cache.

XP_REGEX_REPLACE

Syntax:

xp_regex_replace @input, @regex, @replacement, @result OUTPUT

@input is the text to parse.
@regex is a regular expression.
@replacement will replace every piece of text that matches the regex
@result is an output parameter that will hold the results.

For example, this is how you would remove all whitespace from an input string:

DECLARE @out VARCHAR(8000)
EXEC xp_regex_replace ' one  two    three four ', '\s*', '', @out OUTPUT
PRINT '[' + @out + ']'

prints out:

[onetwothreefour]

To replace all numbers (regardless of length) with "###":

DECLARE @out VARCHAR(8000)
EXEC xp_regex_replace '12345 is less than 99999, but not 1, 12, or 123', 
                      '\d+', '###', @out OUTPUT
PRINT @out

prints out:

### is less than ###, but not ###, ###, or ###

FN_XP_REGEX_MATCH, FN_XP_REGEX_FORMAT, FN_XP_REGEX_SPLIT and FN_XP_REGEX_REPLACE

These are user-defined functions that wrap the stored procedures. This way you can use the function as part of a SELECT list or a WHERE clause:

USE pubs
GO

SELECT dbo.fn_regex_format(
  phone_number,
  '[^\d]*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})',
  '$1 $2 $3'
  )

FROM
  authors

This would format every phone number in the "authors" table.

Please note that FN_XP_REGEX_MATCH, returns a 1 or 0 instead of 'Matched' or 'Failed' (like XP_REGEX_MATCH does). You can certainly change this back in the INSTALL.SQL code. Also note, you'll need to create the UDFs in each database that you call them from.

Installation

Copy xp_regex.dll to your Program Files\Microsoft SQL Server\MSSQL\bin folder.
Run the SQL script INSTALL.SQL. This will register the procedures and create the user-defined functions.

Performance Counters

I've created a set of Performance Counters (the Performance Object is called "xp_regex"). You can use the Windows 2000 System Monitor to see various statistics about xp_regex while it is running.

If you want to avoid the small overhead of the performance counters, just comment out this line in PerfCounters.h and rebuild:

#define XP_REGEX_USING_PERF_COUNTERS

Misc

Comments/corrections/additions are welcome. Please let me know if you find this useful! Thanks!

History

06 Oct 03 - Included a version of xp_regex.dll that was compiled without the performance counters. Added the warning about running managed code in-process with SQL Server.
19 Jul 03 - Complete refactoring to a more object-oriented design. Also added support for Unicode parameters. Changed the OUTPUT parameter to Unicode. Greatly improved parameter checking and error handling.
07 May 03 - Rebuilt using Visual Studio .NET 2003 and .NET Framework 1.1
29 Mar 03 - Updated download

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here