Introduction
In the course of developing a small application, which I expect to expand in the next few years, I needed to compute and store some MD5 digest strings in a SQL Server 2000 database. When I went in search of working code, so that I wouldn't need to reinvent a wheel that I felt confident existed, I found a tutorial on Extended Stored Procedures for MS SQL Server v7.0, which almost met my needs. Having made the effort to upgrade the code to work with SQL Server 2000, and to support Unicode text in and out, I decided that I should return the favor by publishing the update here.
Background
Although you could put the MD5 digest code in one of the higher layers, or even (perish the thought!) implement the algorithm as a conventional User Defined Function, a really robust implementation, capable of efficiently processing long strings, binary objects such as images, and even whole files, is very CPU intensive. Such applications beg for the speed of a well written C program. Since I already had a stable, proven implementation, written in straight ANSI C, I thought it would be wasteful to attempt to port it to T-SQL. Even if it could be done, the result would be slow and ugly, and would severely limit the scalability of any application that used it.
Extended Stored Procedures were intended to solve exactly this kind of problem, because they let you run CPU bound algorithms at the lowest architectural layer of the database - tables and views - by integrating them into SQL Server, itself.
Extended Stored Procedures communicate with SQL Server through the Open Data Services API, which uses a series of callbacks into the running SQL Server service to identify and fetch parameters required as inputs to the function, to return results, and to report errors such as missing or invalid parameters.
The ODS interface is very flexible. You can pass just about anything into or out of your function, including entire tables.
Using the code
Extended Stored Procedures are implemented as standard Windows DLLs, which SQL Server loads as needed, via LoadLibrary
. This DLL exports two functions, xp_md5
and xp_md5W
. Another function, __GetXpVersion
, is recommended to be incorporated, to be used by the SQL Server service to retrieve the library version for inclusion in error messages.
xp_md5
takes a text, or binary variable, or column value as input, and returns its MD5 digest as a 32 character hexadecimal string. This function treats text as ANSI, and returns the digest as a string of ANSI characters. xp_md5W
takes a text, or binary variable, or column value as input, and returns its MD5 digest as a 32 character hexadecimal string. This function treats text as Unicode, and returns the digest as a string of Unicode characters.
Other than native Unicode support, as opposed to ANSI, the two functions are identical. Both take one required argument, and two optional arguments, as detailed in the following table:
Name | Type | In or Out | Required | Notes |
Plaintext | CHAR , VARCHAR , TEXT , or BINARY | In | Yes | 1 |
Plaintext Length | Long Integer | In | No | 2 |
Digest | CHAR (32) or NCHAR (32) | Out | Yes | 3 |
Notes
- Essentially, any text, including fixed and variable length character strings, image, and binary formats are permitted. If it can fit in a SQL Server variable, you can digest it. Plaintext is the technical term for the text fed to any cryptographic algorithm.
- Regardless of what type of data you pass into the function, the ODS library reports the length in bytes. This makes processing binary objects and Unicode text trivially easy, since the length can be passed to the core MD5 algorithm. Since the MD5 digest algorithm processes bytes, it is unaffected by embedded nulls in binary data and Unicode strings composed entirely of ASCII characters. This also means that you can pass a length of -1, which tells the Extended Stored Procedure to use the length reported by the library for the first argument.
- MD5 digests are always 16 bytes in length, and they always convert to a 32 character hexadecimal string. They can be stored in
CHAR (32)
columns (for ANSI) or NCHAR (32)
columns (for Unicode). If you omit the third argument, the function returns a table of one row and one column, containing the MD5 digest.
Before you can use an Extended Stored Procedure, it must be installed into SQL Server, and registered in the Master database.
- Copy the library, xp_md5.dll, into your Microsoft SQL Server binaries directory. For a default installation of Microsoft SQL Server, this will be C:\Program Files\Microsoft SQL Server\MSSQL\Binn\.
- Adapt the T-SQL script xm_md5.sql, which is included in the \Reference directory of the download, open it into SQL Server Query Analyzer, ensure that the current database is Master, and execute it.
- Adapt the T-SQL scripts ww_md5.sql and ww_md5W.sql, which are also in the \Reference directory, and install them into either the Master database, or into the individual databases that need the MD5 digest functions.
- Use the functions
[dbo].[ww_md5]
and dbo].[ww_md5W]
just as you would any built-in function or regular User Defined Function. You can use them in computed columns, views, queries, and other T-SQL scripts.
The following simple T-SQL script computes the MD5 digest of the Unicode string "Hello, World!", and returns the result as a single celled table.
declare @data nchar (32)
declare @text nvarchar (255)
SELECT [MyPlaceMassage].[dbo].[ww_md5W]( 'Hello, world!' )
The above sample is in Unicode_Hello.sql, which is included in the \Reference directory. Connect to your database in Query Analyzer, open the file, and press F5.
A companion script, ANSI_Hello.sql, computes the MD5 digest of the same string, rendered as ANSI text. If you run both, you will see that they produce different strings.
ANSI string 'Hello, world!' |
6cd3556deb0da54bca060b4c39479839
|
Unicode string 'Hello, world!' |
d227f77fc6c5c3969a0af57ce1789144
|
Points of interest
Unlike most C functions, the prototype of an Extended Stored Procedure, as it might appear in a C header file, is not especially useful to a working programmer. This means that you must find another way to document the arguments. I chose to export the T-SQL scripts that I used to install the companion User Defined Functions that make these two Extended SPs available to the database. For a more complex procedure, these UDFs would be an ideal place to house the documentation, since they are readily accessible from both the Query Analyzer and the Enterprise Manager.
I dispensed with a header altogether, and intended to dispense with a .LIB file, since you would never link to this DLL from a regular C or C++ program. If this library had one, it would list identical signatures for every exported function, because these are dictated by the ODS API. This is because the ODS library, opends60.dll, calls all Extended Stored Procedures on behalf of user code.
Your function calls back into opends60.dll, using methods similar to the ones used by COM and other interfaces that support late binding. As is usually the case with such interfaces, you must copy data supplied by the interface into your own working storage before you can pass it around. Hence, the outermost layer of any Extended Stored Procedure, and the only aspect of it that differs from ordinary C code, is the code that fetches parameters and returns data to the interface.
Just because a caller is supposed to give us at least one, and up to three, parameters, doesn't mean that we can assume that he did. Therefore, the first task is to get an argument count from the interface and test it.
int nArgs = srv_rpcparams ( pSrvProc ) ;
if ( nArgs > 0 )
Since all data binding is late, and both procedures accept arbitrarily large chunks of data, the next task is to call the library with null pointers, to get the size and type of data to be processed.
srv_paraminfo (
pSrvProc ,
PARAM_ORDINAL_1 ,
&cType ,
&uMaxLen ,
&uLen ,
NULL ,
&fNull ) ;
if ( IsValidPlaintextType ( cType ) == FALSE )
{
srv_sendmsg (
pSrvProc ,
SRV_MSG_ERROR ,
XP_MD5_ARG1_INV_TYPE ,
SRV_INFO ,
XP_STATE_1 ,
NULL ,
0 ,
( DBUSMALLINT ) __LINE__ ,
"Extended Stored Procedure xp_md5: Parameter 1 "
"must be a Character, Unicode Character, or binary type." ,
SRV_NULLTERM ) ;
srv_senddone (
pSrvProc ,
SRV_DONE_ERROR ,
0 ,
0 ) ;
return FAIL ;
}
The function IsValidPlaintextType
, a conventional C function, evaluates the type of data being offered by the interface.
BOOL _cdecl IsValidPlaintextType ( BYTE pbytType )
{
if ( pbytType == SRVVARCHAR )
return TRUE ;
if ( pbytType == SRVCHAR )
return TRUE ;
if ( pbytType == SRVTEXT )
return TRUE ;
if ( pbytType == SRVNTEXT )
return TRUE ;
if ( pbytType == SRVBIGCHAR )
return TRUE ;
if ( pbytType == SRVNVARCHAR )
return TRUE ;
if ( pbytType == SRVNCHAR )
return TRUE ;
if ( pbytType == SRVBINARY )
return TRUE ;
if ( pbytType == SRVIMAGE )
return TRUE ;
if ( pbytType == SRVBIGVARBINARY )
return TRUE ;
if ( pbytType == SRVBIGVARCHAR )
return TRUE ;
if ( pbytType == SRVBIGBINARY )
return TRUE ;
return FALSE ;
}
The symbolic constants used by this function are defined in srv.h. If you build this project from scratch, be sure that you move C:\Program Files\Microsoft SQL Server\80\Tools\DevTools\Include to someplace near the top of your list of include paths. Otherwise, you may get the older version of srv.h that comes with Visual Studio 6, which is appropriate for SQL Server 7.0, but lacks many of the newer interfaces and constants used in this procedure, which is intended for use with SQL Server 2000.
I copied the constants into an Excel workbook, srv_constants.xls, which I included in the \Reference directory, because it contains the constant values expressed as hexadecimal (from the header), decimal, and even binary format. The intent for the binary version is to identify bit patterns that I might be able to use to simplify IsValidPlaintextType
and make it a tad faster. Alas, I didn't see any.
Having identified the type of input data, and found it acceptable, the next step is to allocate a buffer for the data. This is accomplished by calling malloc
, followed by SecureZeroMemory
, as follows:
psize = ( uLen + TRAILING_NULL_ALLOWANCE ) ; pData = ( BYTE* ) malloc ( psize ) ; SecureZeroMemory ( pData , psize ) ;
Alternatively, you could call HeapAlloc
and specify HEAP_ZERO_MEMORY
for its dwFlags
argument. However, since the original version from which this was derived, and all of my other MD5 wrapper functions perform their allocations in two steps, as shown here, I elected to be consistent with them. Note the use of SecureZeroMemory
, rather than ZeroMemory
or memset
, because, unlike the other two, SecureZeroMemory
is written in a way that prevents the optimizer from optimizing it away. This is important from a security perspective, because all memory used by cryptographic functions must be cleared before and after each use, to prevent accidental information disclosure.
Now that we have a buffer big enough to hold the data, we call srv_paraminfo
again, passing a pointer to our nice, empty buffer, and its size, in bytes.
srv_paraminfo (
pSrvProc ,
PARAM_ORDINAL_1 ,
&cType ,
&uMaxLen ,
&uLen ,
pData ,
&fNull ) ;
Next, we call again, expecting an integer, which we ignore unless it is greater than zero.
if ( nArgs > 1 )
{
LONG nInputLen ;
srv_paraminfo (
pSrvProc ,
PARAM_ORDINAL_2 ,
&cType ,
&uMaxLen ,
&uLen ,
( BYTE* ) &nInputLen ,
&fNull ) ;
if ( IsValidPlaintextLen ( cType ) )
{
if ( nInputLen >= 0 )
{
uDataLen = ( ULONG ) nInputLen ;
}
}
else
IsValidPlaintextLen
is similar to IsValidPlaintextType
, but it's looking for integers.
At last, it's time to digest the text.
MD5String ( pData , uDataLen , szHash ) ;
szHash
is an array of 33 characters (32 for the digest, plus the obligatory trailing null, to make it a valid C string). It gets returned to the caller.
Since we are finished with pData
, and its contents are sensitive, we call SecureZeroMemory
to clear it, then free
to discard it.
SecureZeroMemory ( pData , psize ) ;
free ( pData ) ;
Contrary to the acquisition of pData
, there are no shortcuts to clearing and freeing it, because HeapFree
doesn't have an option to tell Windows to clear the memory before it is released.
That being the case, and since we got it from malloc
, we may as well call free
.
Regardless, we'll be using HeapFree
, either directly or indirectly, because free
does so under the covers.
If the argument count is 3, symbolized by FILL_CALLERS_BUFFER
, we call srv_paramsetoutput
to return a pointer to the hash.
if ( nArgs > FILL_CALLERS_BUFFER )
{
if ( srv_paramsetoutput (
pSrvProc ,
PARAM_ORDINAL_3 ,
( BYTE* ) szHash ,
HEX_CHAR_HASH_SIZE ,
FALSE ) == FAIL )
Note that szHash
is cast to a pointer to BYTE
, even though it is actually an array of type unsigned char
. It doesn't matter what type of data you return, the pointer is always a pointer to bytes, and the length is always specified in bytes.
If the caller doesn't pass a pointer for an out
parameter, we must call srv_describe
once, followed by a call to srv_sendrow
, to return our digest, as a table of one row and one column.
{ if ( srv_describe (
pSrvProc ,
1 ,
"MD5" ,
SRV_NULLTERM ,
SRVNCHAR ,
HEX_CHAR_HASH_SIZE ,
SRVNCHAR ,
HEX_CHAR_HASH_SIZE ,
( void* ) szHash ) == SRV_DESCRIBE_ERROR )
{
srv_sendmsg (
pSrvProc ,
SRV_MSG_ERROR ,
XP_MD5_SENDCOL_ERROR ,
SRV_INFO ,
XP_STATE_1 ,
NULL ,
0 ,
( DBUSMALLINT ) __LINE__ ,
"Extended Stored Procedure xp_md5W: An error was encountered "
"while returning the message digest as column 1 of row 1." ,
SRV_NULLTERM ) ;
srv_senddone (
pSrvProc ,
SRV_DONE_ERROR ,
0 ,
0 ) ;
return FAIL ;
}
if ( srv_sendrow ( pSrvProc ) == FAIL )
Significantly, both srv_paramsetoutput
and srv_describe
expect a value of 32, the number of characters in the digest represented as HEX_CHAR_HASH_SIZE
, for the size of the data block. Thus, it seems that it might be possible, though unwise, to allocate room for only 32 characters for the message digest, and dispense with the trailing null.
Unicode
There is almost no difference between the ANSI and Unicode versions of this stored procedure. However, there are two, and both are critical.
- The digest array,
szHash
, is of type wchar_t
. - The data lengths, which must be computed twice, are
HEX_CHAR_HASH_SIZE * sizeof ( wchar_t )
.
The data length issue tripped me up. The first couple of times I ran the Unicode function, I got back only half of the digest. Oops!
More often than not, the Visual C++ compiler does an amazingly good job of covering for you, but when it fails, the result is almost certain to be a buffer overflow. Not only did I leave half of my digest on the heap, but I left the heap in a corrupted state. Fortunately, the debugger brought that to my attention with a very clear error message. I was lucky; there have been plenty of instances in which I got no such warning, even in debug mode.
History
- Monday, 27 October 2008 is the first publication on The Code Project.