Introduction
A common problem when converting legacy database applications is that customers have a lot of data and they don't want to loose it. Data strings are usually encoded in ASCII /Multibyte, but our new system uses NVARCHAR
fields, so the problem is that we have to convert old data into Unicode. This is exactly what we'll do here.
I have to highlight that the approach I will present is 'oversized' for most cases: in a common scenario, you will deal with a single codepage; in that case, SQL Server built-in tools are enough. But there are situations that require a more advanced approach, such as when you have a legacy database that stores strings with different codepages.
As a complete example to show you capabilities of SQL-CLR integration, I've also decided to use a Win32 API to perform the conversion so that you can also see how to use P/Invoke from SQL. This can be useful if you have an old DLL and want to use it... but *beware* that it can be really dangerous... if you don't have full knowledge of what you're doing, you can keep down the entire SQL Server process !!!
Just a note: I do not provide a test project since the code I will show here is really simple, you can use copy & paste, and it's faster!
Solution
The solution I want to show is simple and powerful at the same time. It is made of two logical parts: first, we'll build the conversion routine as a standard C# function. Second, we'll integrate it into SQL Server as a function, so you can use with T-SQL.
Clearly, depending on your application scenario, this may not be the best approach, so once you have the conversion routine, you may choose to follow my approach or you may prefer to use it in an external application that performs a batch conversion, or you may think of something else.
Part #1: Create the conversion routine
I've used the following well known Win32 API:
int MultiByteToWideChar(
UINT CodePage,
DWORD dwFlags,
LPCSTR lpMultiByteStr,
int cbMultiByte,
LPWSTR lpWideCharStr,
int cchWideChar
);
Here, there's the C# signature that can be used for P/Invoking:
[DllImport("kernel32.dll")]
private static extern int MultiByteToWideChar(
uint CodePage,
uint dwFlags,
[MarshalAs(UnmanagedType.LPArray)] Byte[] lpMultiByteStr,
int cbMultiByte,
[Out, MarshalAs(UnmanagedType.LPArray)] Byte[] lpWideCharStr,
int cchWideChar);
Now it's quite easy to write a class that performs a conversion:
namespace ConvUtils {
public static class Unicode {
public static SqlString ConvToUnicode(SqlInt32 codepage , SqlString multibyteString) {
byte[] b = (byte[])iConvToMultibyteArray(multibyteString);
return (SqlString)ToUnicode((uint)(int)codepage, b);
}
private static string ToUnicode(uint codepage, Byte[] lpMultiByteStr) {
Byte[] lpWideCharStr = new Byte[2*lpMultiByteStr.Length];
MultiByteToWideChar(codepage, 0, lpMultiByteStr, lpMultiByteStr.Length,
lpWideCharStr, 2*lpMultiByteStr.Length);
return System.Text.Encoding.Unicode.GetString(lpWideCharStr);
}
private static SqlBinary iConvToMultibyteArray(SqlString multibyteString) {
byte[] result = multibyteString.GetUnicodeBytes();
return (SqlBinary)result;
}
}
}
The example is quite easy, and does not require any other explanation. The SQL types has been used because I will integrate it into SQL Server, but if you don't need it, you may replace them with strings and byte arrays.
I've also decided to let the codepage be a parameter since the codepage on your PC or on SQL Server can be different from the one needed for the conversion.
Part #2: SQL Server code
OK, now we have a C# code that converts ASCII/Multibyte into Unicode. The next step is to integrate it into SQL Server so that any database user can have access to this conversion routine:
First, create a DLL that can be hosted by SQL Server: all we have to do is to add the [SQLFunction]
attribute to ConvToUnicode
:
[SQLFunction]
public static SqlString ConvToUnicode( ...
Finally, build the DLL and integrate it into SQL Server with a script like the following:
use TESTDB
go
exec sp_configure "clr enabled", '1'
go
reconfigure
go
ALTER DATABASE TESTDB SET TRUSTWORTHY ON
go
begin try
CREATE ASSEMBLY [asmUni] FROM 'c:\project_output_dir\uniconv.dll'
with permission_set=UNSAFE
end try
begin catch
alter assembly [asmUni] FROM 'c:\project_output_dir\uniconv.dll' WITH UNCHECKED DATA
end catch
go
if exists(
select name from sys.objects where name =
'csConvToUnicode') drop function [dbo].[csConvToUnicode]
go
CREATE FUNCTION [dbo].[csConvToUnicode] (
@codepage int,
@multibytestr nvarchar(max)
) returns nvarchar(max)
AS EXTERNAL name [asmUni].[ConvUtils.Unicode].[ConvToUnicode]
go
That's all!
Now you can use this function like any other, for example, in a classic Select
statement, to create a View or to create a Trigger that automatically keeps your data updated.
Here is a final example of how we can use this function in a T-SQL statement (950 is the codepage for traditional Chinese):
select
description,
dbo.csConvToUnicode(950, description) as converted
from testtable
description converted
¨à µ£ºô¸ô¦w¥þ 兒童網路安全
°ê»Ú¸ê°T °T°T°T°T°T 國際資訊 訊訊訊訊訊
°ê»Ú¸ê°T °T°T°T°T°T 國際資訊 訊訊訊訊訊
a a
Ãô¨pÃ…v¬Fµ¦ éš±ç§æ¬Šæ"¿ç–
test c test c
Conclusion
I have shown a really simple but powerful way to leverage CLR integration provided by SQL Server to convert old ASCII/Multibyte data into Unicode. I hope that this is a good starting point for your personal solution.
History
- February 16, 2007 - Added more comments on introduction.
- February 8, 2007 - First version.