Implementation of Unicode in SQL Server

josekonoor

3.63/5 (4 votes)

25 Oct 2007CPOL7 min read

Information about Unicode implementation in SQL Server.

In the Beginning, There was ASCIII

On both Windows/DOS and UNIX systems, the 128 most commonly-used characters are each represented by a sequence of 7 bits known as the character's ASCII code. They are traditionally stored as bytes (8 bits), i.e., the 7-bit ASCII code plus a leading zero. The characters include letters, digits, punctuation marks, and nonprintable control characters such as the backspace, tab, carriage return, etc.

The Need for a Universal Character Coding System

Traditional character sets (like the ANSI alphabet) base on 8 bit characters called a byte. A single byte can represent up to 256 different values and, thus, characters. This is good enough to represent Western scripts such as those used in the English, French, or German language. However, if it comes to more complex languages like Japanese or Korean, 256 different characters are simply insufficient.

So, users of these languages have developed the so-called double byte character sets, called DBCS. In DBCS, each character is represented by one or more bytes. Character encoding specifies how to interpret the byte values and whether or not a byte is a single character or just part of a larger set of bytes representing a multi-byte character.

Unfortunately, there are many different DBCS encodings for a given language. To make matters worse, different Operating Systems and different programming languages tend to use different DBCS encodings. Also, programming is relatively complex because of byte parsing operations.

Unicode's goal is to solve this issue by using more than one byte for each character. So with Unicode, all different characters can be stored in one string. As all characters have a fixed width, programming complexity is greatly reduced.

...and There Came UNICODE

Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems. Unicode consists of a repertoire of about 100,000 characters. The Unicode Consortium is based in California and the members mainly include computer software and hardware companies with any interest in text-processing standards, such as Adobe Systems, Apple, HP, IBM, Microsoft, Xerox, and many others.

Implementation of Unicode as UCS and UTF

Several mechanisms have been specified for implementing Unicode. Which one implementers choose depends on the available storage space, source code compatibility, and interoperability with other systems.

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed "code values". The Universal Character Set (UCS-2) format is a fixed-length character string where each character is represented by 2 bytes.

UTF encodings include:

UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode Standard but rather an RFC)
UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
UTF-EBCDIC — an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC (not part of The Unicode Standard).
UTF-16 — a 16-bit, variable-width encoding.
UTF-32 — a 32-bit, fixed-width encoding.

SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data.

SQL Server Supports Only UCS-2 Unicode

In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns.

Collations

The physical storage of character strings in SQL Server is controlled by collations. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared. SQL Server supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations can be specified down to the level of columns. Each column in a table can be assigned different collations.

Microsoft SQL Server collation settings depend on the type of installation. In general, choose a SQL Server collation that supports the Microsoft Windows system locale most commonly used at your site. To identify your Windows system locale on computers running Microsoft Windows 2000, click Regional Settings in Control Panel, and then click the General tab if necessary to display the current system locale.

In most cases, a computer runs the Windows system locale that matches the language requirements of the user, so Microsoft SQL Server Setup automatically detects the Windows system locale and chooses the appropriate collation.

Unicode Support of SQL Server

One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters.

The Unicode specification addresses this problem by using two bytes to encode each character. There are enough different patterns (65,536) in 2 bytes for a single specification covering the most common business languages. Because all Unicode systems consistently use the same bit patterns to represent all characters, there is no problem with characters being converted incorrectly when moving from one system to another. You can minimize character conversion issues by using Unicode data types throughout your system.

Unicode support was introduced to SQL Server starting with SQL Server 7.0. Microsoft SQL Server allows the storage of Unicode data only in UCS-2 encoding.

Data Types

In Microsoft SQL Server, these data types support Unicode data:

nchar
nvarchar
ntext

Note: The n prefix for these data types comes from the SQL-92 standard for National (Unicode) data types. Use of nchar, nvarchar, and ntext is the same as char, varchar, and text, respectively, except that:

Unicode supports a wider range of characters
More space is needed to store Unicode characters
The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar
Unicode constants are specified with a leading N: N'A Unicode string'
All Unicode data uses the same Unicode code page. Collations do not control the code page used for Unicode columns, only attributes such as comparison rules and case sensitivity

UNICODE Function

It returns the integer value, as defined by the Unicode standard, for the first character of the input expression. The Unicode function has the following syntax:

UNICODE ( 'ncharacter_expression' )

'ncharacter_expression' is an nchar or nvarchar expression and the return type is int.

How to Store UTF-8 Data in SQL Server

The following options are available to store UTF-8 encoded data in SQL Server.

Translate to and from UCS-2/UTF-8

Sample code for this type of conversion is located at the Unicode Consortium's site. A high-level description of the algorithm to convert UCS-2 to UTF-8 can be found in the Internet Request For Comments document RFC2279.

On Windows NT or Windows 2000, you may use the Win32 functions MultiByteToWideChar and WideCharToMultiByte to convert UTF-8 to and from UCS-2 by passing the constant CP_UTF8 (65001) as the first parameter to the functions.

Use BINARY/VARBINARY/IMAGE Data Types

Store the actual UTF-8 data on the server using the BINARY/VARBINARY/IMAGE columns. Storing UTF-8 data on SQL Server means that you can not use SQL Server to sort or find ranges of these values as if the data were valid character data. The types of operations on columns containing UTF-8 data that would not return expected results include ORDER BY, greater-than ">" and less-than "<" comparisons, and the built-in SQL Server string manipulation functions such as SUBSTRING().

However, equality comparisons will work as long as the strings being compared are equivalent at a byte level. Note that if you store UTF-8 data in SQL Server, you should not use character columns (CHAR/NCHAR/VARCHAR and so forth). UTF-8 is not valid character data to SQL Server, and by storing non-character data in character columns, you risk encountering problems such as the issues discussed in the following Microsoft Knowledge Base articles:

155723 INF: SQL Server Truncation of a DBCS String
234748 PRB: SQL Server ODBC Driver Converts Language Events to Unicode

If you are considering this option, keep in mind that if you ever need to access UTF-8 data stored inside SQL Server from any application other than a Web browser (for example, from a non-Web-based ODBC application), you will need to do a conversion from UTF-8 to UCS-2 within this application. This is because ODBC, OLEDB, COM, Win32 API calls, VB, and C runtime string manipulation functions do not work with UTF-8 data. This moves the burden of translation to a different application.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)