In the Beginning, There was ASCIII
On both Windows/DOS and UNIX systems, the 128 most commonly-used characters are each represented by a sequence of 7 bits known as the character's ASCII code.
They are traditionally stored as bytes (8 bits), i.e., the 7-bit ASCII code plus a leading zero. The characters include letters, digits, punctuation marks, and nonprintable
control characters such as the backspace, tab, carriage return, etc.
The Need for a Universal Character Coding System
Traditional character sets (like the ANSI alphabet) base on 8 bit characters called a byte. A single byte can represent up to 256 different values and, thus, characters.
This is good enough to represent Western scripts such as those used in the English, French, or German language. However, if it comes to more complex languages
like Japanese or Korean, 256 different characters are simply insufficient.
So, users of these languages have developed the so-called double byte character sets, called DBCS. In DBCS, each character is represented by one or more bytes.
Character encoding specifies how to interpret the byte values and whether or not a byte is a single character or just part of a larger set of bytes representing a multi-byte character.
Unfortunately, there are many different DBCS encodings for a given language. To make matters worse, different Operating Systems and different programming languages
tend to use different DBCS encodings. Also, programming is relatively complex because of byte parsing operations.
Unicode's goal is to solve this issue by using more than one byte for each character. So with Unicode, all different characters can be stored in one string.
As all characters have a fixed width, programming complexity is greatly reduced.
...and There Came UNICODE
Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems. Unicode consists
of a repertoire of about 100,000 characters. The Unicode Consortium is based in California and the members mainly include computer software and hardware companies with
any interest in text-processing standards, such as Adobe Systems, Apple, HP, IBM, Microsoft, Xerox, and many others.
Implementation of Unicode as UCS and UTF
Several mechanisms have been specified for implementing Unicode. Which one implementers choose depends on the available storage space, source code compatibility, and interoperability
with other systems.
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character Set (UCS) encodings. An encoding
maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed "code values". The Universal Character Set (UCS-2)
format is a fixed-length character string where each character is represented by 2 bytes.
UTF encodings include:
- UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode Standard but rather an RFC)
- UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
- UTF-EBCDIC — an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC (not part of The Unicode Standard).
- UTF-16 — a 16-bit, variable-width encoding.
- UTF-32 — a 32-bit, fixed-width encoding.
SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data.
SQL Server Supports Only UCS-2 Unicode
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF
in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns.
Collations
The physical storage of character strings in SQL Server is controlled by collations. A collation specifies the bit patterns that represent each character and the rules
by which characters are sorted and compared. SQL Server supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations
can be specified down to the level of columns. Each column in a table can be assigned different collations.
Microsoft SQL Server collation settings depend on the type of installation. In general, choose a SQL Server collation that supports the Microsoft Windows system locale most
commonly used at your site. To identify your Windows system locale on computers running Microsoft Windows 2000, click Regional Settings in Control Panel, and then click
the General tab if necessary to display the current system locale.
In most cases, a computer runs the Windows system locale that matches the language requirements of the user, so Microsoft SQL Server Setup automatically detects the Windows
system locale and chooses the appropriate collation.
Unicode Support of SQL Server
One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding
specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such
as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters.
The Unicode specification addresses this problem by using two bytes to encode each character. There are enough different patterns (65,536) in 2 bytes for a single specification
covering the most common business languages. Because all Unicode systems consistently use the same bit patterns to represent all characters, there is no problem with characters
being converted incorrectly when moving from one system to another. You can minimize character conversion issues by using Unicode data types throughout your system.
Unicode support was introduced to SQL Server starting with SQL Server 7.0. Microsoft SQL Server allows the storage of Unicode data only in UCS-2 encoding.
Data Types
In Microsoft SQL Server, these data types support Unicode data:
Note: The n
prefix for these data types comes from the SQL-92 standard for National (Unicode) data types. Use of nchar
, nvarchar
,
and ntext
is the same as char
, varchar
, and text
, respectively, except that:
- Unicode supports a wider range of characters
- More space is needed to store Unicode characters
- The maximum size of
nchar
and nvarchar
columns is 4,000 characters, not 8,000 characters like char
and varchar
- Unicode constants are specified with a leading
N
: N'A Unicode string' - All Unicode data uses the same Unicode code page. Collations do not control the code page used for Unicode columns, only attributes such as comparison rules and case sensitivity
UNICODE Function
It returns the integer value, as defined by the Unicode standard, for the first character of the input expression. The Unicode function has the following syntax:
UNICODE ( 'ncharacter_expression' )
'ncharacter_expression'
is an nchar
or nvarchar
expression and the return type is int
.
How to Store UTF-8 Data in SQL Server
The following options are available to store UTF-8 encoded data in SQL Server.
Translate to and from UCS-2/UTF-8
Sample code for this type of conversion is located at the Unicode Consortium's site. A high-level description
of the algorithm to convert UCS-2 to UTF-8 can be found in the Internet Request For Comments document RFC2279.
On Windows NT or Windows 2000, you may use the Win32 functions MultiByteToWideChar
and WideCharToMultiByte
to convert UTF-8 to and from UCS-2
by passing the constant CP_UTF8
(65001) as the first parameter to the functions.
Use BINARY/VARBINARY/IMAGE Data Types
Store the actual UTF-8 data on the server using the BINARY
/VARBINARY
/IMAGE
columns. Storing UTF-8 data on SQL Server means that
you can not use SQL Server to sort or find ranges of these values as if the data were valid character data. The types of operations on columns containing UTF-8 data that
would not return expected results include ORDER BY
, greater-than ">
" and less-than "<
" comparisons, and the built-in SQL Server
string manipulation functions such as SUBSTRING()
.
However, equality comparisons will work as long as the strings being compared are equivalent at a byte level. Note that if you store UTF-8 data in SQL Server,
you should not use character columns (CHAR
/NCHAR
/VARCHAR
and so forth). UTF-8 is not valid character data to SQL Server,
and by storing non-character data in character columns, you risk encountering problems such as the issues discussed in the following Microsoft Knowledge Base articles:
- 155723 INF: SQL Server Truncation of a DBCS String
- 234748 PRB: SQL Server ODBC Driver Converts Language Events to Unicode
If you are considering this option, keep in mind that if you ever need to access UTF-8 data stored inside SQL Server from any application other than
a Web browser (for example, from a non-Web-based ODBC application), you will need to do a conversion from UTF-8 to UCS-2 within this application. This is because ODBC,
OLEDB, COM, Win32 API calls, VB, and C runtime string manipulation functions do not work with UTF-8 data. This moves the burden of translation to a different application.