Background
Recently I had a requirement to decode URL encoded string in T-SQL. I tried finding solutions on the internet, but the solutions I found all tried to decode byte by byte, which works for 1-byte Unicode as its value is less than 128. For example, “%2f” will be decoded as “/”. But for 2-byte, 3-byte and 4-byte Unicode, decoding byte by byte will not work. For example, “%c2%ae” should be decoded as “®” instead of “®”, “%E2%84%A2” should be decoded as “™” instead of “â„¢”, “%e6%9d%a8” should be decoded as “杨” instead of “樔. I set out to write a function which solves these issues.
Now, recall that URL encoding converts non-ASCII characters to its byte sequence in UTF-8. It then adds a “%” in front of each pair of hexadecimal digits and replaces non-ASCII characters with its corresponding percent-encoding sequences.
SQL Server does not support UTF-8 encoding. Instead it supports the UTF-16 encoding. Transformation from UTF-8 into UTF-16 is necessary. Since there is a built-in function “NCHAR” to return the Unicode character and “NCHAR” can accept decimal values, we just need to convert UTF-8 to decimal.
The initial byte of 2-, 3- and 4-byte UTF-8 starts with 2, 3 or 4 one bits, followed by a zero bit. The following bytes always start with the two-bit pattern 10, leaving 6 bits for data:
1-byte UTF-8 = 0xxxxxxxbin = 7 bits = 0-7Fhex
2-byte UTF-8 = 110xxxxx 10xxxxxxbin = 5+6(11) bits = 80-7FFhex
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxxbin = 4+6+6(16) bits = 800-FFFFhex
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxxbin = 3+6+6+6(21) bits = 10000-10FFFFhex
Let’s focus on the first byte, the minimum and maximum values for the first byte are:
Bytes | Byte 1 | Min-Byte1 | Max-Byte1 |
1-byte UTF-8 | 0xxxxxxx | 0x00 (0) | 0x7F (127) |
2-byte UTF-8 | 110xxxxx | 0xC0 (192) | 0xDF (223) |
3-byte UTF-8 | 1110xxxx | 0xE0 (224) | 0xEF (239) |
4-byte UTF-8 | 11110xxx | 0xF0 (240) | 0xF7 (247) |
I will use the value of “Byte1” to determine how many bytes of UTF-8 we should to use to translate to a Unicode character.
The code
CREATE FUNCTION [dbo].[UrlDecode]
(
@URL NVARCHAR(4000)
)
RETURNS NVARCHAR(4000)
AS
BEGIN
DECLARE @Position INT, @Base CHAR(16), @High TINYINT, @Low TINYINT, @Pattern CHAR(21)
DECLARE @Byte1Value INT, @SurrogateHign INT, @SurrogateLow INT
SELECT @Pattern = '%[%][0-9a-f][0-9a-f]%', @Position = PATINDEX(@Pattern, @URL)
WHILE @Position > 0
BEGIN
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = 16 * @High + @Low
IF @Byte1Value < 128
SELECT @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
@Position = PATINDEX(@Pattern, @URL)
ELSE IF @Byte1Value >= 192 AND @Byte1Value < 224 AND @Position > 0
BEGIN
SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
@URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
@Position = PATINDEX(@Pattern, @URL)
END
ELSE IF @Byte1Value >= 224 AND @Byte1Value < 240 AND @Position > 0
BEGIN
SELECT @Byte1Value = (@Byte1Value & (POWER(2,4) - 1)) * POWER(2,12),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
@URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
@Position = PATINDEX(@Pattern, @URL)
END
ELSE IF @Byte1Value >= 240 AND @Position > 0
BEGIN
SELECT @Byte1Value = (@Byte1Value & (POWER(2,3) - 1)) * POWER(2,18),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,12),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
@URL = STUFF(@URL, @Position, 3, ''),
@Position = PATINDEX(@Pattern, @URL)
IF @Position > 0
BEGIN
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))
SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
@SurrogateLow = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
@URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow)),
@Position = PATINDEX(@Pattern, @URL)
END
END
END
RETURN REPLACE(@URL, '+', ' ')
END
Using the code
Using the code is simple. The format is:
SELECT dbo.UrlDecode('URL Encoded String')
Example 1: 1-byte, 2-byte and 3-byte UTF-8 (English)
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=2nd+Generation+Intel%C2%AE+Core%E2%84%A2+Processors+with+Intel%C2%AE+HD+Graphics+3000%2F2000')
The result is:
http://domain.com/search?keyword=2nd Generation Intel® Core™ Processors with Intel® HD Graphics 3000/2000
Example 2: 2-byte and 3-byte UTF-8 (Simplified Chinese)
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E8%8B%B1%E7%89%B9%E5%B0%94%C2%AE+%E9%85%B7%E7%9D%BF%E2%84%A2+M+%E5%A4%84%E7%90%86%E5%99%A8')
The result is:
http://domain.com/search?keyword=英特尔® 酷睿™ M 处理器
Example 3: 2-byte and 3-byte UTF-8 (Arabic)
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%D8%A5%D9%86%D8%AA%D9%84+%C2%AE+%D9%83%D9%88%D8%B1+%E2%84%A2+M+%D8%A7%D9%84%D9%85%D8%B9%D8%A7%D9%84%D8%AC%0D%0A')
The result is:
http://domain.com/search?keyword=إنتل ® كور ™ M المعالج
Example 4: 2-byte and 3-byte UTF-8 (Japanese)
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E3%82%A4%E3%83%B3%E3%83%86%E3%83%AB%C2%AE+Core%E2%84%A2+M+%E3%83%97%E3%83%AD%E3%82%BB%E3%83%83%E3%82%B5%E3%83%BC%0D%0A')
The result is:
http://domain.com/search?keyword=インテル® Core™ M プロセッサー
Example 5: 4-byte UTF-8 (Chinese)
SELECT dbo.UrlDecode ('%F0%A9%A3%91%F0%A3%85%BF')
The result is:
Explanations
1. Calculate decimal value of a hexadecimal digit:
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17
The calculation is illustrated as following:
| Hex Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
(1) = | ASCII Value | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 65 | 66 | 67 | 68 | 69 | 70 |
(2) = | (1) - 48 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 17 | 18 | 19 | 20 | 21 | 22 |
(3) = | (2) / 17 *10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 |
(4) = | (2) % 17 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 1 | 2 | 3 | 4 | 5 |
(5) = | (3) + (4) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
2. Translate UTF-8 encoding bytes to its corresponding character
Let’s use 2-byte UTF-8 as an example. 2-byte UTF-8 in binary looks like:
Byte1: 110xxxxx
Byte2: 10xxxxxx
Remove “110” from byte1 and “10” from byte2. Concatenate both binaries together as “xxxxx xxxxxx”. Calculate its decimal value, and convert it to a UTF-16 character.
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = 16 * @High + @Low
SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6)
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
@Low = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
@High = @High / 17 * 10 + @High % 17,
@Low = @Low / 17 * 10 + @Low % 17,
@Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))
SELECT NCHAR(@Byte1Value)
- The first select statement is to get the first byte and calculate its decimal value.
- The second select statement removes “110” from byte1 by using Bitwise AND with (25 – 1), which is 0x1F in hexadecimal or 0001 1111 in binary. Then, left shift the result by 6 bits by multiplying the result by 26.
- The third select statement takes the second byte and removes “10” by using Bitwise AND with (26 – 1), which is 0x3F in hex or 0011 1111 in binary. Then, add the value of byte1 and byte2 together.
- The fourth select statement translates the value to its corresponding Unicode character.
3. Consider if the default database collation supports supplementary characters
For the 4-byte UTF-8, the value ranges from 65536 to 1113111 (0x10000 to 0x10FFFF). According to MSDN, if the collation of the database does not contain the supplementary character (SC) flag, function NCHAR will return NULL. The database collation I used to test is “SQL_Latin1_General_CP1_CI_AS”, and it does not contain the SC flag, I will need to transform it to the corresponding surrogate pair using the following algorithm:
- Subtract 0x10000;
- Split this into the high 10-bit value and the low 10-bit value;
- Add 0xD800 to the high value to form the high surrogate;
- Add 0xDC00 to the low value to form the low surrogate.
The code is:
SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
@SurrogateLow = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
@URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow))
If the database contains the supplementary character (SC) flag, you may comment out the above transformation and uncomment the following code located at the 4-byte UTF-8 decoding part above the surrogate code:
History
- 01 Jul 2015 - Initial public draft;
- 10 Sep 2015 - Change the display style of the article. Code is unchanged.