Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / database / SQL-Server

URL Decode in T-SQL

4.58/5 (6 votes)
1 Jul 2015CPOL5 min read 45.9K   97  
This article provides code for decoding strings that have been encoded for URL transmission in pure T-SQL that works properly for all URL encoded strings.
Download script - 983 byte

Background

Recently I had a requirement to decode URL encoded string in T-SQL. I tried finding solutions on the internet, but the solutions I found all tried to decode byte by byte, which works for 1-byte Unicode as its value is less than 128. For example, “%2f” will be decoded as “/”. But for 2-byte, 3-byte and 4-byte Unicode, decoding byte by byte will not work. For example, “%c2%ae” should be decoded as “®” instead of “®”, “%E2%84%A2” should be decoded as “™” instead of “â„¢”, “%e6%9d%a8” should be decoded as “杨” instead of  “杨”. I set out to write a function which solves these issues.

Now, recall that URL encoding converts non-ASCII characters to its byte sequence in UTF-8. It then adds a “%” in front of each pair of hexadecimal digits and replaces non-ASCII characters with its corresponding percent-encoding sequences.

SQL Server does not support UTF-8 encoding. Instead it supports the UTF-16 encoding.  Transformation from UTF-8 into UTF-16 is necessary. Since there is a built-in function “NCHAR” to return the Unicode character and “NCHAR” can accept decimal values, we just need to convert UTF-8 to decimal.

The initial byte of 2-, 3- and 4-byte UTF-8 starts with 2, 3 or 4 one bits, followed by a zero bit. The following bytes always start with the two-bit pattern 10, leaving 6 bits for data:

1-byte UTF-8 = 0xxxxxxxbin = 7 bits = 0-7Fhex

2-byte UTF-8 = 110xxxxx 10xxxxxxbin = 5+6(11) bits = 80-7FFhex

3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxxbin = 4+6+6(16) bits = 800-FFFFhex

4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxxbin = 3+6+6+6(21) bits = 10000-10FFFFhex

Let’s focus on the first byte, the minimum and maximum values for the first byte are:

Bytes Byte 1 Min-Byte1 Max-Byte1
1-byte UTF-8 0xxxxxxx 0x00  (0) 0x7F  (127)
2-byte UTF-8 110xxxxx 0xC0  (192) 0xDF  (223)
3-byte UTF-8 1110xxxx 0xE0  (224) 0xEF  (239)
4-byte UTF-8 11110xxx 0xF0  (240) 0xF7  (247)

I will use the value of “Byte1” to determine how many bytes of UTF-8 we should to use to translate to a Unicode character.

The code

SQL
CREATE FUNCTION [dbo].[UrlDecode]
(
    @URL NVARCHAR(4000)
)  
RETURNS NVARCHAR(4000)
AS
BEGIN
    DECLARE @Position INT, @Base CHAR(16), @High TINYINT, @Low TINYINT, @Pattern CHAR(21)
    DECLARE @Byte1Value INT, @SurrogateHign INT, @SurrogateLow INT
    SELECT @Pattern = '%[%][0-9a-f][0-9a-f]%', @Position = PATINDEX(@Pattern, @URL)

    WHILE @Position > 0
    BEGIN
       SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
              @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
              @High = @High / 17 * 10 + @High % 17,
              @Low  = @Low  / 17 * 10 + @Low  % 17,
              @Byte1Value = 16 * @High + @Low
       IF @Byte1Value < 128 --1-byte UTF-8
          SELECT @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
                 @Position = PATINDEX(@Pattern, @URL)
       ELSE IF @Byte1Value >= 192 AND @Byte1Value < 224 AND @Position > 0 --2-byte UTF-8
       BEGIN
           SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6),
                  @URL = STUFF(@URL, @Position, 3, ''),
                  @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
                     @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
                     @Position = PATINDEX(@Pattern, @URL)
       END
       ELSE IF @Byte1Value >= 224 AND @Byte1Value < 240 AND @Position > 0 --3-byte UTF-8
       BEGIN
           SELECT @Byte1Value = (@Byte1Value & (POWER(2,4) - 1)) * POWER(2,12),
                  @URL = STUFF(@URL, @Position, 3, ''),
                  @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
                     @URL = STUFF(@URL, @Position, 3, ''),
                     @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)),
                     @URL = STUFF(@URL, @Position, 3, NCHAR(@Byte1Value)),
                     @Position = PATINDEX(@Pattern, @URL)
       END
       ELSE IF @Byte1Value >= 240 AND @Position > 0  --4-byte UTF-8
       BEGIN
           SELECT @Byte1Value = (@Byte1Value & (POWER(2,3) - 1)) * POWER(2,18),
                  @URL = STUFF(@URL, @Position, 3, ''),
                  @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,12),
                     @URL = STUFF(@URL, @Position, 3, ''),
                     @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1)) * POWER(2,6),
                     @URL = STUFF(@URL, @Position, 3, ''),
                     @Position = PATINDEX(@Pattern, @URL)
           IF @Position > 0
           BEGIN
              SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
                     @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
                     @High = @High / 17 * 10 + @High % 17,
                     @Low  = @Low  / 17 * 10 + @Low  % 17,
                     @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))
                     --,@URL = STUFF(@URL, @Position, 3, cast(@Byte1Value as varchar))
                     --,@Position = PATINDEX(@Pattern, @URL)

              SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
                     @SurrogateLow = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
                     @URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow)),
                     @Position = PATINDEX(@Pattern, @URL)
           END
       END
    END
    RETURN REPLACE(@URL, '+', ' ')
END

Using the code

Using the code is simple. The format is:

SQL
SELECT dbo.UrlDecode('URL Encoded String')

Example 1: 1-byte, 2-byte and 3-byte UTF-8 (English)

SQL
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=2nd+Generation+Intel%C2%AE+Core%E2%84%A2+Processors+with+Intel%C2%AE+HD+Graphics+3000%2F2000')

The result is:

http://domain.com/search?keyword=2nd Generation Intel® Core™ Processors with Intel® HD Graphics 3000/2000

Example 2: 2-byte and 3-byte UTF-8 (Simplified Chinese)

SQL
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E8%8B%B1%E7%89%B9%E5%B0%94%C2%AE+%E9%85%B7%E7%9D%BF%E2%84%A2+M+%E5%A4%84%E7%90%86%E5%99%A8')

The result is:

http://domain.com/search?keyword=英特尔® 酷睿™ M 处理器

Example 3: 2-byte and 3-byte UTF-8 (Arabic)

SQL
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%D8%A5%D9%86%D8%AA%D9%84+%C2%AE+%D9%83%D9%88%D8%B1+%E2%84%A2+M+%D8%A7%D9%84%D9%85%D8%B9%D8%A7%D9%84%D8%AC%0D%0A')

The result is:

 http://domain.com/search?keyword=إنتل ® كور ™ M المعالج

Example 4: 2-byte and 3-byte UTF-8 (Japanese)

SQL
SELECT dbo.UrlDecode ('http://domain.com/search?keyword=%E3%82%A4%E3%83%B3%E3%83%86%E3%83%AB%C2%AE+Core%E2%84%A2+M+%E3%83%97%E3%83%AD%E3%82%BB%E3%83%83%E3%82%B5%E3%83%BC%0D%0A')

The result is:

http://domain.com/search?keyword=インテル® Core™ M プロセッサー

Example 5: 4-byte UTF-8 (Chinese)

SQL
SELECT dbo.UrlDecode ('%F0%A9%A3%91%F0%A3%85%BF')

The result is:

Explanations

 

1. Calculate decimal value of a hexadecimal digit:
SQL
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
       @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
       @High = @High / 17 * 10 + @High % 17,
       @Low  = @Low  / 17 * 10 + @Low  % 17

The calculation is illustrated as following:

  Hex Digit  0 1 2 3 4 5 6 7 8 9 A B C D E F
(1) = ASCII Value 48 49 50 51 52 53 54 55 56 57 65 66 67 68 69 70
(2) = (1) - 48 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21 22
(3) = (2) / 17 *10 0 0 0 0 0 0 0 0 0 0 10 10 10 10 10 10
(4) = (2) % 17 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
(5) = (3) + (4) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

2. Translate UTF-8 encoding bytes to its corresponding character

Let’s use 2-byte UTF-8 as an example. 2-byte UTF-8 in binary looks like:

Byte1: 110xxxxx
Byte2: 10xxxxxx

Remove “110” from byte1 and “10” from byte2. Concatenate both binaries together as “xxxxx xxxxxx”. Calculate its decimal value, and convert it to a UTF-16 character.

SQL
SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
       @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
       @High = @High / 17 * 10 + @High % 17,
       @Low  = @Low  / 17 * 10 + @Low  % 17,
       @Byte1Value = 16 * @High + @Low

SELECT @Byte1Value = (@Byte1Value & (POWER(2,5) - 1)) * POWER(2,6)

SELECT @High = ASCII(UPPER(SUBSTRING(@URL, @Position + 1, 1))) - 48,
       @Low  = ASCII(UPPER(SUBSTRING(@URL, @Position + 2, 1))) - 48,
       @High = @High / 17 * 10 + @High % 17,
       @Low  = @Low  / 17 * 10 + @Low  % 17,
       @Byte1Value = @Byte1Value + ((16 * @High + @Low) & (POWER(2,6) - 1))

SELECT NCHAR(@Byte1Value)
  • The first select statement is to get the first byte and calculate its decimal value.
  • The second select statement removes “110” from byte1 by using Bitwise AND with (25 – 1), which is 0x1F in hexadecimal or 0001 1111 in binary. Then, left shift the result by 6 bits by multiplying the result by 26.
  • The third select statement takes the second byte and removes “10” by using Bitwise AND with (26 – 1), which is 0x3F in hex or 0011 1111 in binary. Then, add the value of byte1 and byte2 together.
  • The fourth select statement translates the value to its corresponding Unicode character.

3. Consider if the default database collation supports supplementary characters

For the 4-byte UTF-8, the value ranges from 65536 to 1113111 (0x10000 to 0x10FFFF). According to MSDN, if the collation of the database does not contain the supplementary character (SC) flag, function NCHAR will return NULL. The database collation I used to test is “SQL_Latin1_General_CP1_CI_AS”, and it does not contain the SC flag, I will need to transform it to the corresponding surrogate pair using the following algorithm:

  • Subtract 0x10000;
  • Split this into the high 10-bit value and the low 10-bit value;
  • Add 0xD800 to the high value to form the high surrogate;
  • Add 0xDC00 to the low value to form the low surrogate.

The code is:

SQL
SELECT @SurrogateHign = ((@Byte1Value - POWER(16,4)) & (POWER(2,20) - 1)) / POWER(2,10) + 13 * POWER(16,3) + 8 * POWER(16,2),
       @SurrogateLow  = ((@Byte1Value - POWER(16,4)) & (POWER(2,10) - 1)) + 13 * POWER(16,3) + 12 * POWER(16,2),
       @URL = STUFF(@URL, @Position, 3, NCHAR(@SurrogateHign) + NCHAR(@SurrogateLow))

If the database contains the supplementary character (SC) flag, you may comment out the above transformation and uncomment the following code located at the 4-byte UTF-8 decoding part above the surrogate code:

SQL
--,@URL = STUFF(@URL, @Position, 3, cast(@Byte1Value as varchar))
--,@Position = PATINDEX(@Pattern, @URL)

History

  • 01 Jul 2015 - Initial public draft;
  • 10 Sep 2015 - Change the display style of the article. Code is unchanged.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)