Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Mobile / WinMobile

Case-Insensitive Sort of UTF8 Data Using System.Data.SQLite

4.92/5 (20 votes)
22 Nov 2009CPOL4 min read 79.2K   826  
SQLite lacks case-insensitive sort of UTF8 data. In this article, you will see how to get rid of this limitation in .NET.

Introduction

If you are using SQLite in .NET, perhaps you could have the problem of case-insensitive sorting of UTF8 strings.

Background

This code is useful if you use System.Data.SQLite, which can be downloaded here: Download System.Data.SQLite. I have been using SQLite for a long time and stored Russian strings in it. As it's stated in SQLite FAQ, case-insensitive matching of Unicode characters does not work. This problem has become very significant when the amount of stored data has grown. System.Data.SQLite provides a cool solution for that but it's unclear how to implement it.

Using the Code

Using the code is pretty simple: just add this line in the beginning of your application:

C#
SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));

The Database

Okay, let us start. First of all, SQLite has three built-in collations (if you need more information, you can read it on SQLite's site):

  • BINARY – All the characters are treated as bytes
  • NOCASE – Compares ASCII characters in case-insensitive manner; others are treated as bytes
  • RTRIM – Just skips trailing spaces

We want to create the following table:

IDNameVal (some value stored in the table)Comment
1Foo123 
2foo234 
3Bar345 
4bar456 
5Буква567A sample Russian word starting from uppercase letter
6буква678The same Russian word starting from lowercase letter
7Тест789Another Russian word starting from uppercase letter
8тест890The same Russian word starting from lowercase letter

Here's SQL to create it:

SQL
CREATE TABLE `testtbl` (
    `ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    `Name` TEXT NOT NULL COLLATE NOCASE, 
    `Val` REAL)   

No we'll run a simple SELECT:

SQL
SELECT * FROM `testtbl` ORDER BY `Name`

We'll get the result:

IDNameVal (some value stored in the table)Comment
2Bar345 
3bar456 
1Foo123 
2foo234 
5Буква567Incorrect sort!
7Тест789
6буква678
8тест890

As you see, data in Russian is sorted incorrectly. Correct would be the following order: Буква, буква, Тест, тест.

Now we'll fix this. It would be perfect if SQLite could sort by this expression (in C#):

C#
string.Compare(x, y, CultureInfo.CreateSpecificCulture("ru-RU"), 
	CompareOptions.IgnoreCase);

Defining Case-Insensitive Collation in C# Code

I have created a simple class SQLiteCaseInsensitiveCollation that adds collation to SQLite engine:

C#
using System.Data.SQLite;
using System.Globalization;

namespace SQLiteUTF8CIComparison {
    /// <summary />
    /// This function adds case-insensitive sort feature to SQLite engine 
    /// To initialize, use SQLiteFunction.RegisterFunction() 
    /// before all connections are open 
    /// </summary />
    [SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]
    public class SQLiteCaseInsensitiveCollation : SQLiteFunction {
        /// <summary />
        /// CultureInfo for comparing strings in case insensitive manner 
        /// </summary />
        private static readonly CultureInfo _cultureInfo = 
			CultureInfo.CreateSpecificCulture("ru-RU");

        /// <summary />
        /// Does case-insensitive comparison using _cultureInfo 
        /// </summary />
        /// Left string
        /// Right string
        /// <returns />The result of a comparison</returns />
        public override int Compare(string x, string y) {
            return string.Compare(x, y, _cultureInfo, CompareOptions.IgnoreCase);
        }
    }
}   

Please pay your attention to the attribute:

C#
[SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]

Here we define that:

  • This is a function that must be added to SQLite engine
  • It's a collation function
  • Its name is UTF8CI which means UTF8 Case Insensitive

To activate the collation, you need register it in SQLite engine – put this line of code before you create all SQLite connections in your application:

C#
SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));

Please note that the function must be called only once, e.g. when your application starts.

Using UTF8CI Collation in SQLite

How to use this collation in SQLite?

SQL
CREATE TABLE `testtbl` (
    `ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    `Name` TEXT NOT NULL COLLATE UTF8CI, 
    `Val` REAL)

As you see, everything we need is just replacing the collation. Now we also need create an index with this collation:

SQL
CREATE INDEX `IDX_testtbl_Name` ON `testtbl` (`Name` COLLATE UTF8CI)

Performing a Test

Now the SELECT will give us the result:

IDNameVal (some value stored in the table)Comment
2Bar345 
3bar456 
1Foo123 
2foo234 
5Буква567That's it!
6буква678
7Тест789
8тест890

Yep! The Russian data is sorted in case-insensitive way, as we wanted. Our index is used because it's case-insensitive too.

Compatibility 

What happens if you open SQLite file with UTF8CI encoding without adding our custom encoding? Let's see the result of SELECTs:

SQL
SELECT * FROM `testtbl`

It executes normally – neither sort collation, nor index is used.

SQL
SELECT * FROM `testtbl` ORDER BY `Name`

Oops. We get an error: No such collation: UTF8CI. To avoid this, you can use the following SQL:

SQL
SELECT * FROM `testtbl` ORDER BY `Name` COLLATE BINARY

This executes normally and gives the result as usual case-sensitive (binary) sort. Our index with UTF8CI encoding cannot be used: SQLite has no information about the collation of the index.

Points of Interest

There are several things we have learned:

  • It is possible to use case-insensitive sort for UTF8 data in SQLite.
  • You can also create case-insensitive UTF8 index in SQLite.
  • You can use my function (with your locale) for creating UTF8CI collation in SQLite.
  • Please note: if you will forget to register this function, you will get an exception when reading data.
  • The database will be usable without the code of the function; but sort by field will be unavailable only with built-in encodings.
  • Any index created with UTF8CI encoding will be disabled.

Revision History

  • Nov. 2009 - Initial revision 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)