Introduction
If you are using SQLite in .NET, perhaps you could have the problem of case-insensitive sorting of UTF8 strings.
Background
This code is useful if you use System.Data.SQLite
, which can be downloaded here: Download System.Data.SQLite. I have been using SQLite for a long time and stored Russian strings in it. As it's stated in SQLite FAQ, case-insensitive matching of Unicode characters does not work. This problem has become very significant when the amount of stored data has grown. System.Data.SQLite
provides a cool solution for that but it's unclear how to implement it.
Using the Code
Using the code is pretty simple: just add this line in the beginning of your application:
SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));
The Database
Okay, let us start. First of all, SQLite has three built-in collations (if you need more information, you can read it on SQLite's site):
BINARY
– All the characters are treated as bytes NOCASE
– Compares ASCII characters in case-insensitive manner; others are treated as bytesRTRIM
– Just skips trailing spaces
We want to create the following table:
ID | Name | Val (some value stored in the table) | Comment |
1 | Foo | 123 | |
2 | foo | 234 | |
3 | Bar | 345 | |
4 | bar | 456 | |
5 | Буква | 567 | A sample Russian word starting from uppercase letter |
6 | буква | 678 | The same Russian word starting from lowercase letter |
7 | Тест | 789 | Another Russian word starting from uppercase letter |
8 | тест | 890 | The same Russian word starting from lowercase letter |
Here's SQL to create it:
CREATE TABLE `testtbl` (
`ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
`Name` TEXT NOT NULL COLLATE NOCASE,
`Val` REAL)
No we'll run a simple SELECT
:
SELECT * FROM `testtbl` ORDER BY `Name`
We'll get the result:
ID | Name | Val (some value stored in the table) | Comment |
2 | Bar | 345 | |
3 | bar | 456 | |
1 | Foo | 123 | |
2 | foo | 234 | |
5 | Буква | 567 | Incorrect sort! |
7 | Тест | 789 |
6 | буква | 678 |
8 | тест | 890 |
As you see, data in Russian is sorted incorrectly. Correct would be the following order: Буква, буква, Тест, тест.
Now we'll fix this. It would be perfect if SQLite could sort by this expression (in C#):
string.Compare(x, y, CultureInfo.CreateSpecificCulture("ru-RU"),
CompareOptions.IgnoreCase);
Defining Case-Insensitive Collation in C# Code
I have created a simple class SQLiteCaseInsensitiveCollation
that adds collation to SQLite engine:
using System.Data.SQLite;
using System.Globalization;
namespace SQLiteUTF8CIComparison {
[SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]
public class SQLiteCaseInsensitiveCollation : SQLiteFunction {
private static readonly CultureInfo _cultureInfo =
CultureInfo.CreateSpecificCulture("ru-RU");
public override int Compare(string x, string y) {
return string.Compare(x, y, _cultureInfo, CompareOptions.IgnoreCase);
}
}
}
Please pay your attention to the attribute:
[SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]
Here we define that:
- This is a function that must be added to SQLite engine
- It's a collation function
- Its name is
UTF8CI
which means UTF8 Case Insensitive
To activate the collation, you need register it in SQLite engine – put this line of code before you create all SQLite connections in your application:
SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));
Please note that the function must be called only once, e.g. when your application starts.
Using UTF8CI Collation in SQLite
How to use this collation in SQLite?
CREATE TABLE `testtbl` (
`ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
`Name` TEXT NOT NULL COLLATE UTF8CI,
`Val` REAL)
As you see, everything we need is just replacing the collation. Now we also need create an index with this collation:
CREATE INDEX `IDX_testtbl_Name` ON `testtbl` (`Name` COLLATE UTF8CI)
Performing a Test
Now the SELECT
will give us the result:
ID | Name | Val (some value stored in the table) | Comment |
2 | Bar | 345 | |
3 | bar | 456 | |
1 | Foo | 123 | |
2 | foo | 234 | |
5 | Буква | 567 | That's it! |
6 | буква | 678 |
7 | Тест | 789 |
8 | тест | 890 |
Yep! The Russian data is sorted in case-insensitive way, as we wanted. Our index is used because it's case-insensitive too.
Compatibility
What happens if you open SQLite file with UTF8CI encoding without adding our custom encoding? Let's see the result of SELECT
s:
SELECT * FROM `testtbl`
It executes normally – neither sort collation, nor index is used.
SELECT * FROM `testtbl` ORDER BY `Name`
Oops. We get an error: No such collation: UTF8CI
. To avoid this, you can use the following SQL:
SELECT * FROM `testtbl` ORDER BY `Name` COLLATE BINARY
This executes normally and gives the result as usual case-sensitive (binary) sort. Our index with UTF8CI encoding cannot be used: SQLite has no information about the collation of the index.
Points of Interest
There are several things we have learned:
- It is possible to use case-insensitive sort for UTF8 data in SQLite.
- You can also create case-insensitive UTF8 index in SQLite.
- You can use my function (with your locale) for creating UTF8CI collation in SQLite.
- Please note: if you will forget to register this function, you will get an exception when reading data.
- The database will be usable without the code of the function; but sort by field will be unavailable only with built-in encodings.
- Any index created with UTF8CI encoding will be disabled.
Revision History
- Nov. 2009 - Initial revision