(untagged)

Fast String Sort in C# and F#

Stefan Savev 2

0.00/5 (No votes)

16 Jan 2011

Implementation of Multikey String Quick Sort (following Sedgewick)

Download source - 7.08 KB

Introduction

The generic sort algorithm in .NET does not perform well on sorting strings. The reason for that is that it performs too many character comparisons while comparing strings.

There is a better algorithm described by Prof. Robert Sedgewick called multikey quicksort. Sedgewick is known for his developer friendly books on algorithms and his approach to empirically study algorithm performance. In the case of string sort, Sedgewick provides a reference implementation of string sort in C:

Sedgewick's Reference Implementation of Sorting Strings

His implementation is quite hard to understand but achieves excellent performance.

String Sort Implementation in C# and F#

I provide three implementations of multikey quick sort algorithm in C# and F# (in addition to the pseudo-like code below). My implementations are slightly different from Sedgewick's and slightly slower, but much faster than the .NET implementation. Initially, I had difficulty adapting Sedgewick's implementation due to a corner case when the string can contain the '\0' character. In the new code, this problem is fixed and I provide an implementation in C# very close to the reference.

String Sort port to C#

The problem with .NET sort is not in the implementation per se, but in the algorithm. .NET implements plain quick sort, which is one of the fastest algorithms for integers or doubles. Sedgewick proposes a few extensions to plain quick sort which make the algorithm useful for:

data which contains duplicates
multikey data, such as strings or arrays of characters

Additionally, Sedgewick uses two other tricks:

insertion sort for small arrays
pseudo-median of 9 to find a good pivot

Pseudo-code of Multikey Quicksort

The real code of string sort is somewhat hard to understand because it uses inplace update and multiple indices. The multikey quicksort algorithm however, is not. In this section, I provide actual F# code to serve as pseudo-code. This code can serve as a high-level specification of the actual algorithm. However, this pseudo-code omits details which affect performance.

let filter = Array.filter //used as an alias         		//line  1
let concatenate = Array.concat //another alias                  //line  2

let cmp (depth: int, s: string, t: string) =                    //line  3
    let lenS, lenT = s.Length, t.Length
    assert(lenS >= depth); assert(lenT >= depth)           	//line  4
    if (lenS = depth && lenT = depth) then (*return*) 0       	//line  5
    else if (lenS = depth) then (*return*) -1                   //line  6
    else if (lenT = depth) then (*return*) 1                    //line  7
    else
        assert(lenS > depth); assert(lenT > depth)              //line  8
        let chS, chT = s.[depth], t.[depth]                     //line  9
        if chS < chT then (*return*) -1                         //line 10
        else if chS = chT then (*return*) 0                     //line 11
        else (*return*) 1                                   	//line 12

// Do not use in production. This code is meant to serve as pseudo-code only
// Production code uses details that are not presented here
let rec pseudoCodeMultiKeyQuickSort(input: string[], depth: int): string[] = //line 13
    if (input.Length = 0) then Array.empty                               	//line 14
    else												
        let pivot = input.[input.Length / 2] //in reality median of 3 or 9 is used //line 15
        //in reality the following filters are done in place in two passes         //line 16
        let less = filter(fun s -> cmp(depth, s, pivot) < 0) input //  s < pivot	//line 17
        let eq = filter(fun s -> cmp(depth, s, pivot) = 0) input   //  s = pivot 	//line 18
        let gt = filter(fun s -> cmp(depth, s, pivot) > 0) input   //  s > pivot 	//line 19
        let sortedLess = pseudoCodeMultiKeyQuickSort(less, depth) 		//line 20
        let sortedEq = if (eq.Length > 0 && depth < eq.[0].Length) then	//line 21
                           pseudoCodeMultiKeyQuickSort
			(eq, depth + 1) //note that depth grows   //line 21
                       else eq                                       	//line 22
        let sortedGt = pseudoCodeMultiKeyQuickSort(gt, depth)        	//line 23
        //in reality the algorithm updates the same array and					
        //recursively passes from and to indices to define subarrays
        (*return*) 
        concatenate [|sortedLess; sortedEq; sortedGt|]              	//line 24

let pseudoStringSort (input: string[]) =                               	//line 25
    (*return*) pseudoCodeMultiKeyQuickSort(input, 0 (*depth*))         	//line 26

pseudoStringSort [|"aaa";"a";"b";"";"ba";"bbb"|]

Please pay attention to the comparison function (named cmp, line 3) which takes a depth parameter in addition to the two strings (s and t) to be compared. Also note how the algorithm recurses on two dimensions: subarrays and the depth parameter (see Figure 1). You can think that the algorithm operates on levels: first it sorts by the first character, finds all strings where the first character is equal to the pivot and, for this group of strings only, sorts by the second character. Special care has to be taken for strings which run out of characters as the depth parameter grows. Depth grows only on line 21. In that line, we have a block of strings which have the same prefix from 0 to depth (inclusive). As depth grows, it will eventually end up with empty strings. If the current suffix (starting at depth) of the first string in the block of equal strings (line 21) is the empty string (tested by eq.[0].Length condition), then because all strings in this block are equal, they are all empty. This is used as an termination condition to the recursion by depth. The other termination condition is that a subarray has become empty.

Finally, some notes on the F# syntax might be useful for those unfamiliar with F#. [|a;b;c|] is used to denote an array of a, b and c; arr.[index] is used to access an element at position index; let x = 2 is used to bind variable x to value 2; fun s -> f(s) is the F# syntax for a lambda expression, or a predicate which will take s and run f with argument s and return the result of f to the caller. The predicate may be executed multiple times.

Figure 1: Multikey quick sort recursion.

Dimensions by Which to Judge String Sorting Algorithms

The following dimensions will affect which string sorting algorithm performs the best:

Length of Input

Small arrays vs. large Arrays. .NET Sort will perform well on small string arrays and on arrays where the strings are short. As the input array grows in size, or the strings in the array grow, .NET will take much longer to terminate.

Duplicate Strings

If there are many duplicate strings, ternary partitioning (i.e. partitioning into three parts) in multikey quicksort (as implemented for strings by Sedgewick) will pay off for its overhead. If duplicate strings are not expected, binary partitioning algorithms will be faster. Ternary partitioning works in the following way. A pivot is chosen and in the end the elements are partitioned into three buckets in the following order: "<= pivot", "= pivot", and ">= pivot". In contrast, binary partitioning will use only two buckets: "<= pivot" and "> pivot". Ternary partitioning is more complicated to implement without using an additional array as it first splits the elements in four groups in the order given: "= pivot", "< pivot", "> pivot", "= pivot". Then it moves back the two "= pivot" blocks in the middle: "< pivot", "= pivot", "= pivot", "> pivot". In addition to having bigger inner loops than binary quick sort, one iteration of ternary quick sort requires more data swapping. The size of the code in the inner loop affects greatly the performance of quick sort. However, if the "= pivot" blocks happen to be large, then there is less work to do when recursing quick sort on the "< pivot" and "> pivot" blocks. In the case of multikey quick sort, the pivot is a character and most likely will match many elements.

String Length

If the input array contains larger and larger strings, the ratio between string sort and generic sort in general becomes larger. This case is worsened if the strings to be sorted contain a large number of long common prefixes. This is the case for sorting words from the English Language.

Adaptation to Non-English Languages

The reference code provided sorts strings by comparing the characters of strings from first to last, and compares the characters by their ASCII codes. There are two problems: reverse sort order and non-English languages.

First, even if you provide a char comparator sorting in reverse will not work unless the algorithm is changed. This is caused by empty strings (or strings that become empty as the depth parameter grows). It's possible to change the algorithm to account for reverse comparators, but providing a char comparator only is not sufficient.
The second problem has to do with the fact that for many languages, a char comparator is not enough to derive a string comparator. The issue is outlined in an MSDN article titled:
- Globalization Step-by-Step, Sorting and String Comparison
There are some locales in which there does not exist a one-to-one correspondence between characters and letters. To sort in non-English languages, one should break the sequence of characters into letters and then sort by letters. Luckily, .NET provides a class called SortKey which will convert a string in a non-English language to a byte array. Sorting strings by their corresponding byte arrays will result in the correct sort order. In this way, the multikey quicksort algorithm can be easily adapted to operate on byte arrays instead on strings. For details on converting strings to SortKeys, please check the referenced MSDN article. For efficiency purposes, strings should be converted to SortKeys before running the sorting algorithm.

Comparisons

As explained in the previous section, careful comparison of sorting algorithms will have to account for at least the above mentioned input dimensions. It is best to keep two of the dimensions fixed and plot the running times against values of the third. Due to lack of time, I will show only running times where each of the three dimensions is relatively large. My code contains a function which will run and test all algorithms for automatically generated inputs of specified dimensions. This function is found in the F# file provided.

number of unique elements = 5000000;
number of duplicates for each unique element = 5;
length of each string = 20;

random seed = 384585;
================================================================
String Sort in C#           :   avg: 31.247667(sec.)    0.4 %
String Sort in F#           :   avg: 44.563966(sec.)    43.2 %
Sedgewick (translated to C#):   avg: 31.119719(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 145.528939(sec.)   367.6 %

Fixed parameters as above, variable string length
===============================================================
string length = 5 
-------------------
String Sort in C#           :   avg: 25.546378(sec.)    1.1 %
String Sort in F#           :   avg: 33.159264(sec.)    31.2 %
Sedgewick (translated to C#):   avg: 25.268513(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 129.865247(sec.)   413.9 %

===============================================================
string length = 10
-------------------
String Sort in C#           :   avg: 29.166639(sec.)    -6.5 %
String Sort in F#           :   avg: 46.012092(sec.)    47.4 %
Sedgewick (translated to C#):   avg: 31.210900(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 152.109868(sec.)   387.4 %

===============================================================
string length = 20
-------------------
String Sort in C#           :   avg: 37.304173(sec.)    16.2 %
String Sort in F#           :   avg: 52.745780(sec.)    64.3 %
Sedgewick (translated to C#):   avg: 32.095941(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 162.737198(sec.)   407.0 %

===============================================================
string length = 40
-------------------
String Sort in C#           :   avg: 39.446201(sec.)    9.1 %
String Sort in F#           :   avg: 52.786923(sec.)    46.0 %
Sedgewick (translated to C#):   avg: 36.151525(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 167.100856(sec.)   362.2 %
===============================================================
string length = 40
-------------------
String Sort in C#           :   avg: 44.527835(sec.)    26.9 %
String Sort in F#           :   avg: 54.424359(sec.)    55.1 %
Sedgewick (translated to C#):   avg: 35.098257(sec.)    0.0 %
Dot Net Generic Sort        :   avg: 164.124873(sec.)   367.6 %

Enjoy!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here