Introduction
The generic sort algorithm in .NET does not perform well on sorting string
s. The reason for that is that it performs too many character comparisons while comparing string
s.
There is a better algorithm described by Prof. Robert Sedgewick called multikey quicksort. Sedgewick is known for his developer friendly books on algorithms and his approach to empirically study algorithm performance. In the case of string
sort, Sedgewick provides a reference implementation of string
sort in C:
His implementation is quite hard to understand but achieves excellent performance.
String Sort Implementation in C# and F#
I provide three implementations of multikey quick sort algorithm in C# and F# (in addition to the pseudo-like code below). My implementations are slightly different from Sedgewick's and slightly slower, but much faster than the .NET implementation. Initially, I had difficulty adapting Sedgewick's implementation due to a corner case when the string
can contain the '\0
' character. In the new code, this problem is fixed and I provide an implementation in C# very close to the reference.
The problem with .NET sort is not in the implementation per se, but in the algorithm. .NET implements plain quick sort, which is one of the fastest algorithms for integers or doubles. Sedgewick proposes a few extensions to plain quick sort which make the algorithm useful for:
- data which contains duplicates
- multikey data, such as strings or arrays of characters
Additionally, Sedgewick uses two other tricks:
- insertion sort for small arrays
- pseudo-median of 9 to find a good pivot
Pseudo-code of Multikey Quicksort
The real code of string
sort is somewhat hard to understand because it uses inplace update and multiple indices. The multikey quicksort algorithm however, is not. In this section, I provide actual F# code to serve as pseudo-code. This code can serve as a high-level specification of the actual algorithm. However, this pseudo-code omits details which affect performance.
let filter = Array.filter let concatenate = Array.concat
let cmp (depth: int, s: string, t: string) = let lenS, lenT = s.Length, t.Length
assert(lenS >= depth); assert(lenT >= depth) if (lenS = depth && lenT = depth) then (*return*) 0 else if (lenS = depth) then (*return*) -1 else if (lenT = depth) then (*return*) 1 else
assert(lenS > depth); assert(lenT > depth) let chS, chT = s.[depth], t.[depth] if chS < chT then (*return*) -1 else if chS = chT then (*return*) 0 else (*return*) 1
let rec pseudoCodeMultiKeyQuickSort(input: string[], depth: int): string[] = if (input.Length = 0) then Array.empty else
let pivot = input.[input.Length / 2] let less = filter(fun s -> cmp(depth, s, pivot) < 0) input let eq = filter(fun s -> cmp(depth, s, pivot) = 0) input let gt = filter(fun s -> cmp(depth, s, pivot) > 0) input let sortedLess = pseudoCodeMultiKeyQuickSort(less, depth) let sortedEq = if (eq.Length > 0 && depth < eq.[0].Length) then pseudoCodeMultiKeyQuickSort
(eq, depth + 1) else eq let sortedGt = pseudoCodeMultiKeyQuickSort(gt, depth) (*return*)
concatenate [|sortedLess; sortedEq; sortedGt|]
let pseudoStringSort (input: string[]) = (*return*) pseudoCodeMultiKeyQuickSort(input, 0 (*depth*))
pseudoStringSort [|"aaa";"a";"b";"";"ba";"bbb"|]
Please pay attention to the comparison function (named cmp
, line 3) which takes a depth parameter in addition to the two string
s (s
and t
) to be compared. Also note how the algorithm recurses on two dimensions: subarrays and the depth parameter (see Figure 1). You can think that the algorithm operates on levels: first it sorts by the first character, finds all string
s where the first character is equal to the pivot and, for this group of string
s only, sorts by the second character. Special care has to be taken for string
s which run out of characters as the depth parameter grows. Depth grows only on line 21. In that line, we have a block of string
s which have the same prefix from 0 to depth (inclusive). As depth grows, it will eventually end up with empty string
s. If the current suffix (starting at depth) of the first string
in the block of equal string
s (line 21) is the empty string
(tested by eq.[0].Length
condition), then because all string
s in this block are equal, they are all empty. This is used as an termination condition to the recursion by depth. The other termination condition is that a subarray has become empty.
Finally, some notes on the F# syntax might be useful for those unfamiliar with F#. [|a;b;c|]
is used to denote an array of a
, b
and c
; arr.[index]
is used to access an element at position index; let x = 2
is used to bind variable x
to value 2
; fun s -> f(s)
is the F# syntax for a lambda expression, or a predicate which will take s
and run f
with argument s
and return the result of f
to the caller. The predicate may be executed multiple times.
Figure 1: Multikey quick sort recursion.
Dimensions by Which to Judge String Sorting Algorithms
The following dimensions will affect which string
sorting algorithm performs the best:
Length of Input
Small arrays vs. large Arrays. .NET Sort will perform well on small string
arrays and on arrays where the string
s are short. As the input array grows in size, or the string
s in the array grow, .NET will take much longer to terminate.
Duplicate Strings
If there are many duplicate string
s, ternary partitioning (i.e. partitioning into three parts) in multikey quicksort (as implemented for string
s by Sedgewick) will pay off for its overhead. If duplicate string
s are not expected, binary partitioning algorithms will be faster. Ternary partitioning works in the following way. A pivot is chosen and in the end the elements are partitioned into three buckets in the following order: "<= pivot
", "= pivot
", and ">= pivot
". In contrast, binary partitioning will use only two buckets: "<= pivot
" and "> pivot
". Ternary partitioning is more complicated to implement without using an additional array as it first splits the elements in four groups in the order given: "= pivot
", "< pivot
", "> pivot
", "= pivot
". Then it moves back the two "= pivot
" blocks in the middle: "< pivot
", "= pivot
", "= pivot
", "> pivot
". In addition to having bigger inner loops than binary quick sort, one iteration of ternary quick sort requires more data swapping. The size of the code in the inner loop affects greatly the performance of quick sort. However, if the "= pivot
" blocks happen to be large, then there is less work to do when recursing quick sort on the "< pivot
" and "> pivot
" blocks. In the case of multikey quick sort, the pivot is a character and most likely will match many elements.
String Length
If the input array contains larger and larger string
s, the ratio between string
sort and generic sort in general becomes larger. This case is worsened if the string
s to be sorted contain a large number of long common prefixes. This is the case for sorting words from the English Language.
Adaptation to Non-English Languages
The reference code provided sorts string
s by comparing the characters of string
s from first to last, and compares the characters by their ASCII codes. There are two problems: reverse sort order and non-English languages.
- First, even if you provide a
char
comparator sorting in reverse will not work unless the algorithm is changed. This is caused by empty string
s (or string
s that become empty as the depth parameter grows). It's possible to change the algorithm to account for reverse comparators, but providing a char
comparator only is not sufficient.
- The second problem has to do with the fact that for many languages, a
char
comparator is not enough to derive a string
comparator. The issue is outlined in an MSDN article titled:
There are some locales in which there does not exist a one-to-one correspondence between characters and letters. To sort in non-English languages, one should break the sequence of characters into letters and then sort by letters. Luckily, .NET provides a class called SortKey
which will convert a string
in a non-English language to a byte array. Sorting string
s by their corresponding byte arrays will result in the correct sort order. In this way, the multikey quicksort algorithm can be easily adapted to operate on byte arrays instead on string
s. For details on converting string
s to SortKey
s, please check the referenced MSDN article. For efficiency purposes, string
s should be converted to SortKey
s before running the sorting algorithm.
Comparisons
As explained in the previous section, careful comparison of sorting algorithms will have to account for at least the above mentioned input dimensions. It is best to keep two of the dimensions fixed and plot the running times against values of the third. Due to lack of time, I will show only running times where each of the three dimensions is relatively large. My code contains a function which will run and test all algorithms for automatically generated inputs of specified dimensions. This function is found in the F# file provided.
number of unique elements = 5000000;
number of duplicates for each unique element = 5;
length of each string = 20;
random seed = 384585;
================================================================
String Sort in C# : avg: 31.247667(sec.) 0.4 %
String Sort in F# : avg: 44.563966(sec.) 43.2 %
Sedgewick (translated to C#): avg: 31.119719(sec.) 0.0 %
Dot Net Generic Sort : avg: 145.528939(sec.) 367.6 %
Fixed parameters as above, variable string length
===============================================================
string length = 5
-------------------
String Sort in C# : avg: 25.546378(sec.) 1.1 %
String Sort in F# : avg: 33.159264(sec.) 31.2 %
Sedgewick (translated to C#): avg: 25.268513(sec.) 0.0 %
Dot Net Generic Sort : avg: 129.865247(sec.) 413.9 %
===============================================================
string length = 10
-------------------
String Sort in C# : avg: 29.166639(sec.) -6.5 %
String Sort in F# : avg: 46.012092(sec.) 47.4 %
Sedgewick (translated to C#): avg: 31.210900(sec.) 0.0 %
Dot Net Generic Sort : avg: 152.109868(sec.) 387.4 %
===============================================================
string length = 20
-------------------
String Sort in C# : avg: 37.304173(sec.) 16.2 %
String Sort in F# : avg: 52.745780(sec.) 64.3 %
Sedgewick (translated to C#): avg: 32.095941(sec.) 0.0 %
Dot Net Generic Sort : avg: 162.737198(sec.) 407.0 %
===============================================================
string length = 40
-------------------
String Sort in C# : avg: 39.446201(sec.) 9.1 %
String Sort in F# : avg: 52.786923(sec.) 46.0 %
Sedgewick (translated to C#): avg: 36.151525(sec.) 0.0 %
Dot Net Generic Sort : avg: 167.100856(sec.) 362.2 %
===============================================================
string length = 40
-------------------
String Sort in C# : avg: 44.527835(sec.) 26.9 %
String Sort in F# : avg: 54.424359(sec.) 55.1 %
Sedgewick (translated to C#): avg: 35.098257(sec.) 0.0 %
Dot Net Generic Sort : avg: 164.124873(sec.) 367.6 %
Enjoy!