In this article, I present a sorting schema heavily inspired by the Bucket and Proxmap sorting algorithms. I call it a "schema" rather than an "algorithm" simply because it introduces a (or another, depending on the sorting algorithm) "divide and conquer" element to existing sorting algorithms, aiming (in my opinion) to make them faster.
Also, let's recall that most of the common comparison-based sorting algorithms come with the time complexity
${\displaystyle O(N \log{N})} \tag{1}$
With more details ...
1. The description of the schema
Let's suppose we have a list L of N objects to sort. Now, let's imagine that we have M buckets which are pre-sorted (the buckets themselves, not their content) according to the same criterion used to sort the objects of the list L. Let's use strings as an example. If
$L=\{\text{BBBB}, \text{BBAA}, \text{AAAA},\text{...}\}$
then we can "bucket" by the 1st letter of the string, giving us a total of 26 buckets, one per letter of the English alphabet (and yes, let's ignore case sensitivity, digits and everything else to keep things easy). We keep these buckets sorted as
$B_1=\{A\}, B_2=\{B\},...,B_{26}=\{Z\}$
Then we distribute the content of L across the buckets, sort the buckets individually and, finally, merge the content of the buckets into the final sorted list (a simple traversal of the buckets, since they are already in sorted order).
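The three steps (distribute, sort each bucket, traverse the buckets in order) can be sketched in Java as follows. This is a minimal illustration, not the PoC code: the class and method names are hypothetical, and it assumes every string is non-empty and starts with an uppercase letter A to Z, so that the bucket order agrees with the natural `String` ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the schema with M = 26 first-letter buckets.
final class BucketSchemaSketch {

    /** Number of buckets, one per letter A..Z. */
    static final int M = 26;

    /** Assumption: s is non-empty and starts with an uppercase letter A..Z. */
    static int bucketIndex(String s) {
        return s.charAt(0) - 'A';
    }

    static List<String> sort(List<String> input) {
        // The buckets are created in sorted order: index 0 is A, index 25 is Z.
        List<List<String>> buckets = new ArrayList<>(M);
        for (int i = 0; i < M; i++) {
            buckets.add(new ArrayList<>());
        }
        // Step 1: distribute the content of L across the buckets.
        for (String s : input) {
            buckets.get(bucketIndex(s)).add(s);
        }
        // Steps 2 and 3: sort each bucket, then append the buckets in order.
        List<String> result = new ArrayList<>(input.size());
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
            result.addAll(bucket);
        }
        return result;
    }
}
```

Calling `BucketSchemaSketch.sort(...)` on the example list returns the same order as `Collections.sort`, as long as the stated assumption about the first character holds.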
What do we get as a result, in terms of time complexity?
2. The mathematical argument
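Assuming the objects spread roughly evenly across the buckets (about N/M objects per bucket), distributing the list costs N, sorting each of the M buckets costs (N/M)·log(N/M), and traversing the buckets for the merge adds M. So the time complexity of the schema described above is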
${\displaystyle O\left(N + M \cdot \frac{N}{M}\cdot \log{\left(\frac{N}{M}\right)}+M\right)}$
which is
${\displaystyle O\left(N + N\cdot\log{\left(\frac{N}{M}\right)}+M\right)} \tag{2}$
If we compare (1) and (2) by taking their ratio, we get
$\frac{N + N\cdot\log{\left(\frac{N}{M}\right)}+M}{N \cdot\log{N}}= 1-\frac{\log{M} - 1}{\log{N}}+\frac{M}{N\cdot\log{N}} \tag{3}$
Assuming M < N, with M large enough that log M > 2, we have
$\frac{N + N\cdot\log{\left(\frac{N}{M}\right)}+M}{N \cdot\log{N}}< 1-\frac{\log{M} - 2}{\log{N}}< 1 \tag{4}$
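The steps behind (3) and (4) can be spelled out term by term. This only restates the formulas above; the one condition made explicit is $\log{M} > 2$, which the final inequality needs.

$\frac{N + N\cdot\log{\left(\frac{N}{M}\right)}+M}{N\cdot\log{N}} = \frac{1}{\log{N}} + \frac{\log{N}-\log{M}}{\log{N}} + \frac{M}{N\cdot\log{N}} = 1-\frac{\log{M}-1}{\log{N}}+\frac{M}{N\cdot\log{N}}$

Since $M < N$, the last term satisfies $\frac{M}{N\cdot\log{N}} < \frac{1}{\log{N}}$, hence

$1-\frac{\log{M}-1}{\log{N}}+\frac{M}{N\cdot\log{N}} < 1-\frac{\log{M}-2}{\log{N}}$

and the right-hand side is below 1 exactly when $\log{M} > 2$.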
In other words, for large N and M we should expect this schema to make most of the sorting algorithms "a little bit" faster. By how much? Let's see ...
3. Specific example
Let's consider lists of strings as an example. Here is the source code of a little PoC project which generates lists of random strings, sorts them using
- (A) Java's Collections::sort and
- (B) an implementation of the schema above, which internally also uses Collections::sort to sort the content of the buckets

and compares the results. No parallelisation is used. Sorting the buckets in parallel could significantly improve the speed in the second (B) case, but no cheating!
Most of the aspects in the code are easy to tune, but we will consider only
- lists of size N=1000000 and N=3000000 and
- bucketing by the first 2 letters, giving the total number of buckets M=26^2=676
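Bucketing by the first two letters amounts to reading the prefix as a two-digit number in base 26. The sketch below is one way to do it, not necessarily how the PoC does it: the names are hypothetical, and it assumes strings of length at least 2 made of uppercase letters A to Z.

```java
// Hypothetical helper: maps a two-letter prefix to a bucket index in [0, 676).
final class TwoLetterBuckets {

    static final int ALPHABET = 26;

    /** Total number of buckets, M = 26^2 = 676. */
    static final int M = ALPHABET * ALPHABET;

    /** Assumption: s has at least 2 characters, both uppercase A..Z. */
    static int bucketIndex(String s) {
        int first = s.charAt(0) - 'A';
        int second = s.charAt(1) - 'A';
        // "AA" is bucket 0, "AB" is bucket 1, ..., "ZZ" is bucket 675,
        // so ascending bucket index matches ascending prefix order.
        return first * ALPHABET + second;
    }
}
```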
Plugging these numbers into formula (3), using the natural logarithm, we should expect the following ratios

| M=26^2 and N=1000000 | M=26^2 and N=3000000 |
|---|---|
| 0.60 | 0.63 |
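These two values can be reproduced with a few lines of Java. The sketch evaluates the right-hand side of formula (3) with M = 26^2 = 676; the natural logarithm (`Math.log`) is an assumption on my side, chosen because it is the base that reproduces the values above. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper evaluating the right-hand side of formula (3).
final class ExpectedRatio {

    /** 1 - (log M - 1) / log N + M / (N log N), with natural logarithms. */
    static double ratio(double n, double m) {
        double logN = Math.log(n);
        return 1.0 - (Math.log(m) - 1.0) / logN + m / (n * logN);
    }

    public static void main(String[] args) {
        double m = 26 * 26; // bucketing by the first 2 letters
        System.out.println(String.format(Locale.ROOT, "N=1000000: %.2f", ratio(1_000_000, m)));
        System.out.println(String.format(Locale.ROOT, "N=3000000: %.2f", ratio(3_000_000, m)));
    }
}
```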
An improvement of nearly 40%!? No way ...
4. Actual results
One more test parameter I should mention (and which is not part of formula (3)) is X, the length of each string added to the list L. It is there to account for Java string comparison, where string length plays a role. We will look at X=10, X=50 and X=100. Here are some results from running the PoC code on my fairly old computer with an Intel i5-3330 3.00GHz 4-core CPU, using Java 17
| Reported result | X=10, M=26^2 and N=1000000 | X=10, M=26^2 and N=3000000 |
|---|---|---|
| Fast Sort avg | 422.33ms | 1484.24ms |
| Std Sort avg | 677.96ms | 2395.39ms |
| Ratio (Fast / Std) | $\frac{422.33}{677.96}\approx 0.6229$ | $\frac{1484.24}{2395.39}\approx 0.6196$ |
A few more results
| Reported result | X=50, M=26^2 and N=3000000 | X=100, M=26^2 and N=3000000 |
|---|---|---|
| Fast Sort avg | 1545.82ms | 1561.84ms |
| Std Sort avg | 2773.65ms | 2883.08ms |
| Ratio (Fast / Std) | $\frac{1545.82}{2773.65}\approx 0.5573$ | $\frac{1561.84}{2883.08}\approx 0.5417$ |
5. Conclusions
Well, the test results are not too far off from the calculated results. So, the schema does improve the sorting ... However, I should mention the following:
And finally, the schema allows for
- parallelisation, since the content of the buckets can be sorted in parallel, and
- incremental updates: in a data streaming scenario, if one bucket is updated, we don't have to re-sort the entire list L
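Both points can be illustrated with a small Java sketch. It is not part of the PoC (which deliberately avoids parallelisation), and the class and method names are hypothetical. It assumes the buckets are independent mutable lists, already kept in sorted bucket order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: parallel bucket sorting and single-bucket updates.
final class BucketMaintenance {

    /** Sorts every bucket; each bucket is a separate list, so they can be sorted concurrently. */
    static void sortAll(List<List<String>> buckets) {
        buckets.parallelStream().forEach(bucket -> Collections.sort(bucket));
    }

    /** Streaming-style update: only the affected bucket is re-sorted. */
    static void addAndResort(List<List<String>> buckets, int bucketIndex, String value) {
        List<String> bucket = buckets.get(bucketIndex);
        bucket.add(value);
        Collections.sort(bucket);
    }

    /** Traverses the buckets in order to build the final sorted list. */
    static List<String> merge(List<List<String>> buckets) {
        List<String> result = new ArrayList<>();
        for (List<String> bucket : buckets) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

Whether the parallel version actually pays off depends on the bucket sizes and the number of cores, so it would need to be measured rather than assumed.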