
Fastest Sort Algorithm

19 Aug 2020 · CPOL · 20 min read
In-place merge sort on Multi-core and Cache
This is an introduction to my original in-place merge sort algorithm. First, I consider what the fastest sort is from a theoretical analysis. Second, I explain my sort program and how it is adapted to the cache and to multi-core CPUs. Third, I confirm by measurement that it reaches top-level speed. The conclusion is that, judged by its level of cache and multi-core utilization, my sort can be called the fastest.

Introduction

This article is about an in-place Merge Sort algorithm, fitted with parallel processing and designed to reduce transfers between the cache and main memory. I wrote the previous version at the time I had just confirmed that the breakthrough idea worked, and it included many incomplete parts. I have since considered all of them and the algorithm has reached an almost consistent level, so I am publishing this revised edition.

I have not changed my opinion that this is the fastest sort algorithm. However, at present I have only confirmed its operation in C#, and a C# program never runs at the fastest possible speed. Still, I think the content of this article is useful to readers who are annoyed by slow sorting speed.

If I explained everything in detail, this article would become as long as a book, so I apologize that it stays at a summary level. I hope you will read the attached C# code and work out the parts that are not explained here. I have tried to keep the program easy to read (though some of my conventions are not common).

The Article Body

Competition of Sorting Time

The time a CPU spends executing a sort can be expressed by the formula below.

Sorting time = (1) time of the sort process itself + (2-1) time of one comparison (evaluation) * (2-2) comparison count + (3-1) time of one data move * (3-2) move count

(1) can be ignored because it is too small compared with the others. (2-2) and (3-2) are the values that characterize a sort algorithm's speed, so I measured these counts for representative sorting algorithms (Sortings.cs).

I omit explanations of the individual algorithms because they can easily be found on the internet.

To guarantee that the counts are fair for every algorithm, the data is encapsulated so that no algorithm can operate on it directly (DataArray.cs).
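
As an illustration of this idea (the attached DataArray.cs may be organized differently), a counting wrapper can look like the sketch below: every comparison and move goes through it, so all algorithms are counted by the same rule and none of them can touch the data directly. The class and method names are mine, for illustration only.

C#
// Minimal sketch of a counting wrapper (illustrative; the attached DataArray.cs may differ).
class CountingArray
{
    private readonly int[] data;

    public long Comparisons { get; private set; }   // (2-2) comparison count
    public long Moves       { get; private set; }   // (3-2) move count

    public CountingArray(int[] source) { data = (int[])source.Clone(); }

    public int Length => data.Length;

    // One comparison: is element i smaller than element j?
    public bool Less(int i, int j) { Comparisons++; return data[i] < data[j]; }

    // One move: copy element 'from' onto position 'to'.
    public void Move(int from, int to) { Moves++; data[to] = data[from]; }

    // A swap is counted here as two moves; the attached code may count differently.
    public void Swap(int i, int j) { Moves += 2; (data[i], data[j]) = (data[j], data[i]); }
}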

I omit Radix sort because it does not apply to every sorting demand.

I also gave up measuring the commonly known hybrid algorithms below. To count their operations, I would have to implant counting code into their original programs, and I cannot judge whether such implants would be correct. Converting them to the shape this measurement program requires is also difficult, and I cannot judge whether the conversion would be faithful.

These sorts can be classified into the two families below. I regard each of them as an improvement of its base algorithm and not too far from it.

  • <Quicks> GNU QuickSort, Pattern-defeating QuickSort (PdqSort), IntroSort
  • <Merges> TimSort, QuadSort, GNU Stable Sort, Block Sort

The result table is below.

The values for 8/10/12 elements are the results over all combinations of input order. The values for 100/512/1000 elements are the results of the same number of sorting cycles.

This is the limit of what my PC can finish in a realistic time while retaining the measurement data.

Comparison and move count measured for each sorting algorithm
(row columns: data amount; comparison minimum, maximum, average; move minimum, maximum, average)
Selection sort
8 28 28 28.00000 0 14 10.56429
10 45 45 45.00000 0 18 14.14206
12 66 66 66.00000 0 22 17.79358
100 4950 4950 4950.00000 180 198 190.12000
512 130816 130816 130816.00000 996 1022 1010.32813
1000 499500 499500 499500.00000 1966 1998 1985.18800
Insertion sort
8 7 28 19.28214 0 35 19.28214
10 9 45 29.57103 0 54 29.57103
12 11 66 41.89679 0 77 41.89679
100 2135 2987 2577.11000 2133 2988 2577.17000
512 61020 71182 65994.41992 61016 71181 65994.26172
1000 236268 268100 251069.41000 236269 268097 251069.47100
Shell sort
8 11 26 17.91667 0 32 16.24048
10 15 37 25.51667 0 42 22.72937
12 21 52 34.71905 0 59 29.23622
100 689 812 732.76000 527 670 587.21000
512 5544 6101 5785.70508 4258 4964 4579.62109
1000 12609 13541 13053.70600 9739 10864 10286.34600
Binary Insertion sort
8 13 17 15.59048 0 35 19.28214
10 19 25 22.21270 0 54 29.57103
12 25 33 29.42482 0 77 41.89679
100 520 540 530.77000 2133 2988 2577.17000
512 3876 3936 3903.72070 61016 71181 65994.26172
1000 8547 8629 8586.72000 236269 268097 251069.47100
Bubble Sort
8 7 28 25.99529 0 56 28.00000
10 9 45 41.99319 0 90 45.00000
12 11 66 61.90309 0 132 66.00000
100 4544 4950 4875.77000 4080 5786 4964.76000
512 128041 130816 130381.95703 121026 141350 130978.04688
1000 491874 499500 498604.20500 470546 534210 500153.79200
Gnome Sort
8 7 56 33.28214 0 56 28.00000
10 9 90 52.07103 0 90 45.00000
12 11 132 74.89679 0 132 66.00000
100 4175 5880 5059.49000 4080 5786 4964.76000
512 121533 141857 131483.44336 121026 141350 130978.04688
1000 471541 535205 501146.30600 470546 534210 500153.79200
QuickSort
8 22 43 31.45833 0 28 14.58571
10 32 60 43.49929 0 38 19.89810
12 42 79 55.82187 0 50 25.64283
100 766 1048 837.44000 312 406 370.04000
512 5399 6559 5726.65039 2338 2618 2476.96094
1000 11612 13847 12314.54900 5088 5520 5300.98800
QuickSort (No overlapping?)
8 13 19 15.77619 0 28 14.58571
10 19 29 22.58571 0 38 19.89810
12 25 41 30.02121 0 48 25.46762
100 531 746 574.37000 316 396 358.28000
512 4007 5109 4329.97852 2248 2534 2398.58594
1000 8908 10996 9596.64700 4884 5332 5141.92800
HeapSort
8 21 31 26.53512 26 56 40.04683
10 31 47 39.49517 36 82 56.62910
12 41 65 53.17458 46 110 74.08227
100 1029 1087 1056.76000 1190 1312 1247.64000
512 7715 7928 7817.87891 8620 9044 8812.53906
1000 17073 17392 17226.73000 18894 19480 19185.00800
HeapSort (Avoid comparison with ancestor)
8 16 30 23.39206 26 56 40.04683
10 23 44 34.28377 36 82 56.62910
12 30 61 46.54484 46 110 74.08227
100 920 1003 961.27000 1190 1312 1247.64000
512 7170 7447 7313.25391 8620 9044 8812.53906
1000 15995 16479 16228.24500 18894 19480 19185.00800
Top Down Merge Sort (Not-In-place)
8 12 17 15.73333 48 48 48.00000
10 15 25 22.66667 68 68 68.00000
12 20 33 29.95238 88 88 88.00000
100 526 554 541.33000 1344 1344 1344.00000
512 3924 4009 3962.76367 9216 9216 9216.00000
1000 8656 8757 8707.68900 19952 19952 19952.00000
Merge (2^N) Sort (Bottom Up, Not-In-place)
8 12 17 15.73333 48 48 48.00000
10 15 27 23.84444 72 72 72.00000
12 20 33 30.35556 88 88 88.00000
100 535 574 557.76000 1376 1376 1376.00000
512 3924 4009 3962.76367 9216 9216 9216.00000
1000 8661 8771 8719.75200 19968 19968 19968.00000
Merge (2^N) Sort, Half copy (Bottom Up, Not-In-place)
8 12 17 15.73333 24 24 24.00000
10 15 27 23.84444 40 40 40.00000
12 20 33 30.35556 48 48 48.00000
100 535 574 557.76000 700 700 700.00000
512 3924 4009 3962.76367 4608 4608 4608.00000
1000 8661 8771 8719.75200 10000 10000 10000.00000

Each algorithm's nature appears more clearly here than in the conventional Landau (big-O) notation or in sorting-time test results.

QuickSort, HeapSort and the bottom-up Merge Sort each appear in two variants. One is the original; the other is a faster version that alters the wasteful part as little as possible.

The faster QuickSort variant is very ugly. GNU QuickSort gives up before reaching that point and finishes with an Insertion sort instead. GNU QuickSort is the right choice for practical use; my variant is only an experiment.

As the bottom-up Merge Sort, I chose my own idea of fixing the sorted block size to 2^N, because it is shared with the in-place Merge Sort of the next chapter. This speed-up version uses a way that is only possible in an object-oriented language. QuadSort achieves the same level of move-count reduction for any buffer copying, so among merge sorts that use a buffer, QuadSort is the better one.

In usual time-measurement tests, one comparison and one move take almost the same time. Because of that, the sum of the comparison count and the move count in this table can be regarded as a time-measurement result. Looking at the table, QuickSort is well balanced between comparisons and moves, which agrees with QuickSort being the fastest in time-measurement tests.

But let me warn you of one thing here. The time of one move is almost the same in a time-measurement test and in a real sort, whereas the time of one comparison in a real sort becomes larger than in the test. This is the main factor that makes real sorting slow. QuickSort's comparison count is not the smallest; Merge Sort's is smaller.

I ran a time-measurement test of GNU QuickSort against the in-place Merge Sort that I explain later. GNU qsort is clearly faster in a pure timing test, but after adding only a 1 usec wait to the comparison, the in-place Merge Sort becomes faster. Merge Sort is better when the comparison process is heavy, and I think that is the common sorting demand.

If you look at this table carefully, you may raise an objection. In fact, the sort with the smallest comparison count is Binary Insertion Sort. Its move count is far too large, so it can never be the fastest in a time-measurement test. Sorting an array is the unspoken assumption here, and in an array its move count is O(N^2). In a linked list, however, its move count is O(N). Isn't the combination of the data structure "linked list" and the algorithm "Binary Insertion Sort" the fastest? To confirm this, I attached a linked-list structure to the array (DataList.cs). The tables below show its count measurements.

Binary Insertion sort (List): comparison count and list move count (columns as in the table above)
8 13 17 15.59048 0 21 15.84643
10 19 25 22.21270 0 27 21.21310
12 25 33 29.42482 0 33 26.69037
100 520 540 530.77000 267 294 284.37000
512 3876 3936 3903.72070 1500 1533 1515.71484
1000 8547 8629 8586.72000 2943 2994 2977.72500
Binary Insertion sort (List): comparison count and data move count (columns as in the table above)
8 13 17 15.59048 0 14 10.56429
10 19 25 22.21270 0 18 14.14206
12 25 33 29.42482 0 22 17.79358
100 520 540 530.77000 180 198 190.12000
512 3876 3936 3903.72070 996 1022 1010.32813
1000 8547 8629 8586.72000 1966 1998 1985.18800

I built this linked-list part into the sort process. It is the sort algorithm with the smallest counts for both comparisons and moves. I named it BillSort, short for "Binary Insertion using tiny Linked List sort", and measured its time. Contrary to expectation, it is extremely slow, an almost endless sort.

I extracted the slow spot below. It takes more than ten times as long as everything else combined. Thinking only about CPU processing time, we can never understand what wastes so much time.

C#
// Walk 'half' steps along the linked list to reach the midpoint of the binary search
// ('index' is an alias used in the attached code).
for (index k = half; k > 0; k--) {
    j = Next[j];
}

The reason it seems endless is the O(N^2) transfer time between the cache and main memory. We cannot hope for high speed from an algorithm that uses a linked list.

It is certain that this algorithm's CPU processing time is the minimum among all sorting algorithms. If the whole RAM of a computer worked at L1-cache speed, it would take the fastest position; such a computer may appear in the future, so I leave it in this article. But as a process on a present PC, I abandon the possibility that Binary Insertion Sort becomes the fastest.

We must be conscious of the memory access of a process, not only its CPU execution time.

In-place Merge Sort

I concluded in the previous chapter that Merge Sort is suitable for general sorting.

But there is one problem. Merge Sort is not in-place; it needs O(N) extra area besides the sorted array. I now think this is a silly problem, for two reasons: a copy of the whole array is not actually necessary, and PC memory has become sufficiently large since the time the algorithm was born and the issue was first discussed.

Thinking of the cache, however, an in-place Merge Sort is desirable, and that is the algorithm I show now. Its base methods are the bottom-up approach and limiting block sizes to 2^N.

The following three things must be realized for it (InPlaceMerge.cs):

  1. Exchanging two adjacent areas of different sizes.
  2. Merging two areas of different sizes: finding a fast process for it.
  3. The move count of in-place Merge Sort is O(N(log N)^2), while that of not-in-place Merge Sort is O(N log N). When the number of array elements is large, not-in-place surely gets ahead of in-place, so an O(N log N) method for the in-place version is necessary.

(1) Exchanging two adjacent areas of different sizes

Shifting the elements one by one is the method that comes to mind first of all. Knuth proposed a faster idea, but there is still no realized program of it. That is all I could find in internet articles.

But a good way is already used in GNU stable sort: reverse each area independently, then reverse the two connected areas as a whole.
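
For reference, below is a minimal sketch of that reversal exchange as I understand it (my own illustration, not the GNU source).

C#
// Exchange the adjacent areas a[lo..mid) and a[mid..hi) by three reversals
// (the method used in GNU stable sort, sketched here for reference).
static void ReversalExchange(int[] a, int lo, int mid, int hi)
{
    Array.Reverse(a, lo, mid - lo);   // reverse the left area
    Array.Reverse(a, mid, hi - mid);  // reverse the right area
    Array.Reverse(a, lo, hi - lo);    // reverse the connected whole
}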

Before I knew of that, I had found a different way:

  • Shift the smaller area by an integer multiple of its own length. (This shift must not exceed the larger area's length.)
  • If (larger area length)/(smaller area length) is an integer, the process finishes. Otherwise, repeat the first step between the shifted smaller area and the surplus of the larger area.
  • Repeat this until the surplus disappears.

This is exactly the same algorithm as the calculation of a greatest common divisor, so I name it GCD exchange.

Suppose the two areas have sizes M and N. The GNU way always needs 2(M+N) moves. GCD exchange needs at least (M+N) moves and only approaches 2(M+N) in the worst case. The measurement results are in the table below. It can also divide the shifted size so that it stays within the cache size, so I think GCD exchange is superior to the GNU way in practical use. The program list looks a little complex, but if it were realized as a logic circuit, it would be of the same level of complexity as the other ways.
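
Below is a minimal sketch of the GCD exchange, written only for illustration; the attached InPlaceMerge.cs is organized differently and counts moves in its own way. The adjacent areas have sizes m and n, and the smaller area is repeatedly swapped forward in blocks of its own length until the surplus disappears.

C#
// Exchange the adjacent areas a[lo..lo+m) and a[lo+m..lo+m+n) so that they swap places.
// Illustrative GCD-style block exchange; not the attached InPlaceMerge.cs.
static void GcdExchange(int[] a, int lo, int m, int n)
{
    while (m > 0 && n > 0)
    {
        if (m <= n)
        {
            // Shift the smaller left area to the right by whole multiples of its length.
            int steps = n / m;
            for (int s = 0; s < steps; s++) { SwapBlocks(a, lo, lo + m, m); lo += m; }
            n -= steps * m;                  // surplus of the larger area remains
        }
        else
        {
            // Symmetric case: the right area is smaller, shift it to the left.
            int hi = lo + m + n;
            int steps = m / n;
            for (int s = 0; s < steps; s++) { SwapBlocks(a, hi - 2 * n, hi - n, n); hi -= n; }
            m -= steps * n;
        }
    }
}

static void SwapBlocks(int[] a, int i, int j, int len)
{
    for (int k = 0; k < len; k++) (a[i + k], a[j + k]) = (a[j + k], a[i + k]);
}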

Merge (2^N) Sort (In-place, GNU exchange)
8 12 17 15.73333 0 46 25.02857
10 15 27 23.84444 0 78 40.96190
12 20 33 30.35556 0 110 57.40433
100 533 573 557.85000 2682 3868 3307.00000
512 3910 4007 3962.97852 63900 76296 70696.91406
1000 8654 8771 8719.70700 245960 285706 266366.66400
Merge (2^N) Sort (In-place, GCD exchange)
8 12 17 15.70476 0 33 18.49524
10 15 27 23.26032 0 52 28.60635
12 20 33 30.14113 0 72 39.64978
100 530 568 553.34000 1550 2159 1867.88000
512 3910 3994 3953.73047 33723 39902 37093.05664
1000 8640 8754 8700.99400 126922 146627 137043.47400

(2) Investigating a fast merge of two areas of different sizes

Generally, Merge Sort means the top-down version; the bottom-up version sometimes gets no regard.

We can see in the previous chapter's table that top-down Merge Sort is faster than bottom-up at some points. Top-down always merges areas of almost the same size, while bottom-up must sometimes merge areas of largely different sizes. We must handle this case to keep the speed up.

Below is the table for the case of merging 1 to 8 elements into an 8-element area. Each cell is the number of combinations for a given location of the top element of the inserted elements within the 8-element area (9 possible locations).

Area data amount: 8
                          Insert data amount
Top element position      1     2     3     4     5     6     7     8
 1                        1     9    45   165   495  1287  3003  6435
 2                        1     8    36   120   330   792  1716  3432
 3                        1     7    28    84   210   462   924  1716
 4                        1     6    21    56   126   252   462   792
 5                        1     5    15    35    70   126   210   330
 6                        1     4    10    20    35    56    84   120
 7                        1     3     6    10    15    21    28    36
 8                        1     2     3     4     5     6     7     8
 9                        1     1     1     1     1     1     1     1
Total                     9    45   165   495  1287  3003  6435 12870

This is the Σ (summation) formula calculation we learned in high school.

  • More than half (4 to 8 elements): merging from the top as usual is no problem. (Precisely speaking, for 4 to 7 elements the expected position of the element next to the top is around its center.)
  • 1 element: binary search and insertion; this is binary search itself (see the sketch after this list).
  • 2 elements: a multiplication and the subtraction of a decrement value. Such a calculation is no real burden for the CPU.
  • Others: a calculation with an N-th power formula and the subtraction of an (N-1)-th power formula. When the element amount is large, this exceeds negligible time and complexity. Since it is only the initial value of the search, not the result, I decided to approximate the start value: when the element amount exceeds 2, calculate the 2-element value and multiply by 1/2^(N-2).
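
For the one-element case, a minimal sketch could look like the following (the names are my own illustration, not the attached code): the single element sits just after the sorted area, its slot is found by binary search, and the tail of the area is shifted once.

C#
// Merge the single element a[hi] into the sorted area a[lo..hi).
// Illustrative sketch: binary search for the slot, then one block shift.
static void InsertOne(int[] a, int lo, int hi)
{
    int key = a[hi];
    int left = lo, right = hi;
    while (left < right)                      // binary search: about log2(hi - lo) comparisons
    {
        int mid = (left + right) / 2;
        if (a[mid] <= key) left = mid + 1; else right = mid;
    }
    Array.Copy(a, left, a, left + 1, hi - left);  // shift the tail right by one place
    a[left] = key;                                // drop the element into its slot
}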

This alteration has already been applied to the result tables above, so the counts are improved compared with the previous article.

(3) Making the exchange count of in-place Merge Sort O(N log N)

I said in the previous article that, when the data amount is large, switching to not-in-place Merge Sort was necessary.

O(N(log N)^2) is unavoidable as long as the merge starts from the top. But I have found an idea that decreases the move count of the in-place merge.

Merge from the center and divide the area, like Quick sort. First, take the one center element of the larger area and search for its insertion location in the smaller area. The probability density of that location is mountain-shaped with its peak at the center, so search outward from the center of the smaller area, one step at a time. One merge is thus divided into two smaller merges. A massive area moves at once and the total move count drops straight away. I name it center-to-center merge.
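
The sketch below shows the general shape of such a center-dividing in-place merge, written only for illustration: it takes the center element of the larger run, finds its insertion point in the other run with a plain binary search (rather than the step-by-step search from the center described above), rotates the two inner blocks, and leaves two independent sub-merges that can also run in parallel.

C#
// Illustrative center-to-center style merge of the sorted runs a[lo..mid) and a[mid..hi).
// Not the attached code: the insertion point is found by a plain binary search here.
static void CenterMerge(int[] a, int lo, int mid, int hi)
{
    int m = mid - lo, n = hi - mid;
    if (m == 0 || n == 0) return;
    if (m == 1 && n == 1) { if (a[lo] > a[mid]) (a[lo], a[mid]) = (a[mid], a[lo]); return; }

    int cut1, cut2;
    if (m >= n)
    {
        cut1 = lo + m / 2;                        // center element of the larger (left) run
        cut2 = LowerBound(a, mid, hi, a[cut1]);   // its insertion point in the other run
    }
    else
    {
        cut2 = mid + n / 2;                       // center element of the larger (right) run
        cut1 = UpperBound(a, lo, mid, a[cut2]);   // its insertion point in the other run
    }

    Rotate(a, cut1, mid, cut2);                   // a massive area moves at once
    int newMid = cut1 + (cut2 - mid);
    CenterMerge(a, lo, cut1, newMid);             // two smaller, independent merges:
    CenterMerge(a, newMid, cut2, hi);             // they could be pushed to a stack or run in parallel
}

static int LowerBound(int[] a, int lo, int hi, int key)
{
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] < key) lo = mid + 1; else hi = mid; }
    return lo;
}

static int UpperBound(int[] a, int lo, int hi, int key)
{
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] <= key) lo = mid + 1; else hi = mid; }
    return lo;
}

static void Rotate(int[] a, int lo, int mid, int hi)   // exchange a[lo..mid) and a[mid..hi)
{
    Array.Reverse(a, lo, mid - lo);
    Array.Reverse(a, mid, hi - mid);
    Array.Reverse(a, lo, hi - lo);
}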

The degree of N in its Landau symbol clearly goes down. But I cannot prove that it is truly O(N log N), because I am not an expert in mathematical theory.

The comparison count increases; that is a side effect.

Merge (2^N) Sort (In-place, Center-to-center merge)
8 13 23 18.46190 0 33 19.00952
10 18 35 27.55238 0 50 29.18730
12 24 44 35.52915 0 70 39.98225
100 637 698 663.45000 1130 1339 1219.56000
512 4741 4963 4855.43359 11516 12248 11881.91992
1000 10551 10890 10736.78400 28425 29889 29095.46700

But it has clear merits.

  • Compared with the original merge, transfers between the cache and main memory are reduced.
  • Plural merges can be processed in parallel.

A not-in-place merge that uses a buffer cannot realize this. The algorithm becomes a pure in-place Merge Sort process, finally free of the buffer's spell.

Switching between the two in-place merge methods seems good: within the cacheable size, merge with the "from the top" method and build sorted blocks; when the size exceeds the cache, merge with the center-to-center method.

I think this comes almost to the terminus of in-place Merge Sort on a single core. Though some adjustment may still be necessary, the necessary techniques should all have appeared by now. "Merge, In-place, Cache adaptable, Exponent of 2 (2^N)": I name it MiceSort for short.

MiceSort (Merge,In-place,Cache adaptable,Exponent of 2(2^N))
8 12 17 15.69048 0 33 18.49524
10 16 25 22.60159 0 52 28.60635
12 21 35 29.88644 0 72 39.64978
100 522 555 541.98000 1550 2159 1867.88000
512 3917 3996 3955.87891 24785 29228 27130.36328
1000 8656 8798 8722.83000 64481 75087 70091.43700

In the measurement result, the comparison count is almost the same as the original and the move count decreases. I converted it to C and ran a time competition against GNU qsort. It is slower than qsort when the comparison is a pure comparison only, but it becomes faster when the comparison includes a 1 usec wait.

Parallel Sorting

The in-place merge process can be transformed into parallel processing (Parallel.cs). I intend to document this process as well as I can.

  1. Run as many sort robots as there are CPU cores. These robots all do the same operation; each robot cleans up parts of the sort process independently, and when every part has been tidied up, the sort is finished.

    If you run all the identical cleaning robots you have at the same time, the time to clean a spacious room is shortened to 1/(number of robots). That is the image of this process.

    We have all experienced breakdowns with "Unhandled exception", or deadlocks, when tasks are launched one after another asynchronously at every parallelizable point. Even if we intend to limit the tasks to the core count, a silly modification makes us encounter an exception for an unclear reason.

    One radical settlement is to stop parallel processing. The other is the method shown here: the same number of tasks as cores keeps running. Otherwise, no one can predict that the parallel process works safely in every circumstance.
  2. Divide the sort data into cacheable-size blocks. One kind of work turns these blocks into sorted blocks; the other merges sorted blocks into larger merged blocks and carries the merge up to higher levels. These two states are necessary.

    Prepare an array that records each block's situation (not sorted, under sorting, or sorted) and its size. Each sort robot refers to this array, finds the unprocessed work, and cleans up what it must do.

    When the top-level merge finishes, the sort is complete.

    Sort data is non-uniform, a mix of almost-sorted places and random places. The cache-size blocks also have the effect of smoothing the workload of each core.

    The previous article ended here. During the climactic large merges, the working cores faded away one after another, and the top-level merge of the whole array was done entirely by one core. That is very wasteful for a CPU with many cores, so I added the next process.
  3. In the merge of sorted blocks, use the center-to-center merge method. By that, the merge is divided into many smaller merges; push these merges onto a stack. When every merge in the stack has been cleared off, the merge finishes.
  4. Make a management center where the stacks of all cores are gathered. This center can be referred to from every core. When a core has finished all the jobs it must do, it takes a merge from this center, the largest merge at the bottom of a stack.

In this way, each core keeps working in parallel until the very end.
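
Below is a deliberately simplified, runnable sketch of the block scheme only (my illustration, not the attached Parallel.cs): the array is cut into blocks, each block is sorted in parallel, and neighbouring blocks are then merged level by level in parallel. It uses a scratch buffer and Parallel.For in place of the article's fixed worker tasks, in-place center-to-center merge and shared merge stacks.

C#
// Simplified parallel block sort: not the attached "Both sides now sort",
// only an illustration of "sort cache-sized blocks, then merge them level by level".
using System;
using System.Threading.Tasks;

static class ParallelBlockSort
{
    public static void Sort(int[] a)
    {
        int cores = Environment.ProcessorCount;
        int blockSize = Math.Max(1, (a.Length + cores - 1) / cores);
        var buffer = new int[a.Length];          // scratch buffer; the article's real merge is in-place

        // Phase 1: sort every block independently, roughly one block per core.
        Parallel.For(0, cores, b =>
        {
            int lo = b * blockSize;
            if (lo >= a.Length) return;
            Array.Sort(a, lo, Math.Min(blockSize, a.Length - lo));   // stand-in for MiceSort
        });

        // Phase 2: merge adjacent sorted blocks, doubling the block width each round.
        for (int width = blockSize; width < a.Length; width *= 2)
        {
            int w = width;
            Parallel.For(0, (a.Length + 2 * w - 1) / (2 * w), pair =>
            {
                int lo = pair * 2 * w;
                int mid = Math.Min(lo + w, a.Length);
                int hi = Math.Min(lo + 2 * w, a.Length);
                Merge(a, buffer, lo, mid, hi);
            });
        }
    }

    static void Merge(int[] a, int[] buf, int lo, int mid, int hi)
    {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) buf[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        while (i < mid) buf[k++] = a[i++];
        while (j < hi) buf[k++] = a[j++];
        Array.Copy(buf, lo, a, lo, hi - lo);
    }
}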

I name it "Both sides now sort". From song "Both sides now" lyrics.

Though the present situation is not cheerful, I believe this is the fastest sort algorithm, that it comes close to the possible limit of sorting on a multi-core CPU, and that it is a foundation for future parallel sorting.

I use goto statements in this program. In fact, I first made a goto-less structured version, but I threw it away because it made the purpose of the process difficult to guess. Keeping the process order readable from top to bottom is impossible now, obstructed by parallelism, asynchrony, callbacks and so on. For a readable description, using goto statements efficiently is now necessary. That is my opinion.

Notes on the Attached Program

One note is that it sometimes judges the sorting result to be not good.

If you encounter this problem when executing the attached code, you must increase the number in Sleep(1) in the program until the result is OK. The current number suits a small number of data elements.

I know a lock is the proper settlement for this, but a lock slows down the sorting, so I judged that waiting after the sort is better.

The other note is that the present program does not work fastest; it is slower than the single-core MiceSort. The reason may be that I test it on an i5 notebook with only 2 cores.

But I think points to improve still remain. I have used lock statements everywhere a conflict can occur, for safety, and I already know that some of those locks can be removed. Reconsideration of the locking is necessary.

A speed check on Windows C# may be meaningless anyway. The process time varies largely with conditions, such as merely moving the location of a process or whether Visual Studio is running, and I cannot stop the background tasks of Windows.

Remaining Work

The attached program includes constants such as the cache size. The values in it are empirically fast values on my i5 notebook; they are not tuned to the theoretically fastest values. They differ by CPU, so they will differ on your PC. In real sorting, if the comparison process uses many work areas or tables, these values should also differ. On a CPU with few cores, like the one I use now, OS tasks occupy some of the cache of the sorting core. Even so, I have seen speed differences when changing these constants in this experiment: because a CPU has several levels of cache, consciousness of the cache in the program affects the speed.

Fixing these values properly is impossible for me. The person who can do it is a CPU maker's engineer who knows the cache control logic.

Or only the user programmer can define the values that are efficient for his own purpose.

I attempted to convert "Both sides now sort" to C for the competition with qsort, but parallel processing in C is environment-dependent. The C# shape is easy and convenient for me even if it is slow. Converting the program to work on Linux is meaningless for me, since I do not have an expensive many-core PC. Until an environment-independent parallel construct of C arrives, I postpone it.

Before that, I intend to revise the parts around the locks.

Though I feel this may be the last one, I write about this theme.

History

  • 19th August, 2020: Initial version
  • 27th January, 2021: Updated article and source code

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)