This article introduces my original in-place merge sort algorithm. First, I consider what the fastest sort is from a theoretical analysis. Second, I explain my sort program and how it is adapted to the cache and to multi-core CPUs. Third, I confirm from measurements that it reaches top-level speed. The conclusion is that, judged by how well it uses the cache and multiple cores, my sort can be called the fastest.
Introduction
This article is about an in-place merge sort algorithm, fitted with parallel processing and designed to reduce the data transfers between the cache and main memory. I wrote the previous version right after I had confirmed that the breakthrough idea worked, and it contained many incomplete parts. I have since reconsidered everything, and the material has reached an almost consistent level, so I am publishing this revised edition.
My opinion that this is the fastest sorting algorithm has not changed at all. However, the present state is only a confirmation of its operation in C#, and a C# program never runs at top speed. Even so, I think the content of this article is useful to readers who are troubled by slow sorting.
If I explained everything in detail, this article would become as long as a book, so I apologize that it stays at the summary level. I hope you will read the attached C# listings and work out the parts that are not explained here. I have tried to keep the programs easy to read, though some of my conventions are not common.
The Article Body
Competition of Sorting Time
The time a CPU spends executing a sort can be expressed by the formula below.
Sorting time = (1) time of the sort bookkeeping itself + (2-1) time of one data comparison × (2-2) comparison count + (3-1) time of one data move × (3-2) move count
(1) can be ignored because it is too small compared with the others. (2-2) and (3-2) are the values that characterize the speed of a sorting algorithm, so I actually measured these counts for representative sorting algorithms (Sortings.cs).
I omit the explanation of the individual algorithms because they can easily be found on the internet.
To make the counting fair for every algorithm, the data are encapsulated so that no algorithm can operate on them directly (DataArray.cs).
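As an illustration of this kind of instrumentation, here is a minimal sketch of a counting wrapper. The class and member names are my own assumptions and are not taken from the attached DataArray.cs; the rule that a swap counts as two moves is also an assumption.

```csharp
// Minimal sketch of an instrumented container (assumed names; not the attached DataArray.cs):
// the algorithms may only compare and move through these methods,
// so every algorithm is charged for exactly the same operations.
class CountingArray
{
    private readonly int[] data;
    public long Comparisons { get; private set; }
    public long Moves { get; private set; }

    public CountingArray(int[] source) { data = (int[])source.Clone(); }

    public int Length => data.Length;

    // Counts one comparison and returns the usual negative/zero/positive result.
    public int Compare(int i, int j)
    {
        Comparisons++;
        return data[i].CompareTo(data[j]);
    }

    // Counts one element move.
    public void Move(int from, int to)
    {
        Moves++;
        data[to] = data[from];
    }

    // A swap is counted as two element moves here (an assumption about the counting rule).
    public void Swap(int i, int j)
    {
        Moves += 2;
        (data[i], data[j]) = (data[j], data[i]);
    }
}
```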
I omit radix sort because it cannot serve every sorting demand.
I also gave up measuring the commonly known hybrid algorithms below. Counting their operations would require implanting counting code into their original programs, and I cannot judge whether such implants would be correct; converting those programs into the shape this measurement needs is difficult, and I cannot judge whether the conversion would be faithful.
These sorts can be classified into the two families below. I regard each of them as an improvement of its base algorithm, and not too far from it.
- <Quicks> GNU QuickSort, Pattern-defeating QuickSort (PdqSort), IntroSort
- <Merges> TimSort, QuadSort, GNU Stable Sort, Block Sort
The result table is below.
The values for 8/10/12 elements are the results over all input combinations. The values for 100/512/1000 elements are the results of the same number of sorting cycles.
That is the limit of my PC's ability to finish the test within a realistic time and to retain the measurement data.
Comparison and move count measurement per sorting algorithm:

| Data amount | Comparison (min) | Comparison (max) | Comparison (average) | Move (min) | Move (max) | Move (average) |
| --- | --- | --- | --- | --- | --- | --- |
Selection sort |
8 | 28 | 28 | 28.00000 | 0 | 14 | 10.56429 |
10 | 45 | 45 | 45.00000 | 0 | 18 | 14.14206 |
12 | 66 | 66 | 66.00000 | 0 | 22 | 17.79358 |
100 | 4950 | 4950 | 4950.00000 | 180 | 198 | 190.12000 |
512 | 130816 | 130816 | 130816.00000 | 996 | 1022 | 1010.32813 |
1000 | 499500 | 499500 | 499500.00000 | 1966 | 1998 | 1985.18800 |
Insertion sort |
8 | 7 | 28 | 19.28214 | 0 | 35 | 19.28214 |
10 | 9 | 45 | 29.57103 | 0 | 54 | 29.57103 |
12 | 11 | 66 | 41.89679 | 0 | 77 | 41.89679 |
100 | 2135 | 2987 | 2577.11000 | 2133 | 2988 | 2577.17000 |
512 | 61020 | 71182 | 65994.41992 | 61016 | 71181 | 65994.26172 |
1000 | 236268 | 268100 | 251069.41000 | 236269 | 268097 | 251069.47100 |
Shell sort |
8 | 11 | 26 | 17.91667 | 0 | 32 | 16.24048 |
10 | 15 | 37 | 25.51667 | 0 | 42 | 22.72937 |
12 | 21 | 52 | 34.71905 | 0 | 59 | 29.23622 |
100 | 689 | 812 | 732.76000 | 527 | 670 | 587.21000 |
512 | 5544 | 6101 | 5785.70508 | 4258 | 4964 | 4579.62109 |
1000 | 12609 | 13541 | 13053.70600 | 9739 | 10864 | 10286.34600 |
Binary Insertion sort |
8 | 13 | 17 | 15.59048 | 0 | 35 | 19.28214 |
10 | 19 | 25 | 22.21270 | 0 | 54 | 29.57103 |
12 | 25 | 33 | 29.42482 | 0 | 77 | 41.89679 |
100 | 520 | 540 | 530.77000 | 2133 | 2988 | 2577.17000 |
512 | 3876 | 3936 | 3903.72070 | 61016 | 71181 | 65994.26172 |
1000 | 8547 | 8629 | 8586.72000 | 236269 | 268097 | 251069.47100 |
Bubble Sort |
8 | 7 | 28 | 25.99529 | 0 | 56 | 28.00000 |
10 | 9 | 45 | 41.99319 | 0 | 90 | 45.00000 |
12 | 11 | 66 | 61.90309 | 0 | 132 | 66.00000 |
100 | 4544 | 4950 | 4875.77000 | 4080 | 5786 | 4964.76000 |
512 | 128041 | 130816 | 130381.95703 | 121026 | 141350 | 130978.04688 |
1000 | 491874 | 499500 | 498604.20500 | 470546 | 534210 | 500153.79200 |
Gnome Sort |
8 | 7 | 56 | 33.28214 | 0 | 56 | 28.00000 |
10 | 9 | 90 | 52.07103 | 0 | 90 | 45.00000 |
12 | 11 | 132 | 74.89679 | 0 | 132 | 66.00000 |
100 | 4175 | 5880 | 5059.49000 | 4080 | 5786 | 4964.76000 |
512 | 121533 | 141857 | 131483.44336 | 121026 | 141350 | 130978.04688 |
1000 | 471541 | 535205 | 501146.30600 | 470546 | 534210 | 500153.79200 |
QuickSort |
8 | 22 | 43 | 31.45833 | 0 | 28 | 14.58571 |
10 | 32 | 60 | 43.49929 | 0 | 38 | 19.89810 |
12 | 42 | 79 | 55.82187 | 0 | 50 | 25.64283 |
100 | 766 | 1048 | 837.44000 | 312 | 406 | 370.04000 |
512 | 5399 | 6559 | 5726.65039 | 2338 | 2618 | 2476.96094 |
1000 | 11612 | 13847 | 12314.54900 | 5088 | 5520 | 5300.98800 |
QuickSort (No overlapping?) |
8 | 13 | 19 | 15.77619 | 0 | 28 | 14.58571 |
10 | 19 | 29 | 22.58571 | 0 | 38 | 19.89810 |
12 | 25 | 41 | 30.02121 | 0 | 48 | 25.46762 |
100 | 531 | 746 | 574.37000 | 316 | 396 | 358.28000 |
512 | 4007 | 5109 | 4329.97852 | 2248 | 2534 | 2398.58594 |
1000 | 8908 | 10996 | 9596.64700 | 4884 | 5332 | 5141.92800 |
HeapSort |
8 | 21 | 31 | 26.53512 | 26 | 56 | 40.04683 |
10 | 31 | 47 | 39.49517 | 36 | 82 | 56.62910 |
12 | 41 | 65 | 53.17458 | 46 | 110 | 74.08227 |
100 | 1029 | 1087 | 1056.76000 | 1190 | 1312 | 1247.64000 |
512 | 7715 | 7928 | 7817.87891 | 8620 | 9044 | 8812.53906 |
1000 | 17073 | 17392 | 17226.73000 | 18894 | 19480 | 19185.00800 |
HeapSort (Avoid comparison with ancestor) |
8 | 16 | 30 | 23.39206 | 26 | 56 | 40.04683 |
10 | 23 | 44 | 34.28377 | 36 | 82 | 56.62910 |
12 | 30 | 61 | 46.54484 | 46 | 110 | 74.08227 |
100 | 920 | 1003 | 961.27000 | 1190 | 1312 | 1247.64000 |
512 | 7170 | 7447 | 7313.25391 | 8620 | 9044 | 8812.53906 |
1000 | 15995 | 16479 | 16228.24500 | 18894 | 19480 | 19185.00800 |
Top Down Merge Sort (Not-In-place) |
8 | 12 | 17 | 15.73333 | 48 | 48 | 48.00000 |
10 | 15 | 25 | 22.66667 | 68 | 68 | 68.00000 |
12 | 20 | 33 | 29.95238 | 88 | 88 | 88.00000 |
100 | 526 | 554 | 541.33000 | 1344 | 1344 | 1344.00000 |
512 | 3924 | 4009 | 3962.76367 | 9216 | 9216 | 9216.00000 |
1000 | 8656 | 8757 | 8707.68900 | 19952 | 19952 | 19952.00000 |
Merge (2^N) Sort (Bottom Up, Not-In-place) |
8 | 12 | 17 | 15.73333 | 48 | 48 | 48.00000 |
10 | 15 | 27 | 23.84444 | 72 | 72 | 72.00000 |
12 | 20 | 33 | 30.35556 | 88 | 88 | 88.00000 |
100 | 535 | 574 | 557.76000 | 1376 | 1376 | 1376.00000 |
512 | 3924 | 4009 | 3962.76367 | 9216 | 9216 | 9216.00000 |
1000 | 8661 | 8771 | 8719.75200 | 19968 | 19968 | 19968.00000 |
Merge (2^N) Sort, Half copy (Bottom Up, Not-In-place) |
8 | 12 | 17 | 15.73333 | 24 | 24 | 24.00000 |
10 | 15 | 27 | 23.84444 | 40 | 40 | 40.00000 |
12 | 20 | 33 | 30.35556 | 48 | 48 | 48.00000 |
100 | 535 | 574 | 557.76000 | 700 | 700 | 700.00000 |
512 | 3924 | 4009 | 3962.76367 | 4608 | 4608 | 4608.00000 |
1000 | 8661 | 8771 | 8719.75200 | 10000 | 10000 | 10000.00000 |
The nature of each algorithm seems clearer from these counts than from the conventional Landau (big-O) notation or from sorting time tests.
QuickSort, HeapSort and the bottom-up merge sort each appear in two versions: the original, and a faster version that alters the wasteful part as little as possible.
The faster QuickSort variant is very ugly. GNU QuickSort gives up before going that far and finishes with an insertion sort instead; GNU QuickSort is the right choice for practical use, and my variant is only an experiment.
For the bottom-up merge sort I chose to fix the sort size to 2^N, because that is shared with the in-place merge sort of the next chapter. Its sped-up version uses a trick that is only possible in an object-oriented language. QuadSort achieves the same level of move count reduction for any buffered copying, so among merge sorts that use a buffer, QuadSort is better than this one.
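As an aside, the "Half copy" row of the table can be read as merging back and forth between the data array and a buffer and then swapping the two references, so each element is moved only once per pass. The sketch below shows that reading; the names are mine, and the attached program may realize the reference trick differently.

```csharp
// Hedged sketch of one way to halve the move count of a buffered bottom-up merge sort:
// merge from `src` into `dst` and then swap the two references, so no copy-back pass is
// needed. This is my reading of the "Half copy" variant, not the attached program.
using System;

static int[] BottomUpMergeSort(int[] input)
{
    int n = input.Length;
    int[] src = (int[])input.Clone();
    int[] dst = new int[n];

    for (int width = 1; width < n; width *= 2)
    {
        for (int lo = 0; lo < n; lo += 2 * width)
        {
            int mid = Math.Min(lo + width, n);
            int hi = Math.Min(lo + 2 * width, n);
            int i = lo, j = mid, k = lo;
            // Standard two-run merge: each element is moved exactly once per pass.
            while (i < mid && j < hi) dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
            while (i < mid) dst[k++] = src[i++];
            while (j < hi)  dst[k++] = src[j++];
        }
        (src, dst) = (dst, src);     // reference swap instead of copying back
    }
    return src;                      // after the last pass, `src` holds the sorted data
}
```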
In the usual time measurement tests, one comparison and one move take almost the same time. Because of that, the sum of the comparison count and the move count in this table can be read as a time measurement result. Looking at the table, QuickSort is well balanced between comparisons and moves, which agrees with QuickSort being the fastest in time measurement tests.
But I warn you of one thing here. The time of one move is almost the same in a time measurement test and in a real sort, whereas the time of one comparison in a real sort becomes larger than in the test. This is the main factor that makes sorting take time. QuickSort's comparison count is not the smallest; merge sort's is smaller.
I ran a time measurement test of GNU QuickSort against the in-place merge sort that I explain later. GNU qsort is clearly faster in the pure time test, but merely adding a 1 µs wait to the comparison function makes the in-place merge sort faster. Merge sort is better for sorting with a heavy comparison function, which I think is the common demand.
If you look at this table carefully, you may raise an objection: the sort with the smallest comparison count is actually binary insertion sort. Its move count is far too large, so it can never be the fastest in a time measurement test. But that relies on the unspoken agreement that we sort an array; its move count on an array is O(N^2), while on a linked list it is O(N). Isn't the combination of the data structure "linked list" and the algorithm "binary insertion sort" the fastest? To confirm this, I attached a linked list structure to the array (DataList.cs). The tables below show its count measurements.
Binary Insertion sort (List) List move count |
8 | 13 | 17 | 15.59048 | 0 | 21 | 15.84643 |
10 | 19 | 25 | 22.21270 | 0 | 27 | 21.21310 |
12 | 25 | 33 | 29.42482 | 0 | 33 | 26.69037 |
100 | 520 | 540 | 530.77000 | 267 | 294 | 284.37000 |
512 | 3876 | 3936 | 3903.72070 | 1500 | 1533 | 1515.71484 |
1000 | 8547 | 8629 | 8586.72000 | 2943 | 2994 | 2977.72500 |
Binary Insertion sort (List) Data move count |
8 | 13 | 17 | 15.59048 | 0 | 14 | 10.56429 |
10 | 19 | 25 | 22.21270 | 0 | 18 | 14.14206 |
12 | 25 | 33 | 29.42482 | 0 | 22 | 17.79358 |
100 | 520 | 540 | 530.77000 | 180 | 198 | 190.12000 |
512 | 3876 | 3936 | 3903.72070 | 996 | 1022 | 1010.32813 |
1000 | 8547 | 8629 | 8586.72000 | 1966 | 1998 | 1985.18800 |
I attached this linked list to the sort process. The result is the sorting algorithm with the smallest counts for both comparisons and moves; I named it BillSort, short for "Binary Insertion using tiny Linked List sort", and measured its time. Contrary to expectation, it is very slow; the sorting seems endless.
I extracted the slow point below. It takes more than ten times as long as everything else combined. Thinking only about CPU processing time, we can never understand what wastes so much time.
```csharp
// Walk `half` links through Next[] to reach the midpoint of the current search range.
for (int k = half; k > 0; k--) {
    j = Next[j];
}
```
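For context, below is a hedged reconstruction of the binary search this loop sits in. The names (Values, Next, head, count, value) are assumptions, not copied from the attached source. Every probe of the midpoint has to chase about half of the remaining links, so a single insertion touches memory scattered over the whole list.

```csharp
// Hedged reconstruction (assumed names): binary insertion into a linked list kept as
// successor indexes in Next[]. Finds the first linked element not smaller than `value`.
static int FindInsertPoint(int[] Values, int[] Next, int head, int count, int value)
{
    int j = head;                        // start of the current search range
    int range = count;                   // number of elements still in the range
    while (range > 0)
    {
        int half = range / 2;
        int probe = j;
        for (int k = half; k > 0; k--)   // the slow loop shown above:
            probe = Next[probe];         // chase `half` links just to reach the midpoint
        if (Values[probe] < value)
        {
            j = Next[probe];             // continue in the upper half
            range -= half + 1;
        }
        else
        {
            range = half;                // continue in the lower half
        }
    }
    return j;                            // relinking here is O(1); the link chasing above is not
}
```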
The reason it seems endless is that the data transfer between cache and main memory is O(N^2). We cannot hope for speed from an algorithm that chases a linked list.
It is true that its CPU processing count is the minimum among all sorting algorithms. If the whole RAM of a computer worked at L1 cache speed, it would take the fastest position; such a computer may appear in the future, and that is why I keep it in this article. But as a process on a present PC, I discard the possibility that binary insertion sort becomes the fastest.
We must be conscious of the memory accesses of a process, not only of its CPU execution time.
In-place Merge Sort
I concluded in the previous chapter that merge sort is suitable for general sorting.
But there is one problem: merge sort is not in-place. It needs an O(N) work area besides the array being sorted. I now consider this a silly problem, for two reasons: copying the whole array is not necessary, and the memory of a present PC has become sufficiently large since the time this objection was born and discussed.
Still, when we think about the cache, an in-place merge sort is desirable, and that is why I made the algorithm I show now. Its base methods are the bottom-up approach and limiting the block size to 2^N.
The three realizations below are necessary for it (InPlaceMerge.cs).
- An exchange process for two adjacent areas of different sizes.
- A fast way to merge two areas of different sizes.
- The move count of an in-place merge sort is O(N(logN)^2), while the not-in-place version is O(NlogN); when the element count is large, the not-in-place version inevitably pulls ahead, so an O(NlogN) method for the in-place version is necessary.
(1) Exchange process of two adjacent areas of different sizes
Shifting elements one by one is the method that comes to mind first. Knuth supposed a faster idea, but there is still no realized program; that is all I found in internet articles.
But a good way is already used in GNU stable sort: reverse each area independently, then reverse the two areas joined together.
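A minimal sketch of that reversal exchange follows, with assumed names. Every element is written twice (once in its own area's reversal, once in the joined reversal), which is where the 2(M+N) move count mentioned below comes from.

```csharp
// Minimal sketch of the reversal exchange described above (assumed names):
// reverse each area, then reverse the whole joined span.
static void Reverse(int[] a, int lo, int hi)          // reverses a[lo..hi] inclusive
{
    while (lo < hi) { (a[lo], a[hi]) = (a[hi], a[lo]); lo++; hi--; }
}

// Exchange adjacent areas a[start..start+m-1] (size m) and a[start+m..start+m+n-1] (size n).
static void ReversalExchange(int[] a, int start, int m, int n)
{
    Reverse(a, start, start + m - 1);                 // reverse the left area
    Reverse(a, start + m, start + m + n - 1);         // reverse the right area
    Reverse(a, start, start + m + n - 1);             // reverse the joined span
}
```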
Before I knew of that, I had found a different way.
- Shift the smaller area by an integer multiple of its own length. (This length must not exceed the larger area's length.)
- If (larger area length) / (smaller area length) is an integer, the process finishes. Otherwise, apply step 1 again between the shifted smaller area and the leftover part of the larger area.
- Repeat until it finishes.
This is exactly the same recursion as the calculation of the greatest common divisor, so I name it GCD exchange.
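Below is a sketch of the GCD exchange idea, using pairwise block swaps for simplicity; the names are mine. The article's description shifts the smaller area by whole multiples of its length at once, which is presumably how the lower move count mentioned next can be reached, so treat this only as an illustration of the GCD-shaped recursion.

```csharp
// Hedged sketch of the GCD exchange idea (assumed names; not the attached InPlaceMerge.cs):
// repeatedly swap the smaller block forward, then recurse on the leftover, mirroring Euclid's GCD.
static void SwapBlocks(int[] a, int i, int j, int len)
{
    for (int k = 0; k < len; k++)
        (a[i + k], a[j + k]) = (a[j + k], a[i + k]);
}

// Exchange adjacent areas a[start..start+m-1] (size m) and a[start+m..start+m+n-1] (size n).
static void GcdExchange(int[] a, int start, int m, int n)
{
    while (m > 0 && n > 0)
    {
        if (m <= n)
        {
            SwapBlocks(a, start, start + m, m);     // shift the smaller left area one step right
            start += m;
            n -= m;                                 // the surplus of the larger area remains
        }
        else
        {
            SwapBlocks(a, start + m - n, start + m, n);  // symmetric case: the right area is smaller
            m -= n;
        }
    }
}
```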
Suppose the two area sizes are M and N. The GNU way always needs 2(M+N) moves; GCD exchange needs at least (M+N) moves and only approaches 2(M+N) in the worst case. The measurement results are in the table below. GCD exchange can also divide the shifting into pieces that fit within the cache size, so I think it is superior to the GNU way in practical use. Its program listing looks a little complex, but if it were realized as a logic circuit it would be about as complex as the other ways.
Merge (2^N) Sort (In-place, GNU exchange) |
8 | 12 | 17 | 15.73333 | 0 | 46 | 25.02857 |
10 | 15 | 27 | 23.84444 | 0 | 78 | 40.96190 |
12 | 20 | 33 | 30.35556 | 0 | 110 | 57.40433 |
100 | 533 | 573 | 557.85000 | 2682 | 3868 | 3307.00000 |
512 | 3910 | 4007 | 3962.97852 | 63900 | 76296 | 70696.91406 |
1000 | 8654 | 8771 | 8719.70700 | 245960 | 285706 | 266366.66400 |
Merge (2^N) Sort (In-place, GCD exchange) |
8 | 12 | 17 | 15.70476 | 0 | 33 | 18.49524 |
10 | 15 | 27 | 23.26032 | 0 | 52 | 28.60635 |
12 | 20 | 33 | 30.14113 | 0 | 72 | 39.64978 |
100 | 530 | 568 | 553.34000 | 1550 | 2159 | 1867.88000 |
512 | 3910 | 3994 | 3953.73047 | 33723 | 39902 | 37093.05664 |
1000 | 8640 | 8754 | 8700.99400 | 126922 | 146627 | 137043.47400 |
(2) Speed investigation of merging two areas of different sizes
Generally "merge sort" means the top-down version; the bottom-up version sometimes gets no attention.
In the table of the previous chapter we can see that the top-down merge sort is faster than the bottom-up one at some points. Top-down always merges areas of almost the same size, while bottom-up must sometimes merge areas of very different sizes. We must handle this case to gain speed.
Below is a table for the case of merging 1 to 8 elements into an 8-element area: the number of combinations, broken down by which of the 9 possible locations in the 8-element area receives the top element of the inserted 1 to 8 elements.
| Area data amount | Insert 1 | Insert 2 | Insert 3 | Insert 4 | Insert 5 | Insert 6 | Insert 7 | Insert 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 9 | 45 | 165 | 495 | 1287 | 3003 | 6435 |
8 | 1 | 8 | 36 | 120 | 330 | 792 | 1716 | 3432 |
| 1 | 7 | 28 | 84 | 210 | 462 | 924 | 1716 |
| 1 | 6 | 21 | 56 | 126 | 252 | 462 | 792 |
| 1 | 5 | 15 | 35 | 70 | 126 | 210 | 330 |
| 1 | 4 | 10 | 20 | 35 | 56 | 84 | 120 |
| 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Total | 9 | 45 | 165 | 495 | 1287 | 3003 | 6435 | 12870 |
Each column is the kind of Σ (summation) calculation we learned in high school.
- More than half (4 to 8 elements): the usual merge from the top is fine. (Precisely speaking, for 4 to 7 elements the element next to the top is taken at its center location.)
- 1 element: binary search and insertion; it is binary search itself (a sketch follows this list).
- 2 elements: a multiplication and the subtraction of a decreasing value; such arithmetic is no burden for the CPU.
- Others: the calculation becomes an N-th power formula minus an (N-1)-th power formula. When the element count is large, it exceeds negligible time and complexity. Since the value is not the final result but only the starting point of the search, I settled for an approximate start value: when the element count exceeds 2, calculate the 2-element value and multiply it by 1/2^(N-2).
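Here is the promised sketch of the 1-element case: a plain binary search for the insertion point followed by a one-position shift. The function and variable names are assumptions, not taken from the attached source.

```csharp
// Hedged sketch (assumed names): the single element a[lo] must be merged into the
// sorted area a[lo+1 .. hi]. A binary search finds its insertion point, then the
// elements in front of it are shifted one slot to the left.
static void MergeOneElement(int[] a, int lo, int hi)
{
    int key = a[lo];
    int left = lo + 1, right = hi + 1;     // search inside the sorted area
    while (left < right)
    {
        int mid = (left + right) / 2;
        if (a[mid] < key) left = mid + 1;
        else right = mid;
    }
    // Shift the elements smaller than key one slot left, then drop key into place.
    for (int i = lo; i < left - 1; i++) a[i] = a[i + 1];
    a[left - 1] = key;
}
```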
This alteration has already been applied to the result tables above, so the counts are improved compared with the previous article.
(3) Making the in-place merge sort move count O(NlogN)
I said in the previous article that, when the data amount is large, switching to the not-in-place merge sort was necessary.
O(N(logN)^2) is unavoidable as long as the merge starts from the top. But I have found an idea that decreases the move count of the in-place merge.
Merge from the center and divide the area, like quicksort. First take the center element of the large area and search for its insertion location in the small area. The probability density of that location should be mountain-shaped, highest at the center, so search outward from the center, one step at a time, in the small area. One merge is thereby divided into two smaller merges, a massive area moves at once, and the total move count drops straight away. I name it center-to-center merge.
The degree of N in its Landau symbol clearly goes down, but I cannot assert that it is truly O(NlogN), because I am not an expert in mathematical theory.
As a side effect, the comparison count increases.
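The sketch below shows the divide-and-recurse structure of the center-to-center merge as I understand it. It is only an illustration under assumptions: it uses a plain binary search where the article searches outward from the center, and a reversal rotation where the GCD exchange could be used; the names are mine, not those of the attached InPlaceMerge.cs.

```csharp
// Hedged sketch of a center-to-center style merge (assumed names; illustration only).
static int LowerBound(int[] a, int lo, int hi, int v)   // first index in [lo,hi) with a[i] >= v
{
    while (lo < hi) { int m = (lo + hi) / 2; if (a[m] < v) lo = m + 1; else hi = m; }
    return lo;
}

static int UpperBound(int[] a, int lo, int hi, int v)   // first index in [lo,hi) with a[i] > v
{
    while (lo < hi) { int m = (lo + hi) / 2; if (a[m] <= v) lo = m + 1; else hi = m; }
    return lo;
}

static void ReverseRange(int[] a, int lo, int hi)       // reverse a[lo..hi-1]
{
    for (hi--; lo < hi; lo++, hi--) (a[lo], a[hi]) = (a[hi], a[lo]);
}

static void Rotate(int[] a, int lo, int mid, int hi)    // exchange adjacent blocks [lo,mid) and [mid,hi)
{
    ReverseRange(a, lo, mid); ReverseRange(a, mid, hi); ReverseRange(a, lo, hi);
}

// Runs a[lo..mid) and a[mid..hi) are sorted; merge them in place.
static void CenterMerge(int[] a, int lo, int mid, int hi)
{
    int leftLen = mid - lo, rightLen = hi - mid;
    if (leftLen == 0 || rightLen == 0) return;
    if (leftLen == 1 && rightLen == 1)
    {
        if (a[lo] > a[mid]) (a[lo], a[mid]) = (a[mid], a[lo]);
        return;
    }

    int cut1, cut2;
    if (leftLen >= rightLen)
    {
        cut1 = lo + leftLen / 2;                        // center element of the larger run
        cut2 = LowerBound(a, mid, hi, a[cut1]);         // its insertion point in the other run
    }
    else
    {
        cut2 = mid + rightLen / 2;
        cut1 = UpperBound(a, lo, mid, a[cut2]);
    }

    Rotate(a, cut1, mid, cut2);                         // one massive block move
    int newMid = cut1 + (cut2 - mid);
    CenterMerge(a, lo, cut1, newMid);                   // two smaller, independent merges;
    CenterMerge(a, newMid, cut2, hi);                   // they could also run on different cores
}
```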
Merge (2^N) Sort (In-place, Center-to-center merge) |
8 | 13 | 23 | 18.46190 | 0 | 33 | 19.00952 |
10 | 18 | 35 | 27.55238 | 0 | 50 | 29.18730 |
12 | 24 | 44 | 35.52915 | 0 | 70 | 39.98225 |
100 | 637 | 698 | 663.45000 | 1130 | 1339 | 1219.56000 |
512 | 4741 | 4963 | 4855.43359 | 11516 | 12248 | 11881.91992 |
1000 | 10551 | 10890 | 10736.78400 | 28425 | 29889 | 29095.46700 |
But it has clear merits.
- Compared with the original merge, the data transfers between cache and main memory are reduced.
- The many smaller merges it produces can be processed in parallel.
A not-in-place merge using a buffer cannot realize this. The algorithm becomes a pure in-place merge sort, finally free of that spell.
Switching between the two in-place merge methods seems good: within the cacheable size, merge with the "from the top" method and build sorted blocks; when the size exceeds the cache, merge with the center-to-center method.
I think this almost reaches the terminus of the in-place merge sort on a single core. Though some adjustment may still be necessary, the necessary techniques should all have appeared by now. "Merge, In-place, Cache adaptable, Exponent of 2 (2^N)": I name it MiceSort for short.
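A hedged sketch of how the two merge methods could be combined in one bottom-up driver follows. The cache threshold, the names, and the naive MergeFromTop stand-in are assumptions; CenterMerge refers to the sketch in (3), and the real MiceSort in the attached source may organize this differently.

```csharp
// Hedged sketch of the switching idea (assumed names and threshold; not the attached MiceSort).
// Bottom-up passes double the run width; cache-sized merges use a from-the-top in-place merge,
// larger merges use the center-to-center merge sketched in (3).
using System;

static class MiceSortSketch
{
    const int CacheableElements = 32 * 1024;          // assumption: tune to the target CPU's cache

    public static void Sort(int[] a)
    {
        int n = a.Length;
        for (int width = 1; width < n; width *= 2)
        {
            for (int lo = 0; lo + width < n; lo += 2 * width)
            {
                int mid = lo + width;
                int hi = Math.Min(lo + 2 * width, n);
                if (hi - lo <= CacheableElements)
                    MergeFromTop(a, lo, mid, hi);      // cache-sized block: merge from the top
                else
                    CenterMerge(a, lo, mid, hi);       // large block: center-to-center merge
            }
        }
    }

    // Naive stand-in for the article's from-the-top merge, kept only to make the sketch complete.
    static void MergeFromTop(int[] a, int lo, int mid, int hi)
    {
        while (lo < mid && mid < hi)
        {
            if (a[lo] <= a[mid]) { lo++; continue; }
            int v = a[mid];
            for (int k = mid; k > lo; k--) a[k] = a[k - 1];
            a[lo] = v;
            lo++; mid++;
        }
    }

    static void CenterMerge(int[] a, int lo, int mid, int hi)
        => MergeFromTop(a, lo, mid, hi);               // placeholder: use the CenterMerge sketch from (3)
}
```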
MiceSort (Merge,In-place,Cache adaptable,Exponent of 2(2^N)) |
8 | 12 | 17 | 15.69048 | 0 | 33 | 18.49524 |
10 | 16 | 25 | 22.60159 | 0 | 52 | 28.60635 |
12 | 21 | 35 | 29.88644 | 0 | 72 | 39.64978 |
100 | 522 | 555 | 541.98000 | 1550 | 2159 | 1867.88000 |
512 | 3917 | 3996 | 3955.87891 | 24785 | 29228 | 27130.36328 |
1000 | 8656 | 8798 | 8722.83000 | 64481 | 75087 | 70091.43700 |
In the measurement results, the comparison count is almost the same as the original and the move count decreases. I converted it to C and held a time competition with GNU qsort. It is slower than qsort when the comparison is a pure comparison only, but it becomes faster when the comparison includes a 1 µs wait.
Parallel Sorting
The in-place merge process can be transformed into parallel processing (Parallel.cs).
I intend to document this process as well as I can.
- Run as many sort robots as the CPU has cores. These robots all perform the same operation, and each robot independently cleans up whatever sort work it finds. When all the sort work has been tidied up, the sort is finished.
If you run all the identical cleaning robots you have at the same time, the time to clean a spacious room shrinks to 1/(number of robots). That is the image of this process.
We have all experienced crashes from unhandled exceptions, or deadlocks, when tasks are launched one after another, asynchronously, at every parallelizable point. Even if we intend to limit the tasks to the core count, some careless modification brings an exception for no obvious reason.
One radical settlement is to give up parallel processing. The other is the method shown here: exactly as many tasks as there are cores keep running, and nothing else. Otherwise nobody can predict that the parallel process works safely in every circumstance.
- Divide the sort data into cacheable-size blocks. One kind of work turns these blocks into sorted blocks; the other kind merges sorted blocks into larger merged blocks and carries the merge up to the higher levels. These two states are necessary.
- Prepare an array that records each block's situation (not sorted, under sorting, or sorted) and its size. Each sort robot refers to this array, finds the unprocessed work, and cleans up what it must do.
When the top-level merge has finished, the sort is complete.
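A minimal sketch of this robot scheme follows, assuming a shared block-state array protected by a lock; the names, states, and task layout are my assumptions and not the structure of the attached Parallel.cs.

```csharp
// Hedged sketch (assumed names and states; not the attached Parallel.cs): a fixed number
// of worker tasks equal to the core count scan a shared block-state array, claim one
// block under a lock, process it, and stop only when nothing is left to claim.
using System;
using System.Threading.Tasks;

enum BlockState { NotSorted, Sorting, Sorted }

class ParallelSortSketch
{
    readonly object gate = new object();
    readonly BlockState[] state;                    // one entry per cacheable-size block

    public ParallelSortSketch(int blockCount) { state = new BlockState[blockCount]; }

    public void Run()
    {
        int workers = Environment.ProcessorCount;   // as many "robots" as cores
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
            tasks[w] = Task.Run(Worker);
        Task.WaitAll(tasks);                        // all blocks tidied up => this stage is finished
    }

    void Worker()
    {
        while (true)
        {
            int claimed = -1;
            lock (gate)                             // claim one unprocessed block
            {
                for (int i = 0; i < state.Length; i++)
                    if (state[i] == BlockState.NotSorted) { state[i] = BlockState.Sorting; claimed = i; break; }
            }
            if (claimed < 0) return;                // nothing left for this robot
            SortBlock(claimed);                     // sort one cacheable-size block in place
            lock (gate) state[claimed] = BlockState.Sorted;
        }
    }

    // Placeholder: a single-core sort of block `i`; merging sorted blocks follows the same claim pattern.
    void SortBlock(int i) { }
}
```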
Sort data are not uniform; they are a mix of almost-sorted regions and random regions. The cache-size blocks also have the effect of smoothing the workload of each core.
The previous article ended there. During the climax of the large merges, the working cores faded away one after another, and the top-level merge of the whole array was done entirely by one core. That is very wasteful on a CPU with many cores, so I added the next process.
- In the merge of sorted blocks, use the center-to-center merge method. The merge is thereby divided into many smaller merges; push them onto a stack. When every merge in the stack has been cleared off, the whole merge is finished.
- Make a management center that gathers the stacks of all the cores and can be referred to from every core. When a core has finished all the work it must do itself, it takes a merge from this center: the largest merge, at the bottom of a stack.
With that, every core keeps working in parallel until the very end.
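A hedged sketch of that management center follows, assuming each pending merge is described by a (lo, mid, hi) range; the stack discipline and the names are my own illustration, not the attached code.

```csharp
// Hedged sketch (assumed names): a shared center holding pending center-to-center merges.
// Each worker pushes the sub-merges it creates and, when idle, takes the largest pending one.
using System.Collections.Generic;

readonly record struct MergeJob(int Lo, int Mid, int Hi)
{
    public int Size => Hi - Lo;
}

class MergeCenter
{
    readonly object gate = new object();
    readonly List<MergeJob> pending = new List<MergeJob>();

    public void Push(MergeJob job)
    {
        lock (gate) pending.Add(job);
    }

    // An idle core takes the largest pending merge (the "stack bottom" of the description above).
    public bool TryTakeLargest(out MergeJob job)
    {
        lock (gate)
        {
            if (pending.Count == 0) { job = default; return false; }
            int best = 0;
            for (int i = 1; i < pending.Count; i++)
                if (pending[i].Size > pending[best].Size) best = i;
            job = pending[best];
            pending.RemoveAt(best);
            return true;
        }
    }
}
```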
I name it "Both sides now sort". From song "Both sides now" lyrics.
Though present situation is not cheerful, I believe this is the fastest sort algorithm, and it almost reaches near to possible limit of sorting on multi-core CPU, and it is basement of future parallel sorting.
I use goto sentence in this program list. In fact, I made goto-less structured program, but I threw away it because it is difficult to guess the process purpose. To keep process order from top to bottom is impossible now, by the obstruction of parallel, asynchronous, callback, etc.. For the readable description, using goto sentence efficiently is necessary now. It is my opinion.
Notes on the Attached Program
One note is that the program sometimes judges that the sorting result is not good.
If you encounter this problem when executing the attached program, increase the Sleep(1) count in the program until the result becomes OK; the current number suits a small data element count.
I know a lock is the proper settlement for this, but a lock slows the sorting down, so I judged that a wait after the sorting is better.
The other note is that the present program does not yet run fastest; it is slower than the single-core MiceSort. The reason may be that I test it on a 2-core i5 notebook.
But I think room for improvement remains. For safety I have used lock statements everywhere a conflict can occur, and I already know some of those locks can be removed. A reconsideration around the locks is necessary.
A speed check on Windows C# may be meaningless anyway. The processing time varies largely with the conditions, simply from moving the location of a piece of the process or from whether Visual Studio is running, and I cannot stop the background tasks of Windows.
Remaining Work
The attached program contains constants such as the cache size. Their values are empirically fast values on my i5 notebook; they are not adapted to theoretically fastest values. They should differ by CPU, and they should differ on your PC. In real sorting, if the comparison process uses many work areas or tables, these values should differ again. On a CPU with few cores, like the one I use now, OS tasks occupy part of the cache of the sorting core. Still, in this experiment I have seen the speed change when these constants change: because a CPU has several levels of cache, how conscious the program is of the cache affects the speed.
Fixing these values properly is impossible for me. The person who can do it is a CPU maker's engineer who knows the cache control logic.
Or only the user programmer can define the values that are efficient for his own purpose.
I attempted to convert "Both sides now sort" to C for a competition with qsort, but parallel processing in C is environment-dependent. The C# shape is easy and convenient for me even if it is slow. Even a conversion that runs on Linux would be meaningless for me, because I do not have an expensive many-core PC. Until an environment-independent parallel construct of C arrives, I postpone it.
Before that, I intend to rework the code around the locks.
Though I feel this may be the last one, I am writing about this theme.
History
- 19th August, 2020: Initial version
- 27th January, 2021: Updated article and source code