This article introduces my original in-place merge sort algorithm. First, I consider what the fastest sort is from a theoretical analysis. Second, I explain my sort program and how it is adapted to the cache and to multi-core CPUs. Third, I confirm from measurements that it reaches top-level speed. The conclusion is that, judged by how well it uses the cache and multiple cores, my sort can be called the fastest.
Introduction
This article is about an in-place merge sort algorithm, fitted with parallel processing and designed to reduce the data transfers between the cache and main memory. I wrote the previous version right after I had confirmed that the breakthrough idea worked, and it contained many incomplete parts. I have since reconsidered everything, and the material has reached an almost consistent level, so I am publishing this revised edition.
My opinion that this is the fastest sorting algorithm has not changed at all. However, the present state is only a confirmation of its operation in C#, and a C# program never runs at top speed. Even so, I think the content of this article is useful to readers who are troubled by slow sorting.
If I explained everything in detail, this article would become as long as a book, so I apologize that it stays at the summary level. I hope you will read the attached C# listings and work out the parts that are not explained here. I have tried to keep the programs easy to read, though some of my conventions are not common.
The Article Body
Competition of Sorting Time
The time a CPU spends executing a sort can be expressed by the formula below.
Sorting time = (1) time of the sort bookkeeping itself + (2-1) time of one data comparison × (2-2) comparison count + (3-1) time of one data move × (3-2) move count
(1) can be ignored because it is too small compared with the others. (2-2) and (3-2) are the values that characterize the speed of a sorting algorithm, so I actually measured these counts for representative sorting algorithms (Sortings.cs).
I omit the explanation of the individual algorithms because they can easily be found on the internet.
To make the counting fair for every algorithm, the data are encapsulated so that no algorithm can operate on them directly (DataArray.cs).
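As an illustration of this kind of instrumentation, here is a minimal sketch of a counting wrapper. The class and member names are my own assumptions and are not taken from the attached DataArray.cs; the rule that a swap counts as two moves is also an assumption.

```csharp
// Minimal sketch of an instrumented container (assumed names; not the attached DataArray.cs):
// the algorithms may only compare and move through these methods,
// so every algorithm is charged for exactly the same operations.
class CountingArray
{
    private readonly int[] data;
    public long Comparisons { get; private set; }
    public long Moves { get; private set; }

    public CountingArray(int[] source) { data = (int[])source.Clone(); }

    public int Length => data.Length;

    // Counts one comparison and returns the usual negative/zero/positive result.
    public int Compare(int i, int j)
    {
        Comparisons++;
        return data[i].CompareTo(data[j]);
    }

    // Counts one element move.
    public void Move(int from, int to)
    {
        Moves++;
        data[to] = data[from];
    }

    // A swap is counted as two element moves here (an assumption about the counting rule).
    public void Swap(int i, int j)
    {
        Moves += 2;
        (data[i], data[j]) = (data[j], data[i]);
    }
}
```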
I omit radix sort because it cannot serve every sorting demand.
I also gave up measuring the commonly known hybrid algorithms below. Counting their operations would require implanting counting code into their original programs, and I cannot judge whether such implants would be correct; converting those programs into the shape this measurement needs is difficult, and I cannot judge whether the conversion would be faithful.
These sorts can be classified into the two families below. I regard each of them as an improvement of its base algorithm, and not too far from it.
- <Quicks> GNU QuickSort, Pattern-defeating QuickSort (PdqSort), IntroSort
- <Merges> TimSort, QuadSort, GNU Stable Sort, Block Sort
The result table is below.
The values for 8/10/12 elements are the results over all input combinations. The values for 100/512/1000 elements are the results of the same number of sorting cycles.
That is the limit of my PC's ability to finish the test within a realistic time and to retain the measurement data.
Comparison and move count measurement per sorting algorithm:

| Data amount | Comparison (min) | Comparison (max) | Comparison (average) | Move (min) | Move (max) | Move (average) |
| --- | --- | --- | --- | --- | --- | --- |
Selection sort |
8 | 28 | 28 | 28.00000 | 0 | 14 | 10.56429 |
10 | 45 | 45 | 45.00000 | 0 | 18 | 14.14206 |
12 | 66 | 66 | 66.00000 | 0 | 22 | 17.79358 |
100 | 4950 | 4950 | 4950.00000 | 180 | 198 | 190.12000 |
512 | 130816 | 130816 | 130816.00000 | 996 | 1022 | 1010.32813 |
1000 | 499500 | 499500 | 499500.00000 | 1966 | 1998 | 1985.18800 |
Insertion sort |
8 | 7 | 28 | 19.28214 | 0 | 35 | 19.28214 |
10 | 9 | 45 | 29.57103 | 0 | 54 | 29.57103 |
12 | 11 | 66 | 41.89679 | 0 | 77 | 41.89679 |
100 | 2135 | 2987 | 2577.11000 | 2133 | 2988 | 2577.17000 |
512 | 61020 | 71182 | 65994.41992 | 61016 | 71181 | 65994.26172 |
1000 | 236268 | 268100 | 251069.41000 | 236269 | 268097 | 251069.47100 |
Shell sort |
8 | 11 | 26 | 17.91667 | 0 | 32 | 16.24048 |
10 | 15 | 37 | 25.51667 | 0 | 42 | 22.72937 |
12 | 21 | 52 | 34.71905 | 0 | 59 | 29.23622 |
100 | 689 | 812 | 732.76000 | 527 | 670 | 587.21000 |
512 | 5544 | 6101 | 5785.70508 | 4258 | 4964 | 4579.62109 |
1000 | 12609 | 13541 | 13053.70600 | 9739 | 10864 | 10286.34600 |
Binary Insertion sort |
8 | 13 | 17 | 15.59048 | 0 | 35 | 19.28214 |
10 | 19 | 25 | 22.21270 | 0 | 54 | 29.57103 |
12 | 25 | 33 | 29.42482 | 0 | 77 | 41.89679 |
100 | 520 | 540 | 530.77000 | 2133 | 2988 | 2577.17000 |
512 | 3876 | 3936 | 3903.72070 | 61016 | 71181 | 65994.26172 |
1000 | 8547 | 8629 | 8586.72000 | 236269 | 268097 | 251069.47100 |
Bubble Sort |
8 | 7 | 28 | 25.99529 | 0 | 56 | 28.00000 |
10 | 9 | 45 | 41.99319 | 0 | 90 | 45.00000 |
12 | 11 | 66 | 61.90309 | 0 | 132 | 66.00000 |
100 | 4544 | 4950 | 4875.77000 | 4080 | 5786 | 4964.76000 |
512 | 128041 | 130816 | 130381.95703 | 121026 | 141350 | 130978.04688 |
1000 | 491874 | 499500 | 498604.20500 | 470546 | 534210 | 500153.79200 |
Gnome Sort |
8 | 7 | 56 | 33.28214 | 0 | 56 | 28.00000 |
10 | 9 | 90 | 52.07103 | 0 | 90 | 45.00000 |
12 | 11 | 132 | 74.89679 | 0 | 132 | 66.00000 |
100 | 4175 | 5880 | 5059.49000 | 4080 | 5786 | 4964.76000 |
512 | 121533 | 141857 | 131483.44336 | 121026 | 141350 | 130978.04688 |
1000 | 471541 | 535205 | 501146.30600 | 470546 | 534210 | 500153.79200 |
QuickSort |
8 | 22 | 43 | 31.45833 | 0 | 28 | 14.58571 |
10 | 32 | 60 | 43.49929 | 0 | 38 | 19.89810 |
12 | 42 | 79 | 55.82187 | 0 | 50 | 25.64283 |
100 | 766 | 1048 | 837.44000 | 312 | 406 | 370.04000 |
512 | 5399 | 6559 | 5726.65039 | 2338 | 2618 | 2476.96094 |
1000 | 11612 | 13847 | 12314.54900 | 5088 | 5520 | 5300.98800 |
QuickSort (No overlapping?) |
8 | 13 | 19 | 15.77619 | 0 | 28 | 14.58571 |
10 | 19 | 29 | 22.58571 | 0 | 38 | 19.89810 |
12 | 25 | 41 | 30.02121 | 0 | 48 | 25.46762 |
100 | 531 | 746 | 574.37000 | 316 | 396 | 358.28000 |
512 | 4007 | 5109 | 4329.97852 | 2248 | 2534 | 2398.58594 |
1000 | 8908 | 10996 | 9596.64700 | 4884 | 5332 | 5141.92800 |
HeapSort |
8 | 21 | 31 | 26.53512 | 26 | 56 | 40.04683 |
10 | 31 | 47 | 39.49517 | 36 | 82 | 56.62910 |
12 | 41 | 65 | 53.17458 | 46 | 110 | 74.08227 |
100 | 1029 | 1087 | 1056.76000 | 1190 | 1312 | 1247.64000 |
512 | 7715 | 7928 | 7817.87891 | 8620 | 9044 | 8812.53906 |
1000 | 17073 | 17392 | 17226.73000 | 18894 | 19480 | 19185.00800 |
HeapSort (Avoid comparison with ancestor) |
8 | 16 | 30 | 23.39206 | 26 | 56 | 40.04683 |
10 | 23 | 44 | 34.28377 | 36 | 82 | 56.62910 |
12 | 30 | 61 | 46.54484 | 46 | 110 | 74.08227 |
100 | 920 | 1003 | 961.27000 | 1190 | 1312 | 1247.64000 |
512 | 7170 | 7447 | 7313.25391 | 8620 | 9044 | 8812.53906 |
1000 | 15995 | 16479 | 16228.24500 | 18894 | 19480 | 19185.00800 |
Top Down Merge Sort (Not-In-place) |
8 | 12 | 17 | 15.73333 | 48 | 48 | 48.00000 |
10 | 15 | 25 | 22.66667 | 68 | 68 | 68.00000 |
12 | 20 | 33 | 29.95238 | 88 | 88 | 88.00000 |
100 | 526 | 554 | 541.33000 | 1344 | 1344 | 1344.00000 |
512 | 3924 | 4009 | 3962.76367 | 9216 | 9216 | 9216.00000 |
1000 | 8656 | 8757 | 8707.68900 | 19952 | 19952 | 19952.00000 |
Merge (2^N) Sort (Bottom Up, Not-In-place) |
8 | 12 | 17 | 15.73333 | 48 | 48 | 48.00000 |
10 | 15 | 27 | 23.84444 | 72 | 72 | 72.00000 |
12 | 20 | 33 | 30.35556 | 88 | 88 | 88.00000 |
100 | 535 | 574 | 557.76000 | 1376 | 1376 | 1376.00000 |
512 | 3924 | 4009 | 3962.76367 | 9216 | 9216 | 9216.00000 |
1000 | 8661 | 8771 | 8719.75200 | 19968 | 19968 | 19968.00000 |
Merge (2^N) Sort, Half copy (Bottom Up, Not-In-place) |
8 | 12 | 17 | 15.73333 | 24 | 24 | 24.00000 |
10 | 15 | 27 | 23.84444 | 40 | 40 | 40.00000 |
12 | 20 | 33 | 30.35556 | 48 | 48 | 48.00000 |
100 | 535 | 574 | 557.76000 | 700 | 700 | 700.00000 |
512 | 3924 | 4009 | 3962.76367 | 4608 | 4608 | 4608.00000 |
1000 | 8661 | 8771 | 8719.75200 | 10000 | 10000 | 10000.00000 |
The nature of each algorithm seems clearer from these counts than from the conventional Landau (big-O) notation or from sorting time tests.
QuickSort, HeapSort and the bottom-up merge sort each appear in two versions: the original, and a faster version that alters the wasteful part as little as possible.
The faster QuickSort variant is very ugly. GNU QuickSort gives up before going that far and finishes with an insertion sort instead; GNU QuickSort is the right choice for practical use, and my variant is only an experiment.
For the bottom-up merge sort I chose to fix the sort size to 2^N, because that is shared with the in-place merge sort of the next chapter. Its sped-up version uses a trick that is only possible in an object-oriented language. QuadSort achieves the same level of move count reduction for any buffered copying, so among merge sorts that use a buffer, QuadSort is better than this one.
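As an aside, the "Half copy" row of the table can be read as merging back and forth between the data array and a buffer and then swapping the two references, so each element is moved only once per pass. The sketch below shows that reading; the names are mine, and the attached program may realize the reference trick differently.

```csharp
// Hedged sketch of one way to halve the move count of a buffered bottom-up merge sort:
// merge from `src` into `dst` and then swap the two references, so no copy-back pass is
// needed. This is my reading of the "Half copy" variant, not the attached program.
using System;

static int[] BottomUpMergeSort(int[] input)
{
    int n = input.Length;
    int[] src = (int[])input.Clone();
    int[] dst = new int[n];

    for (int width = 1; width < n; width *= 2)
    {
        for (int lo = 0; lo < n; lo += 2 * width)
        {
            int mid = Math.Min(lo + width, n);
            int hi = Math.Min(lo + 2 * width, n);
            int i = lo, j = mid, k = lo;
            // Standard two-run merge: each element is moved exactly once per pass.
            while (i < mid && j < hi) dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
            while (i < mid) dst[k++] = src[i++];
            while (j < hi)  dst[k++] = src[j++];
        }
        (src, dst) = (dst, src);     // reference swap instead of copying back
    }
    return src;                      // after the last pass, `src` holds the sorted data
}
```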
In the usual time measurement tests, one comparison and one move take almost the same time. Because of that, the sum of the comparison count and the move count in this table can be read as a time measurement result. Looking at the table, QuickSort is well balanced between comparisons and moves, which agrees with QuickSort being the fastest in time measurement tests.
But I warn you of one thing here. The time of one move is almost the same in a time measurement test and in a real sort, whereas the time of one comparison in a real sort becomes larger than in the test. This is the main factor that makes sorting take time. QuickSort's comparison count is not the smallest; merge sort's is smaller.
I ran a time measurement test of GNU QuickSort against the in-place merge sort that I explain later. GNU qsort is clearly faster in the pure time test, but merely adding a 1 µs wait to the comparison function makes the in-place merge sort faster. Merge sort is better for sorting with a heavy comparison function, which I think is the common demand.
If you look at this table carefully, you may raise an objection: the sort with the smallest comparison count is actually binary insertion sort. Its move count is far too large, so it can never be the fastest in a time measurement test. But that relies on the unspoken agreement that we sort an array; its move count on an array is O(N^2), while on a linked list it is O(N). Isn't the combination of the data structure "linked list" and the algorithm "binary insertion sort" the fastest? To confirm this, I attached a linked list structure to the array (DataList.cs). The tables below show its count measurements.
Binary Insertion sort (List) List move count |
8 | 13 | 17 | 15.59048 | 0 | 21 | 15.84643 |
10 | 19 | 25 | 22.21270 | 0 | 27 | 21.21310 |
12 | 25 | 33 | 29.42482 | 0 | 33 | 26.69037 |
100 | 520 | 540 | 530.77000 | 267 | 294 | 284.37000 |
512 | 3876 | 3936 | 3903.72070 | 1500 | 1533 | 1515.71484 |
1000 | 8547 | 8629 | 8586.72000 | 2943 | 2994 | 2977.72500 |
Binary Insertion sort (List) Data move count |
8 | 13 | 17 | 15.59048 | 0 | 14 | 10.56429 |
10 | 19 | 25 | 22.21270 | 0 | 18 | 14.14206 |
12 | 25 | 33 | 29.42482 | 0 | 22 | 17.79358 |
100 | 520 | 540 | 530.77000 | 180 | 198 | 190.12000 |
512 | 3876 | 3936 | 3903.72070 | 996 | 1022 | 1010.32813 |
1000 | 8547 | 8629 | 8586.72000 | 1966 | 1998 | 1985.18800 |
I attached this linked list to the sort process. The result is the sorting algorithm with the smallest counts for both comparisons and moves; I named it BillSort, short for "Binary Insertion using tiny Linked List sort", and measured its time. Contrary to expectation, it is very slow; the sorting seems endless.
I extracted the slow point below. It takes more than ten times as long as everything else combined. Thinking only about CPU processing time, we can never understand what wastes so much time.
```csharp
// Walk `half` links through Next[] to reach the midpoint of the current search range.
for (int k = half; k > 0; k--) {
    j = Next[j];
}
```
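For context, below is a hedged reconstruction of the binary search this loop sits in. The names (Values, Next, head, count, value) are assumptions, not copied from the attached source. Every probe of the midpoint has to chase about half of the remaining links, so a single insertion touches memory scattered over the whole list.

```csharp
// Hedged reconstruction (assumed names): binary insertion into a linked list kept as
// successor indexes in Next[]. Finds the first linked element not smaller than `value`.
static int FindInsertPoint(int[] Values, int[] Next, int head, int count, int value)
{
    int j = head;                        // start of the current search range
    int range = count;                   // number of elements still in the range
    while (range > 0)
    {
        int half = range / 2;
        int probe = j;
        for (int k = half; k > 0; k--)   // the slow loop shown above:
            probe = Next[probe];         // chase `half` links just to reach the midpoint
        if (Values[probe] < value)
        {
            j = Next[probe];             // continue in the upper half
            range -= half + 1;
        }
        else
        {
            range = half;                // continue in the lower half
        }
    }
    return j;                            // relinking here is O(1); the link chasing above is not
}
```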
The reason it seems endless is that the data transfer between cache and main memory is O(N^2). We cannot hope for speed from an algorithm that chases a linked list.
It is true that its CPU processing count is the minimum among all sorting algorithms. If the whole RAM of a computer worked at L1 cache speed, it would take the fastest position; such a computer may appear in the future, and that is why I keep it in this article. But as a process on a present PC, I discard the possibility that binary insertion sort becomes the fastest.
We must be conscious of the memory accesses of a process, not only of its CPU execution time.
In-place Merge Sort
I concluded in the previous chapter that merge sort is suitable for general sorting.
But there is one problem: merge sort is not in-place. It needs an O(N) work area besides the array being sorted. I now consider this a silly problem, for two reasons: copying the whole array is not necessary, and the memory of a present PC has become sufficiently large since the time this objection was born and discussed.
Still, when we think about the cache, an in-place merge sort is desirable, and that is why I made the algorithm I show now. Its base methods are the bottom-up approach and limiting the block size to 2^N.
The three realizations below are necessary for it (InPlaceMerge.cs).
- An exchange process for two adjacent areas of different sizes.
- A fast way to merge two areas of different sizes.
- The move count of an in-place merge sort is O(N(logN)^2), while the not-in-place version is O(NlogN); when the element count is large, the not-in-place version inevitably pulls ahead, so an O(NlogN) method for the in-place version is necessary.
(1) Exchange process of two adjacent areas of different sizes
Shifting elements one by one is the method that comes to mind first. Knuth supposed a faster idea, but there is still no realized program; that is all I found in internet articles.
But a good way is already used in GNU stable sort: reverse each area independently, then reverse the two areas joined together.
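A minimal sketch of that reversal exchange follows, with assumed names. Every element is written twice (once in its own area's reversal, once in the joined reversal), which is where the 2(M+N) move count mentioned below comes from.

```csharp
// Minimal sketch of the reversal exchange described above (assumed names):
// reverse each area, then reverse the whole joined span.
static void Reverse(int[] a, int lo, int hi)          // reverses a[lo..hi] inclusive
{
    while (lo < hi) { (a[lo], a[hi]) = (a[hi], a[lo]); lo++; hi--; }
}

// Exchange adjacent areas a[start..start+m-1] (size m) and a[start+m..start+m+n-1] (size n).
static void ReversalExchange(int[] a, int start, int m, int n)
{
    Reverse(a, start, start + m - 1);                 // reverse the left area
    Reverse(a, start + m, start + m + n - 1);         // reverse the right area
    Reverse(a, start, start + m + n - 1);             // reverse the joined span
}
```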
Before I knew of that, I had found a different way.
- Shift the smaller area by an integer multiple of its own length. (This length must not exceed the larger area's length.)
- If (larger area length) / (smaller area length) is an integer, the process finishes. Otherwise, apply step 1 again between the shifted smaller area and the leftover part of the larger area.
- Repeat until it finishes.
This is exactly the same recursion as the calculation of the greatest common divisor, so I name it GCD exchange.
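Below is a sketch of the GCD exchange idea, using pairwise block swaps for simplicity; the names are mine. The article's description shifts the smaller area by whole multiples of its length at once, which is presumably how the lower move count mentioned next can be reached, so treat this only as an illustration of the GCD-shaped recursion.

```csharp
// Hedged sketch of the GCD exchange idea (assumed names; not the attached InPlaceMerge.cs):
// repeatedly swap the smaller block forward, then recurse on the leftover, mirroring Euclid's GCD.
static void SwapBlocks(int[] a, int i, int j, int len)
{
    for (int k = 0; k < len; k++)
        (a[i + k], a[j + k]) = (a[j + k], a[i + k]);
}

// Exchange adjacent areas a[start..start+m-1] (size m) and a[start+m..start+m+n-1] (size n).
static void GcdExchange(int[] a, int start, int m, int n)
{
    while (m > 0 && n > 0)
    {
        if (m <= n)
        {
            SwapBlocks(a, start, start + m, m);     // shift the smaller left area one step right
            start += m;
            n -= m;                                 // the surplus of the larger area remains
        }
        else
        {
            SwapBlocks(a, start + m - n, start + m, n);  // symmetric case: the right area is smaller
            m -= n;
        }
    }
}
```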
Suppose the two area sizes are M and N. The GNU way always needs 2(M+N) moves; GCD exchange needs at least (M+N) moves and only approaches 2(M+N) in the worst case. The measurement results are in the table below. GCD exchange can also divide the shifting into pieces that fit within the cache size, so I think it is superior to the GNU way in practical use. Its program listing looks a little complex, but if it were realized as a logic circuit it would be about as complex as the other ways.
Merge (2^N) Sort (In-place, GNU exchange) |
8 | 12 | 17 | 15.73333 | 0 | 46 | 25.02857 |
10 | 15 | 27 | 23.84444 | 0 | 78 | 40.96190 |
12 | 20 | 33 | 30.35556 | 0 | 110 | 57.40433 |
100 | 533 | 573 | 557.85000 | 2682 | 3868 | 3307.00000 |
512 | 3910 | 4007 | 3962.97852 | 63900 | 76296 | 70696.91406 |
1000 | 8654 | 8771 | 8719.70700 | 245960 | 285706 | 266366.66400 |
Merge (2^N) Sort (In-place, GCD exchange) |
8 | 12 | 17 | 15.70476 | 0 | 33 | 18.49524 |
10 | 15 | 27 | 23.26032 | 0 | 52 | 28.60635 |
12 | 20 | 33 | 30.14113 | 0 | 72 | 39.64978 |
100 | 530 | 568 | 553.34000 | 1550 | 2159 | 1867.88000 |
512 | 3910 | 3994 | 3953.73047 | 33723 | 39902 | 37093.05664 |
1000 | 8640 | 8754 | 8700.99400 | 126922 | 146627 | 137043.47400 |
(2) Speed investigation of merging two areas of different sizes
Generally "merge sort" means the top-down version; the bottom-up version sometimes gets no attention.
In the table of the previous chapter we can see that the top-down merge sort is faster than the bottom-up one at some points. Top-down always merges areas of almost the same size, while bottom-up must sometimes merge areas of very different sizes. We must handle this case to gain speed.
Below is a table for the case of merging 1 to 8 elements into an 8-element area: the number of combinations, broken down by which of the 9 possible locations in the 8-element area receives the top element of the inserted 1 to 8 elements.
| Area data amount | Insert 1 | Insert 2 | Insert 3 | Insert 4 | Insert 5 | Insert 6 | Insert 7 | Insert 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 9 | 45 | 165 | 495 | 1287 | 3003 | 6435 |
8 | 1 | 8 | 36 | 120 | 330 | 792 | 1716 | 3432 |
| 1 | 7 | 28 | 84 | 210 | 462 | 924 | 1716 |
| 1 | 6 | 21 | 56 | 126 | 252 | 462 | 792 |
| 1 | 5 | 15 | 35 | 70 | 126 | 210 | 330 |
| 1 | 4 | 10 | 20 | 35 | 56 | 84 | 120 |
| 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Total | 9 | 45 | 165 | 495 | 1287 | 3003 | 6435 | 12870 |
Each column is the kind of Σ (summation) calculation we learned in high school.
- More than half (4 to 8 elements): the usual merge from the top is fine. (Precisely speaking, for 4 to 7 elements the element next to the top is taken at its center location.)
- 1 element: binary search and insertion; it is binary search itself (a sketch follows this list).
- 2 elements: a multiplication and the subtraction of a decreasing value; such arithmetic is no burden for the CPU.
- Others: the calculation becomes an N-th power formula minus an (N-1)-th power formula. When the element count is large, it exceeds negligible time and complexity. Since the value is not the final result but only the starting point of the search, I settled for an approximate start value: when the element count exceeds 2, calculate the 2-element value and multiply it by 1/2^(N-2).
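Here is the promised sketch of the 1-element case: a plain binary search for the insertion point followed by a one-position shift. The function and variable names are assumptions, not taken from the attached source.

```csharp
// Hedged sketch (assumed names): the single element a[lo] must be merged into the
// sorted area a[lo+1 .. hi]. A binary search finds its insertion point, then the
// elements in front of it are shifted one slot to the left.
static void MergeOneElement(int[] a, int lo, int hi)
{
    int key = a[lo];
    int left = lo + 1, right = hi + 1;     // search inside the sorted area
    while (left < right)
    {
        int mid = (left + right) / 2;
        if (a[mid] < key) left = mid + 1;
        else right = mid;
    }
    // Shift the elements smaller than key one slot left, then drop key into place.
    for (int i = lo; i < left - 1; i++) a[i] = a[i + 1];
    a[left - 1] = key;
}
```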
This alteration has already been applied to the result tables above, so the counts are improved compared with the previous article.
(3) Making the in-place merge sort move count O(NlogN)
I said in the previous article that, when the data amount is large, switching to the not-in-place merge sort was necessary.
O(N(logN)^2) is unavoidable as long as the merge starts from the top. But I have found an idea that decreases the move count of the in-place merge.
Merge from the center and divide the area, like quicksort. First take the center element of the large area and search for its insertion location in the small area. The probability density of that location should be mountain-shaped, highest at the center, so search outward from the center, one step at a time, in the small area. One merge is thereby divided into two smaller merges, a massive area moves at once, and the total move count drops straight away. I name it center-to-center merge.
The degree of N in its Landau symbol clearly goes down, but I cannot assert that it is truly O(NlogN), because I am not an expert in mathematical theory.
As a side effect, the comparison count increases.
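The sketch below shows the divide-and-recurse structure of the center-to-center merge as I understand it. It is only an illustration under assumptions: it uses a plain binary search where the article searches outward from the center, and a reversal rotation where the GCD exchange could be used; the names are mine, not those of the attached InPlaceMerge.cs.

```csharp
// Hedged sketch of a center-to-center style merge (assumed names; illustration only).
static int LowerBound(int[] a, int lo, int hi, int v)   // first index in [lo,hi) with a[i] >= v
{
    while (lo < hi) { int m = (lo + hi) / 2; if (a[m] < v) lo = m + 1; else hi = m; }
    return lo;
}

static int UpperBound(int[] a, int lo, int hi, int v)   // first index in [lo,hi) with a[i] > v
{
    while (lo < hi) { int m = (lo + hi) / 2; if (a[m] <= v) lo = m + 1; else hi = m; }
    return lo;
}

static void ReverseRange(int[] a, int lo, int hi)       // reverse a[lo..hi-1]
{
    for (hi--; lo < hi; lo++, hi--) (a[lo], a[hi]) = (a[hi], a[lo]);
}

static void Rotate(int[] a, int lo, int mid, int hi)    // exchange adjacent blocks [lo,mid) and [mid,hi)
{
    ReverseRange(a, lo, mid); ReverseRange(a, mid, hi); ReverseRange(a, lo, hi);
}

// Runs a[lo..mid) and a[mid..hi) are sorted; merge them in place.
static void CenterMerge(int[] a, int lo, int mid, int hi)
{
    int leftLen = mid - lo, rightLen = hi - mid;
    if (leftLen == 0 || rightLen == 0) return;
    if (leftLen == 1 && rightLen == 1)
    {
        if (a[lo] > a[mid]) (a[lo], a[mid]) = (a[mid], a[lo]);
        return;
    }

    int cut1, cut2;
    if (leftLen >= rightLen)
    {
        cut1 = lo + leftLen / 2;                        // center element of the larger run
        cut2 = LowerBound(a, mid, hi, a[cut1]);         // its insertion point in the other run
    }
    else
    {
        cut2 = mid + rightLen / 2;
        cut1 = UpperBound(a, lo, mid, a[cut2]);
    }

    Rotate(a, cut1, mid, cut2);                         // one massive block move
    int newMid = cut1 + (cut2 - mid);
    CenterMerge(a, lo, cut1, newMid);                   // two smaller, independent merges;
    CenterMerge(a, newMid, cut2, hi);                   // they could also run on different cores
}
```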
Merge (2^N) Sort (In-place, Center-to-center merge) |
8 | 13 | 23 | 18.46190 | 0 | 33 | 19.00952 |
10 | 18 | 35 | 27.55238 | 0 | 50 | 29.18730 |
12 | 24 | 44 | 35.52915 | 0 | 70 | 39.98225 |
100 | 637 | 698 | 663.45000 | 1130 | 1339 | 1219.56000 |
512 | 4741 | 4963 | 4855.43359 | 11516 | 12248 | 11881.91992 |
1000 | 10551 | 10890 | 10736.78400 | 28425 | 29889 | 29095.46700 |
But it has clear merits.
- Compared with the original merge, the data transfers between cache and main memory are reduced.
- The many smaller merges it produces can be processed in parallel.
A not-in-place merge using a buffer cannot realize this. The algorithm becomes a pure in-place merge sort, finally free of that spell.
Switching between the two in-place merge methods seems good: within the cacheable size, merge with the "from the top" method and build sorted blocks; when the size exceeds the cache, merge with the center-to-center method.
I think this almost reaches the terminus of the in-place merge sort on a single core. Though some adjustment may still be necessary, the necessary techniques should all have appeared by now. "Merge, In-place, Cache adaptable, Exponent of 2 (2^N)": I name it MiceSort for short.
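A hedged sketch of how the two merge methods could be combined in one bottom-up driver follows. The cache threshold, the names, and the naive MergeFromTop stand-in are assumptions; CenterMerge refers to the sketch in (3), and the real MiceSort in the attached source may organize this differently.

```csharp
// Hedged sketch of the switching idea (assumed names and threshold; not the attached MiceSort).
// Bottom-up passes double the run width; cache-sized merges use a from-the-top in-place merge,
// larger merges use the center-to-center merge sketched in (3).
using System;

static class MiceSortSketch
{
    const int CacheableElements = 32 * 1024;          // assumption: tune to the target CPU's cache

    public static void Sort(int[] a)
    {
        int n = a.Length;
        for (int width = 1; width < n; width *= 2)
        {
            for (int lo = 0; lo + width < n; lo += 2 * width)
            {
                int mid = lo + width;
                int hi = Math.Min(lo + 2 * width, n);
                if (hi - lo <= CacheableElements)
                    MergeFromTop(a, lo, mid, hi);      // cache-sized block: merge from the top
                else
                    CenterMerge(a, lo, mid, hi);       // large block: center-to-center merge
            }
        }
    }

    // Naive stand-in for the article's from-the-top merge, kept only to make the sketch complete.
    static void MergeFromTop(int[] a, int lo, int mid, int hi)
    {
        while (lo < mid && mid < hi)
        {
            if (a[lo] <= a[mid]) { lo++; continue; }
            int v = a[mid];
            for (int k = mid; k > lo; k--) a[k] = a[k - 1];
            a[lo] = v;
            lo++; mid++;
        }
    }

    static void CenterMerge(int[] a, int lo, int mid, int hi)
        => MergeFromTop(a, lo, mid, hi);               // placeholder: use the CenterMerge sketch from (3)
}
```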
MiceSort (Merge,In-place,Cache adaptable,Exponent of 2(2^N)) |
8 | 12 | 17 | 15.69048 | 0 | 33 | 18.49524 |
10 | 16 | 25 | 22.60159 | 0 | 52 | 28.60635 |
12 | 21 | 35 | 29.88644 | 0 | 72 | 39.64978 |
100 | 522 | 555 | 541.98000 | 1550 | 2159 | 1867.88000 |
512 | 3917 | 3996 | 3955.87891 | 24785 | 29228 | 27130.36328 |
1000 | 8656 | 8798 | 8722.83000 | 64481 | 75087 | 70091.43700 |
In the measurement results, the comparison count is almost the same as the original and the move count decreases. I converted it to C and held a time competition with GNU qsort. It is slower than qsort when the comparison is a pure comparison only, but it becomes faster when the comparison includes a 1 µs wait.
Parallel Sorting
The in-place merge process can be transformed into parallel processing (Parallel.cs).
I intend to document this process as well as I can.
- Run as many sort robots as the CPU has cores. These robots all perform the same operation, and each robot independently cleans up whatever sort work it finds. When all the sort work has been tidied up, the sort is finished.
If you run all the identical cleaning robots you have at the same time, the time to clean a spacious room shrinks to 1/(number of robots). That is the image of this process.
We have all experienced crashes from unhandled exceptions, or deadlocks, when tasks are launched one after another, asynchronously, at every parallelizable point. Even if we intend to limit the tasks to the core count, some careless modification brings an exception for no obvious reason.
One radical settlement is to give up parallel processing. The other is the method shown here: exactly as many tasks as there are cores keep running, and nothing else. Otherwise nobody can predict that the parallel process works safely in every circumstance.
- Divide the sort data into cacheable-size blocks. One kind of work turns these blocks into sorted blocks; the other kind merges sorted blocks into larger merged blocks and carries the merge up to the higher levels. These two states are necessary.
- Prepare an array that records each block's situation (not sorted, under sorting, or sorted) and its size. Each sort robot refers to this array, finds the unprocessed work, and cleans up what it must do.
When the top-level merge has finished, the sort is complete.
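A minimal sketch of this robot scheme follows, assuming a shared block-state array protected by a lock; the names, states, and task layout are my assumptions and not the structure of the attached Parallel.cs.

```csharp
// Hedged sketch (assumed names and states; not the attached Parallel.cs): a fixed number
// of worker tasks equal to the core count scan a shared block-state array, claim one
// block under a lock, process it, and stop only when nothing is left to claim.
using System;
using System.Threading.Tasks;

enum BlockState { NotSorted, Sorting, Sorted }

class ParallelSortSketch
{
    readonly object gate = new object();
    readonly BlockState[] state;                    // one entry per cacheable-size block

    public ParallelSortSketch(int blockCount) { state = new BlockState[blockCount]; }

    public void Run()
    {
        int workers = Environment.ProcessorCount;   // as many "robots" as cores
        var tasks = new Task[workers];
        for (int w = 0; w < workers; w++)
            tasks[w] = Task.Run(Worker);
        Task.WaitAll(tasks);                        // all blocks tidied up => this stage is finished
    }

    void Worker()
    {
        while (true)
        {
            int claimed = -1;
            lock (gate)                             // claim one unprocessed block
            {
                for (int i = 0; i < state.Length; i++)
                    if (state[i] == BlockState.NotSorted) { state[i] = BlockState.Sorting; claimed = i; break; }
            }
            if (claimed < 0) return;                // nothing left for this robot
            SortBlock(claimed);                     // sort one cacheable-size block in place
            lock (gate) state[claimed] = BlockState.Sorted;
        }
    }

    // Placeholder: a single-core sort of block `i`; merging sorted blocks follows the same claim pattern.
    void SortBlock(int i) { }
}
```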
Sort data are not uniform; they are a mix of almost-sorted regions and random regions. The cache-size blocks also have the effect of smoothing the workload of each core.
The previous article ended there. During the climax of the large merges, the working cores faded away one after another, and the top-level merge of the whole array was done entirely by one core. That is very wasteful on a CPU with many cores, so I added the next process.
- In the merge of sorted blocks, use the center-to-center merge method. The merge is thereby divided into many smaller merges; push them onto a stack. When every merge in the stack has been cleared off, the whole merge is finished.
- Make a management center that gathers the stacks of all the cores and can be referred to from every core. When a core has finished all the work it must do itself, it takes a merge from this center: the largest merge, at the bottom of a stack.
With that, every core keeps working in parallel until the very end.
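A hedged sketch of that management center follows, assuming each pending merge is described by a (lo, mid, hi) range; the stack discipline and the names are my own illustration, not the attached code.

```csharp
// Hedged sketch (assumed names): a shared center holding pending center-to-center merges.
// Each worker pushes the sub-merges it creates and, when idle, takes the largest pending one.
using System.Collections.Generic;

readonly record struct MergeJob(int Lo, int Mid, int Hi)
{
    public int Size => Hi - Lo;
}

class MergeCenter
{
    readonly object gate = new object();
    readonly List<MergeJob> pending = new List<MergeJob>();

    public void Push(MergeJob job)
    {
        lock (gate) pending.Add(job);
    }

    // An idle core takes the largest pending merge (the "stack bottom" of the description above).
    public bool TryTakeLargest(out MergeJob job)
    {
        lock (gate)
        {
            if (pending.Count == 0) { job = default; return false; }
            int best = 0;
            for (int i = 1; i < pending.Count; i++)
                if (pending[i].Size > pending[best].Size) best = i;
            job = pending[best];
            pending.RemoveAt(best);
            return true;
        }
    }
}
```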
I name it "Both sides now sort". From song "Both sides now" lyrics.
Though present situation is not cheerful, I believe this is the fastest sort algorithm, and it almost reaches near to possible limit of sorting on multi-core CPU, and it is basement of future parallel sorting.
I use goto sentence in this program list. In fact, I made goto-less structured program, but I threw away it because it is difficult to guess the process purpose. To keep process order from top to bottom is impossible now, by the obstruction of parallel, asynchronous, callback, etc.. For the readable description, using goto sentence efficiently is necessary now. It is my opinion.
Notes on the Attached Program
One note is that the program sometimes judges that the sorting result is not good.
If you encounter this problem when executing the attached program, increase the Sleep(1) count in the program until the result becomes OK; the current number suits a small data element count.
I know a lock is the proper settlement for this, but a lock slows the sorting down, so I judged that a wait after the sorting is better.
The other note is that the present program does not yet run fastest; it is slower than the single-core MiceSort. The reason may be that I test it on a 2-core i5 notebook.
But I think room for improvement remains. For safety I have used lock statements everywhere a conflict can occur, and I already know some of those locks can be removed. A reconsideration around the locks is necessary.
A speed check on Windows C# may be meaningless anyway. The processing time varies largely with the conditions, simply from moving the location of a piece of the process or from whether Visual Studio is running, and I cannot stop the background tasks of Windows.
Remaining Work
The attached program contains constants such as the cache size. Their values are empirically fast values on my i5 notebook; they are not adapted to theoretically fastest values. They should differ by CPU, and they should differ on your PC. In real sorting, if the comparison process uses many work areas or tables, these values should differ again. On a CPU with few cores, like the one I use now, OS tasks occupy part of the cache of the sorting core. Still, in this experiment I have seen the speed change when these constants change: because a CPU has several levels of cache, how conscious the program is of the cache affects the speed.
Fixing these values properly is impossible for me. The person who can do it is a CPU maker's engineer who knows the cache control logic.
Or only the user programmer can define the values that are efficient for his own purpose.
I attempted to convert "Both sides now sort" to C for a competition with qsort, but parallel processing in C is environment-dependent. The C# shape is easy and convenient for me even if it is slow. Even a conversion that runs on Linux would be meaningless for me, because I do not have an expensive many-core PC. Until an environment-independent parallel construct of C arrives, I postpone it.
Before that, I intend to rework the code around the locks.
Though I feel this may be the last one, I am writing about this theme.
History
- 19th August, 2020: Initial version
- 27th January, 2021: Updated article and source code