I have this algorithm:

C++
double x1[20];
double x2[20];
double x3[20];
double x4[20];
double x5[20];
double x6[20];
double x7[20];
double x8[20];
double x9[20];
double x10[20];
double x11[20];
double x12[20];

for(int i1 = 0; i1 <= 20; i1++)
for(int i2 = 0; i2 <= 20; i2++)
for(int i3 = 0; i3 <= 20; i3++)
for(int i4 = 0; i4 <= 20; i4++)
for(int i5 = 0; i5 <= 20; i5++)
for(int i6 = 0; i6 <= 20; i6++)
for(int i7 = 0; i7 <= 20; i7++)
for(int i8 = 0; i8 <= 20; i8++)
for(int i9 = 0; i9 <= 20; i9++)
for(int i10 = 0; i10 <= 20; i10++)
for(int i11 = 0; i11 <= 20; i11++)
for(int i12 = 0; i12 <= 20; i12++)
    f = x1[i1] + x2[i2] + x3[i3] + x4[i4] + x5[i5] + x6[i6] + x7...
I want to increase the speed of its execution. In fact, there are 20 to the power of 12 combinations!

I tried to use OpenMP, but got no improvement. I read something about BLAS, but I haven't tested it yet.

I took it to the GPU with CUDA. I tried giving the first and second loops, which together repeat 400 times, to 400 CUDA cores, so that at least two loops would be eliminated. The speed got worse.
Of course, when the size of the first four arrays is 1, plain C++ executes the remaining 20 to the power of 8 iterations in approximately 1300 milliseconds. But that is too much work to run in a single CUDA core!
Now, what if I increase the dimensions and, for example, instead of two loops of 400 combinations, give three loops of 8000 combinations to CUDA?

What I have tried:

I tried to use OpenMP, but got no improvement. I read something about BLAS, but I haven't tested it yet.
Posted
Updated 2-Aug-24 1:22am
Comments
PIEBALDconsult 2-Aug-24 10:25am    
I don't see an algorithm.
Moharram 4-Aug-24 7:30am    
I want to calculate the minimum of the variable f.
PIEBALDconsult 4-Aug-24 10:32am    
The minimum of f is Double.MinValue
jeron1 2-Aug-24 10:55am    
ring == for loop?
0x01AA 2-Aug-24 11:13am    
I suggest googling 'parallelization of nested for loops' and studying the results.

Quote:
Speed up an algorithm?

Difficult to tell, since the code shown is incomplete and bugged to the bone.
As is, the sample code can be simplified to:
C++
double x1[20];
double x2[20];
double x3[20];
double x4[20];
double x5[20];
double x6[20];
double x7[20];
double x8[20];
double x9[20];
double x10[20];
double x11[20];
double x12[20];

for(int i1 = 0; i1 <= 20; i1++)
for(int i2 = 0; i2 <= 20; i2++)
for(int i3 = 0; i3 <= 20; i3++)
for(int i4 = 0; i4 <= 20; i4++)
for(int i5 = 0; i5 <= 20; i5++)
for(int i6 = 0; i6 <= 20; i6++)
for(int i7 = 0; i7 <= 20; i7++)
for(int i8 = 0; i8 <= 20; i8++)
for(int i9 = 0; i9 <= 20; i9++)
for(int i10 = 0; i10 <= 20; i10++)
for(int i11 = 0; i11 <= 20; i11++)
for(int i12 = 0; i12 <= 20; i12++)
    f = x1[20] + x2[20] + x3[20] + x4[20] + x5[20] + x6[20] + x7...

Let's guess that the missing parts matter.
 
Comments
Moharram 8-Aug-24 0:06am    
That is pseudocode.
Patrice T 8-Aug-24 9:10am    
Which is oversimplified, and thus hides the interesting parts of the algorithm, which prevents any helpful answer.
I've had good results with OpenMP. Assuming we can represent the body of the innermost loop as just a simple math operation, we might have something like this:
C
#include <stdio.h>

int main()
{
 double x = 0;
 for(int i1 = 0; i1<20; i1++) {
  printf("i1 = %d\n", i1);
  for(int i2 = 0; i2<20; i2++)
   for(int i3 = 0; i3<20; i3++)
    for(int i4 = 0; i4<20; i4++)
     for(int i5 = 0; i5<20; i5++)
      for(int i6 = 0; i6<20; i6++)
       for(int i7 = 0; i7<20; i7++)
        for(int i8 = 0; i8<20; i8++)
         for(int i9 = 0; i9<20; i9++)
          for(int i10 = 0; i10<20; i10++)
           for(int i11 = 0; i11<20; i11++)
            for(int i12 = 0; i12<20; i12++)
             x += 0.1;
 }

 printf("x = %g\n", x);
 return 0;
}
Compiling and running that on a Ryzen 9 3900X (12C/24T), after at least 8 hours it had not started the second iteration of the outermost loop. Adding a parallel for loop via OpenMP, we get the following code:
C
#include <stdio.h>
int main()
{
 long double x = 1.0;
#pragma omp parallel for collapse(12) reduction(+:x)
 for(int i1 = 0; i1<20; i1++) {
  for(int i2 = 0; i2<20; i2++)
   for(int i3 = 0; i3<20; i3++)
    for(int i4 = 0; i4<20; i4++)
     for(int i5 = 0; i5<20; i5++)
      for(int i6 = 0; i6<20; i6++)
       for(int i7 = 0; i7<20; i7++)
        for(int i8 = 0; i8<20; i8++)
         for(int i9 = 0; i9<20; i9++)
          for(int i10 = 0; i10<20; i10++)
           for(int i11 = 0; i11<20; i11++)
            for(int i12 = 0; i12<20; i12++)
             x += 1.0;
 }

 printf("x = %Lg\n", x);
 return 0;
}
Note we've had to remove the printf from the outermost loop so that OpenMP can do its thing (collapse requires perfectly nested loops), but that's not going to significantly change the runtime, since the innermost body gets run about 4.0e15 times. Honestly, I did not expect that this would make enough difference to run in reasonable time; I was pleasantly surprised to find this ran in under 4 minutes! Of course, whether this will work for you depends on exactly what the body of the innermost loop is actually doing, and on how many threads you have available. But it might be worth trying.
 
Comments
merano99 4-Aug-24 12:40pm    
I have recently applied omp collapse on a server with 48 cores to tackle this problem. Since the task finished extremely quickly, I added a counter count in the inner for loop and included it in the reduction clause for verification. Indeed, the count did not show the expected value for N=20. For smaller values of N, however, count produced the expected result and the computation completed swiftly. My conclusion is that something goes wrong with collapse when N=20. I then merged the first two outer loops and set N to 10. The runtime with 48 threads was 400.8 seconds.
1. It is poor practice to hard-code the number 20 multiple times in the code. It would be better to define N = 20 in one place.
2. For a for loop defined as for (int i12 = 0; i12 <= N; i12++), the array x12 must have at least N+1 elements. Since that is not the case here, the loop should be limited to N elements (i12 < N).
3. The variable f is overwritten with each assignment, so the concrete result is irrelevant here. To ensure the compiler does not optimize away this code, f should be used in some manner. One possible approach is to declare f with the volatile keyword.
4. Since the arrays x1 to x12 are not initialized with data, their values are irrelevant.
5. The arrays x1 to x12 can be considered constant, since they are only accessed for reading.
6. For precise time measurement in C++, the <chrono> library can be used.
7. Compiler optimization is disabled to ensure that the loops are actually executed.
8. Parallelization with OpenMP: #pragma omp parallel for reduction(+:f). If the compiler supports it, all nested loops can be combined with the collapse clause. This creates a single large loop with N^12 iterations, distributed across the available threads.
9. To maximize the program's performance and efficiency, it makes sense to limit the number of threads to the number of logical cores of the system.
10. If more than 20 logical cores are available, outer loops can be combined manually. With 8 threads, the runtime improves only slightly:
C++
for (int i = 0; i < N * N; i++) {
    int i1 = i / N;
    int i2 = i % N;
    // ...
}

Up to 5 loops, I do not see a significant advantage from OpenMP; beyond that, the computation time can be reduced significantly. With OpenMP, the runtime remains in the double-digit second range for 8 loops, whereas without OpenMP it would take impractically long to wait for the result. From the 8th loop onward it also takes a long time even with OpenMP; at that point, I would distribute the computations across additional machines.
 
Comments
Moharram 4-Aug-24 7:34am    
In fact, it's pseudocode.
My assumption is that we want to get the minimum value of the variable f.
merano99 5-Aug-24 9:30am    
Is Moharram Bagheri the same identity as Moharram? If so, why are there two different identities here?
If it is really only about the minimum of f, it would be enough to find the smallest value in each array; you don't have to calculate an insane number of combinations in nested loops.
First of all, expand the original question with all the relevant information, or create a new question in which you describe what you intend to do. In addition to the programming language, it would also be helpful to specify which compilers and operating systems are involved.
Your code will crash because the indexes i1..i12 exceed the array bounds of 20 elements (the loops MUST use i1..i12 < 20).
Your code will also leave f holding only the sum of the last elements of all arrays, since f is assigned rather than accumulated; so as written you don't need any loop at all. Assuming f += was intended, the whole nest collapses to:
double f   = 0;
double mul = 1;
for( int ix = 0;ix<11;ix++ ){ mul *= 20; } // each element appears 20^11 times: once per combination of the other 11 indices
for( int ix = 0;ix<20;ix++ )
{
	f +=     x1[ix]  * mul
	      + x2[ix]  * mul
	      + x3[ix]  * mul
	      + x4[ix]  * mul
	      + x5[ix]  * mul
	      + x6[ix]  * mul
	      + x7[ix]  * mul
	      + x8[ix]  * mul
	      + x9[ix]  * mul
	      + x10[ix] * mul
	      + x11[ix] * mul
	      + x12[ix] * mul
	;
}

To illustrate the solution use two arrays of 2 elements:
for( int i1=0;i1<2;i1++ )
{
	for( int i2=0;i2<2;i2++ )
	{
		f += x1[i1] + x2[i2];
	}
}

// same as
f += x1[0] + x2[0];
f += x1[0] + x2[1];
f += x1[1] + x2[0];
f += x1[1] + x2[1];

// same as
for( int ix=0;ix<2;ix++ )
{
	f += 2*x1[ix] + 2*x2[ix];
}


And three arrays with 2 elements:
for( int i1=0;i1<2;i1++ )
{
	for( int i2=0;i2<2;i2++ )
	{
		for( int i3=0;i3<2;i3++ )
		{
			f += x1[i1] + x2[i2] + x3[i3];
		}
	}
}

// same as
f += x1[0] + x2[0] + x3[0];
f += x1[0] + x2[0] + x3[1];
f += x1[0] + x2[1] + x3[0];
f += x1[0] + x2[1] + x3[1];
f += x1[1] + x2[0] + x3[0];
f += x1[1] + x2[0] + x3[1];
f += x1[1] + x2[1] + x3[0];
f += x1[1] + x2[1] + x3[1];

// same as
for( int ix=0;ix<2;ix++ )
{
	f += 4*x1[ix] + 4*x2[ix] + 4*x3[ix];
}
 
It is not clear what you do with f. If it is suitable, I would use OpenMP on a multi-core CPU for this. You can make the first two loops, the outer ones, parallel with OpenMP. It depends entirely on what you do with f, though; as the code stands, nothing is done with it, so it is uncertain whether this will be feasible.

For more info look up OpenMP and read up on how its parallel for loops work.
 
Comments
Moharram 4-Aug-24 7:21am    
In fact, it's pseudocode.
My assumption is that we want to get the minimum value of the variable f.
Moharram 4-Aug-24 7:27am    
can you write the code?
Rick York 4-Aug-24 12:58pm    
Probably. I have written a fair amount of parallel code for both CPUs and GPUs. If you want a minimum value, then the usual algorithm is called a reduction. There have been a lot of reductions written for CUDA that run on GPUs, so you should read up on those.

Why are you assuming anything? Assumption is the enemy of functional software.
merano99 4-Aug-24 8:05am    
Hello Rick. My guess is that this is only about improving the runtime, not about concrete values. Neither are the arrays initialised, nor is anything actually done with the result in f.
If you switch on compiler optimisation, the compiler notices that there is nothing to do. See my comment in Solution 4.
I applied OMP to this task and realised that you would need a lot of logical processors for 12 nested loops. From the 8th loop onward, it takes too long to wait on a desktop system with 8 logical processors, even with OMP.
In my opinion, the calculation must be distributed to additional computers, unless you have a server with a huge number of logical processors available.
To speed up an algorithm, start by analyzing its time complexity and identify bottlenecks. Optimize data structures and leverage efficient algorithms for critical operations. Consider parallel processing or hardware acceleration where applicable. Regular profiling and iterative improvements will also ensure sustained performance gains.
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


