1. It is poor practice to hard-code the number 20 multiple times in the code. It would be better to define a constant, e.g. const int N = 20, in one place.
2. For a for-loop defined as for (int i12 = 0; i12 <= N; i12++), the array x12 would need at least N+1 elements. Since it only has N, the loop condition should be i12 < N so that the loop runs over exactly N elements.
3. The variable f is overwritten by each assignment, so the concrete result is irrelevant here. To ensure the compiler does not optimize this code away, f should be used in some manner; one option is to declare f with the volatile keyword.
4. Since the arrays x1 to x12 are not initialized with data, their values are irrelevant.
5. The arrays x1 to x12 can be declared const, since they are only accessed for reading.
6. For precise time measurement in C++, the <chrono> library can be used.
7. Compiler optimization is disabled to ensure that the loops are actually executed.
8. Parallelization with OMP:
#pragma omp parallel for reduction(+:f)
If the compiler supports it, all nested loops can be combined with the collapse clause, e.g. collapse(12) for twelve nested loops. This fuses them into a single iteration space of N^12 iterations, which is distributed across the available threads.
10. To maximize the program's performance and efficiency, it makes sense to limit the number of threads to the number of logical cores of the system.
11. If more than 20 logical cores are available, outer loops can be combined by hand. With 8 threads, the runtime improves only slightly.
for (int i = 0; i < N * N; i++) {
    int i1 = i / N;  // index of the outer loop
    int i2 = i % N;  // index of the inner loop
    // ... original loop body using i1 and i2 ...
}
Up to 5 nested loops, I do not see a significant advantage from OMP; beyond that, the computation time can be reduced significantly with it. With OMP, the runtime for 8 nested loops stays in the double-digit second range, whereas without OMP it would take impractically long to wait for the result. From the 8th loop onward, even OMP runs take a long time; at that point I would distribute the computations across additional machines.