Hi there! I am running a Monte Carlo simulation in C to study a very specific system where there are interactions among atoms and molecules on a square lattice. I wanted to improve the performance of my program a bit because it sometimes takes a long time to run, and I came across several threads that encourage parallelizing this kind of code. So I started to use the OpenMP library (I'm not an expert in programming by any means, btw). The problem is that it has taken longer than expected, and I dare say the program now takes longer than when I was not trying to parallelize it at all. I don't know what I am doing wrong.

These are the most important parts of my main function:

int main() {
    clock_t start = clock();

    // Seed the PCG RNG
    pcg32_random_t rng;
    pcg32_srandom_r(&rng, time(NULL) ^ (intptr_t)&printf, (intptr_t)&rng);

    int lattice[64 * 64];  // Changed to 64x64 lattice

    int num_simulations = 50;
    int num_runs = 5; // Number of runs for each value of XCO

  /*
    SOME ARRAY DEFINITIONS TO STORE DATA LATER ON
  */
	
    double COADSORPTION_PROBABILITY = 0.05;
	
    #pragma omp parallel
    {
        pcg32_random_t local_rng;
        int thread_num = omp_get_thread_num();
        pcg32_srandom_r(&local_rng, time(NULL) ^ thread_num, (intptr_t)&local_rng);
        
        #pragma omp for
        for (int i = 0; i < num_simulations; i++) {
            double XCO = i / (double)(num_simulations - 1);  // This ensures XCO ranges from 0 to 1
            XCO_values[i] = XCO;

            // Arrays to store the results of multiple runs for the same value of XCO
          

            for (int run = 0; run < num_runs; run++) {
                for (int j = 0; j < 64 * 64; j++) {
                    lattice[j] = EMPTY;
                }

                int co2_count = 0, n2_count = 0;

                // Warm-up phase
                for (int t = 0; t < 30000; t++) {
                    for (int k = 0; k < 64 * 64; k++) {
                        double r = ldexp(pcg32_random_r(&local_rng), -32);

                        if (r < XCO) {
                            double r1 = ldexp(pcg32_random_r(&local_rng), -32);
                            add_CO(lattice, &co2_count, COADSORPTION_PROBABILITY, r1);
                        } else {
                            add_NO_dissociate(lattice, &co2_count, &n2_count);                
                        }
                    }
                }

                // Data collection phase
                Coverage sum_coverage = {0};
                double sum_co2_rate = 0, sum_n2_rate = 0;
                int data_points = 0;

                for (int t = 0; t < 50000; t++) {
                    for (int k = 0; k < 64 * 64; k++) {
                        double r = ldexp(pcg32_random_r(&local_rng), -32);

                        if (r < XCO) {
                            double r1 = ldexp(pcg32_random_r(&local_rng), -32);
                            add_CO(lattice, &co2_count, COADSORPTION_PROBABILITY, r1);
                        } else {
                            add_NO_dissociate(lattice, &co2_count, &n2_count);  
                        }
                    }

                    if ((t+1) % 10 == 0) {  // Every 10 MCS
                        Coverage current_coverage = calculate_coverage(lattice);
                        sum_coverage.CO_coverage += current_coverage.CO_coverage;
                        sum_coverage.O_coverage += current_coverage.O_coverage;
                        sum_coverage.N_coverage += current_coverage.N_coverage;
                        
                        sum_co2_rate += (double)co2_count / (10 * 64 * 64);
                        sum_n2_rate += (double)n2_count / (10 * 64 * 64);
                        
                        co2_count = 0;  // Reset counters every 10 MCS
                        n2_count = 0;
                        
                        data_points++;
                    }
                }

                // Calculate averages
                co_coverages[run] = sum_coverage.CO_coverage / data_points;
                o_coverages[run] = sum_coverage.O_coverage / data_points;
                n_coverages[run] = sum_coverage.N_coverage / data_points;
                co2_rates[run] = sum_co2_rate / data_points;
                n2_rates[run] = sum_n2_rate / data_points;
            }

            // Calculate the mean and standard deviation of the coverages and rates of production for this value of XCO
            co_coverages_mean[i] = mean(co_coverages, num_runs);
            // OTHER COVERAGES MEANS

            co_coverages_std_dev[i] = std_dev(co_coverages, num_runs);
            // OTHER STANDARD DEVIATIONS
        }
    } // End of OpenMP parallel region

//Print the data in a txt file
    for (int i = 0; i < num_simulations; i++) {
        fprintf(file, "%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\t%f\n", XCO_values[i], co_coverages_mean[i], o_coverages_mean[i], n_coverages_mean[i], co2_rates_mean[i], n2_rates_mean[i], co_coverages_std_dev[i], o_coverages_std_dev[i], n_coverages_std_dev[i], co2_rates_std_dev[i], n2_rates_std_dev[i]);
    }

    fclose(file);
    printf("The results were printed in the file\n");
    clock_t end = clock();
    double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Total runtime: %f seconds\n", cpu_time_used);

    return EXIT_SUCCESS;
}


What I have tried:

For more context, my laptop has 2 cores and 4 logical processors, according to Task Manager (I'm using Windows 10). The CPU is an Intel Core i5-4210U at 1.7 GHz.

An i5-4210U has two cores and can support 4 simultaneous threads - so if you are trying to use more than 3 threads you are probably going to start slowing down the processing as the task switching adds overhead (and the more threads waiting for core time the more overhead you add).

Threading isn't a silver bullet which automagically speeds everything up: it has to be planned carefully in order to work, and any thread which grabs a resource the others need is going to prevent them from executing at all. Now, I have no idea how many threads you are running, or what resources each uses - I'm not familiar with the OpenMP library - but if you are getting slower runs than the single-threaded approach, it's quite likely that some form of contention is forcing threads to wait until a different thread is finished with a shared resource, then fight among themselves to get access to it.
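
If it is useful as a reference, OpenMP lets you cap the number of threads explicitly instead of letting the runtime decide. A minimal sketch (the count used here is only an example; with this CPU you might try 2, matching the physical cores, or 4, matching the logical processors):
C
#include <omp.h>
#include <stdio.h>

int main(void)
{
    // Cap the team size; setting the OMP_NUM_THREADS environment
    // variable achieves the same thing without recompiling.
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        // Each thread reports its id once, so you can see how many
        // threads the runtime actually created.
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}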
 
 
OpenMP should work OK for your application but to make best use of it you have to inform it of a few things and it does not appear to me that you did this. To be specific, you have to let it know what variables are local to each thread and what ones are shared. This is done to a certain degree automatically but I usually obtain better results when I do it myself. For your code, I would guess it needs to look something like this :
C++
#pragma omp parallel default( none ) \
    private( XCO, run, j, t, k, r, sum_coverage, sum_co2_rate, data_points, current_coverage ) \
    shared( XCO_values, num_simulations, co_coverages_mean, co_coverages_std_dev, num_runs )
#pragma omp for
        for (int i = 0; i < num_simulations; i++) {
            // ... loop body exactly as in the question above ...
        }
I am not sure I covered them all, but I am close. As I mentioned, a private variable means each thread gets its own copy, while a shared variable is shared between all threads. The thing is, if a variable that should be private to a thread is treated as shared, every thread ends up reading and writing the same memory, and the resulting contention (and possible corruption of results) will hurt performance. This is why it is important to flag which variables are private to a thread and which ones are shared between threads.
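
As a toy illustration of the difference (made-up loop and variable names, not the simulation code): the variable scratch is declared before the directive and listed as private, so each thread gets its own copy, while results is shared and still safe because every iteration writes a different element.
C
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double results[N];     // shared: each iteration writes a distinct element
    double scratch = 0.0;  // private: a per-thread temporary

    #pragma omp parallel for default(none) private(scratch) shared(results)
    for (int i = 0; i < N; i++) {   // the loop index is private automatically
        scratch = 0.5 * i;          // each thread works on its own copy
        results[i] = scratch;       // no two threads touch the same element
    }

    printf("results[10] = %f\n", results[10]);
    return 0;
}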
 
 
Creating, managing and synchronizing threads causes overhead that can negate the potential performance gains, especially for relatively small problem sizes and frequent synchronization points. If the work is not distributed evenly among the threads, or threads have to wait for shared resources, performance suffers considerably.

Given the small number of logical processors available, the number of parallel regions should be deliberately kept low and, in particular, the creation of threads within loops should be avoided. It is also imperative to minimize access to shared memory, e.g. by using local variables within the parallel regions.
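
For instance, a reduction gives every thread a private accumulator and combines the copies only once at the end, instead of having all threads update one shared variable on every iteration. A generic sketch, not tied to the simulation above:
C
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double total = 0.0;

    // reduction(+:total) gives each thread its own private copy of total
    // and sums the copies once when the loop finishes, so there is no
    // per-iteration traffic on shared memory.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += 1.0 / (i + 1.0);   // dummy work: a partial harmonic sum
    }

    printf("total = %f\n", total);
    return 0;
}

The same idea applies when the accumulation is more complicated: collect into variables declared inside the parallel region and only write each thread's final results to distinct elements of shared arrays.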
 
 
