I tried the posted code.
My idea was to store partial sums of the input data in the array rms, then use barriers (GLOBAL and LOCAL) to wait until every rms[k] is filled, and finally add them all up to obtain the mean value.
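To make the expected result concrete, here is a minimal CPU sketch in plain C of the same two steps. The function names row_sums and mean_from_row_sums are hypothetical, chosen only for illustration, and are not part of the kernel. The second step starts only after the first has finished for every row, which is the ordering I want the GPU to guarantee.

```c
#include <stddef.h>

/* Step 1: one partial sum per row, the same work each work-item does
   in the kernel (row k starts at data + k*width). */
void row_sums(const float *data, int width, int height, double *rms)
{
    for (int k = 0; k < height; k++) {
        const float *x = data + (size_t)k * width;
        double sum = 0.0;
        for (int j = 0; j < width; j++)
            sum += x[j];
        rms[k] = sum;
    }
}

/* Step 2: must run only after every rms[k] has been written.
   Adds the partial sums and divides by the number of elements. */
double mean_from_row_sums(const double *rms, int width, int height)
{
    double sum = 0.0;
    for (int j = 0; j < height; j++)
        sum += rms[j];
    return sum / (double)width / (double)height;
}
```

Calling row_sums and then mean_from_row_sums on the host gives a reference value to compare with the "GPU media" line printed by the kernel.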
I added some printf calls to warn me if there are errors in the calculation.
I get errors at the Warning-2 printf but not at Warning-1 or Warning-3, because when work-item 0 adds up all the data, some of the cores have not yet finished calculating their partial sums.
I did not use local memory because the maximum local size is 256, which is much smaller than height=10000.
How can I make the GPU wait until all partial sums have been calculated?
What I have tried:
I have the following code:
__kernel void hallaRMS2(
    __global float* data,
    int WIDTH,
    int HEIGHT,
    __global double *rms
    )
{
    int k = get_global_id(0);
    __global float *x = data + k*WIDTH;
    double sum = 0.0;
    for (int j = 0; j < WIDTH; j++)
    {
        sum += x[j];
    }
    rms[k] = sum;
    if ((rms[k] < 100*WIDTH) || (rms[k] > 101*WIDTH)) printf("Warning-1: rms[%i]=%lg\n", k, rms[k]);
    barrier(CLK_GLOBAL_MEM_FENCE);
    barrier(CLK_LOCAL_MEM_FENCE);
    if (k == 0)
    {
        sum = 0.0;
        for (int j = 0; j < HEIGHT; j++)
        {
            if ((rms[j] < 100*WIDTH) || (rms[j] > 101*WIDTH)) printf("Warning-2: rms[%i]=%lg\n", j, rms[j]);
            sum += rms[j];
        }
        rms[0] = sum/(double) WIDTH/(double) HEIGHT;
        printf("GPU sum=%lg\n", sum);
        printf("GPU media=%lg\n", rms[0]);
    }
    else
        if ((rms[k] < 100*WIDTH) || (rms[k] > 101*WIDTH)) printf("Warning-3: rms[%i]=%lg\n", k, rms[k]);
...