Introduction
I tried to find a simple OpenMP (multi-platform shared-memory parallel programming in C/C++ and Fortran, http://openmp.org/wp/) sample for image processing, without success. So I decided to dig into the details on my own. This is a very short article, since there is plenty of further material out there to read. I do not include the complete project here, since it is only a console sample.
However, here are the results I would like to share with you.
Using the code
First we have to find out how many physical cores are present (using all logical processors, including hyper-threaded ones, can actually reduce performance). This can be done by calling GetLogicalProcessorInformation and counting the processor cores. The resulting number is then passed to the OpenMP library by calling omp_set_num_threads().
// Function pointer type for GetLogicalProcessorInformation, resolved at run time
typedef BOOL (WINAPI *LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);

int OMP_Prepare()
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    DWORD numCPU = sysinfo.dwNumberOfProcessors;   // logical processors

    LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(
        GetModuleHandle(TEXT("kernel32")),
        "GetLogicalProcessorInformation");

    DWORD processorCoreCount = 0;
    if (glpi != NULL)
    {
        // First call only queries the required buffer size
        DWORD returnLength = 0;
        glpi(NULL, &returnLength);

        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(returnLength);
        glpi(buffer, &returnLength);

        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = buffer;
        DWORD byteOffset = 0;
        while (byteOffset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= returnLength)
        {
            if (ptr->Relationship == RelationProcessorCore)
                processorCoreCount++;
            byteOffset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
            ptr++;
        }
        free(buffer);   // allocated with malloc, so free() - not delete
    }
    if (processorCoreCount == 0)
    {
        // Fallback: assume two logical processors per physical core
        processorCoreCount = numCPU / 2;
        if (processorCoreCount == 0)
            processorCoreCount = 1;
    }
    omp_set_num_threads(processorCoreCount);
    return processorCoreCount;
}
Most code samples index into an array within the loop, which is more time consuming than using a pointer. OpenMP is usually used with a combined 'parallel for' construct. Initializing a pointer after this pragma is not possible, because the compiler expects a for(...) loop to follow immediately.
#pragma omp parallel for
for (run = 0; run < size; run++)
    pwdata[run] <<= 4;
I found out that it is possible to initialize a pointer before running into the loop by separating the 'parallel' and the 'for' pragmas. When using pointers, each OMP thread has to be set up to work on its own part of the image.
#pragma omp parallel shared(shared_vars) private(private_vars)
{
    // ... do the per-thread setup and the loop here
}
Inside the braces you can set up the loop as proposed by the OpenMP samples.
Right after the opening brace you can point each thread at its individual position in the image, using the thread id and the size of the image array. Thus each thread gets its own area to work on. Consider a Core i7 CPU, which has four physical cores, each with two hardware threads (Hyper-Threading). The OMP_Prepare() function returns four, the number of physical cores available, so this initialization gives us four threads, and the area to be processed is covered by four threads. Now we have to split the array into four blocks. The block size and the image size are common to all threads and therefore marked as 'shared'; the data pointer and the run index are 'private' to each thread. After the first pragma, each thread computes its own data pointer from size_per_thread and its thread id, which is a running number (0, 1, 2, 3 in our case with four physical cores).
The complete function looks like this:
void OMP_Work(WORD* datain, long xsize, long ysize, int cpucnt)
{
    unsigned short *pwdata;
    long run = 0;
    long size = xsize * ysize;
    long size_per_thread = size / cpucnt;
    int id;

    #pragma omp parallel shared(size_per_thread, size) private(id, pwdata, run)
    {
        id = omp_get_thread_num();
        pwdata = datain + size_per_thread * id;   // each thread starts at its own block

        // schedule(static) assigns thread 'id' the id-th contiguous chunk
        // of iterations, matching the pointer arithmetic above
        #pragma omp for schedule(static)
        for (run = 0; run < size; run++)
            *pwdata++ <<= 4;
    }
    // No explicit barrier is needed: the parallel region ends with an implicit one
}
Here's the function used to measure the time consumed. I use the performance counter. The function has to be called twice: first to initialize, then a second time to query the elapsed time. Initially dTTF is 0.0, which triggers the internal initialization: it queries the counter frequency and the current counter value. After this we are ready for the next call, which returns the time passed since.
double TT()
{
    __int64 T1, T2;
    double dT;
    static double dTTF = 0.0;
    static LARGE_INTEGER LITime1, LITime2;

    if (dTTF == 0.0)            // first call: initialize
    {
        LARGE_INTEGER LIFrequ;
        QueryPerformanceFrequency(&LIFrequ);
        QueryPerformanceCounter(&LITime1);
        dTTF = (double)LIFrequ.QuadPart;
        return 0.0;
    }
    QueryPerformanceCounter(&LITime2);
    T1 = LITime1.QuadPart;
    T2 = LITime2.QuadPart;
    dT = (double)T2 - (double)T1;
    dT /= dTTF;                 // counts divided by frequency gives seconds
    LITime1.QuadPart = LITime2.QuadPart;
    return dT;
}
Here's the usage:
First we call the prepare function, which determines the number of physical processors and passes it to the OMP library. Then we create a dummy array, fill it with dummy data, and call the true-time function TT() once to initialize its internal time variables. OMP_Work() is called to process the data; currently the 'work' part is only a shift. After OMP_Work() has finished, we call TT() again to get the time consumed by the processing.
int main(int argc, char* argv[])
{
    int iCPU = OMP_Prepare();

    long ix = 2560, iy = 2160;
    WORD* ps = new WORD[ix * iy];
    double dt;

    for (long jjj = 0; jjj < ix * iy; jjj++)   // fill with dummy data
        ps[jjj] = 20;

    TT();                          // initialize the timer
    OMP_Work(ps, ix, iy, iCPU);    // process the data
    dt = TT();                     // elapsed time in seconds

    delete[] ps;
    return 0;
}
The loop runs about twice as fast on a Core i7 as the same loop without any OMP statements. The performance gain will probably be higher when doing more demanding calculations than just shifting a WORD.