Michael Klemm, Senior Staff Application Engineer, Intel Deutschland GmbH; Alejandro Duran, Application Engineer, Intel Corporation Iberia; Ravi Narayanaswamy, Senior Staff Development Engineer; Xinmin Tian, Senior Principal Engineer; and Terry Wilmarth, Senior Development Engineer, Intel Corporation
The OpenMP* API has two decades of history behind it, and since its inception, features have been added to keep pace with developments in hardware and software so that it can be used to program the hardware you have. Since the release of version 4.0 in 2013, the OpenMP language has supported heterogeneous and SIMD programming. Similarly, support for programs with irregular parallelism was improved in 2008 with the addition of tasking constructs. OpenMP Technical Report 4: Version 5.0 Preview 1 (TR4 for short) is the next step in the evolution of the OpenMP language. It adds task reductions, extends SIMD parallel programming, and considerably improves the productivity of heterogeneous programming. In this article, we review existing OpenMP features and preview what will be coming soon in implementations supporting TR4.
Tasking: Express Yourself with Tasks
Tasking, or task-based programming, is an important concept for applications that require irregular parallelism (e.g., recursive algorithms, graph traversals, and algorithms operating on unstructured data). Since OpenMP version 3.0, the task construct has provided a convenient way to express the concurrent execution of small units of work that are handed to a scheduler in the OpenMP runtime system.
void taskloop_example() {
  #pragma omp taskgroup
  {
    #pragma omp task
    long_running_task();   // can execute concurrently with the loop tasks

    #pragma omp taskloop collapse(2) grainsize(500) nogroup
    for (int i = 0; i < N; i++)
      for (int j = 0; j < M; j++)
        loop_body();
  }
}
Figure 1. Simple example using the new taskloop construct together with an OpenMP* task
Figure 1 illustrates the creation of an OpenMP task to execute a long-running function and then a loop that has been parallelized using the taskloop construct. This construct appeared with OpenMP 4.5 and provides syntactic sugar to allow programmers to easily parallelize loops using OpenMP tasks. It divides the loop iteration space into chunks and creates one task for each chunk. The construct supports several clauses to allow fine control (e.g., grainsize to control the amount of work per task and collapse to create a product loop out of the i and j loops). TR4 extends the expressiveness of OpenMP tasks by defining new clauses for the taskgroup, task, and taskloop constructs to perform reductions across the generated tasks.
Figure 2 illustrates the creation of tasks to process a linked list and find the minimum value of all elements in the list. The parallel construct creates a parallel region to have worker threads available for task execution. The single construct then restricts execution to one thread that traverses the linked list and generates one task for each list item via omp task. This is a common way to implement a producer-consumer pattern in OpenMP.
Task reductions in TR4 use the taskgroup construct that was introduced in OpenMP version 4.0. It was designed to group tasks logically and to provide a way to await completion of all the tasks in the group. TR4 extends the taskgroup construct to perform reductions through the task_reduction clause, as illustrated in Figure 2. If this clause is added to the construct, all partial results gathered by the individual tasks are aggregated to form the final result at the end of the taskgroup region. Tasks that contribute to the reduction operation must have an in_reduction clause that matches the task_reduction clause of their taskgroup.
Starting with TR4, the taskloop construct supports the reduction and in_reduction clauses with their task reduction semantics. If a reduction clause appears on the taskloop construct, an implicit task group is created, which performs the requested reduction operation at the end of the loop. If an in_reduction clause is added, the tasks generated by the taskloop construct participate in the reduction of an outer taskgroup region.
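As a minimal sketch of the new taskloop reduction syntax (illustrative code, not taken from TR4 itself; the grainsize value is arbitrary), the following function sums an array by letting the generated tasks contribute to a task reduction:

double sum_array(const double *a, int n) {
  double sum = 0.0;
  #pragma omp parallel
  #pragma omp single
  // The reduction clause creates an implicit task group; the partial
  // sums of all generated tasks are combined at the end of the loop.
  #pragma omp taskloop grainsize(1000) reduction(+: sum)
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}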
int find_minimum(list_t * list) {
  int minimum = INT_MAX;
  list_t * ptr = list;
  #pragma omp parallel
  #pragma omp single
  #pragma omp taskgroup task_reduction(min:minimum)
  {
    for (ptr = list; ptr; ptr = ptr->next) {
      #pragma omp task firstprivate(ptr) in_reduction(min:minimum)
      {
        int element = ptr->element;
        minimum = (element < minimum) ? element : minimum;
      }
    }
  }
  return minimum;
}
Figure 2. Traversing a linked list and computing the minimum using task reductions
Offloading: Making the Most of Coprocessors
The OpenMP API strives to improve the usability of offloading pragmas based on user feedback. To that end, new features have been added to TR4 and some existing features have been enhanced. One of the key new features is the ability to automatically detect functions used in offload regions and treat them as if they appeared in a declare target directive. Previously, all functions called in an offload region had to be explicitly tagged using declare target directives. This was hard work, especially if the routines were in header files not owned by the programmer (e.g., the Standard Template Library), and would require declare target directives on the header file itself, which would create a copy of every function in the header file for the device even if some functions were not used in the offload region.
OpenMP 4.5:

#pragma omp declare target
void foo() {
}
#pragma omp end declare target

void bar() {
  #pragma omp target
  {
    foo();
  }
}

TR4:

void foo() {
}

void bar() {
  #pragma omp target
  {
    foo();
  }
}
Figure 3. Automatically detect functions used in offload regions
In Figure 3, the first listing shows what was required in OpenMP version 4.5. Starting with TR4, the second listing is sufficient due to the implicit detection and creation of the device function.
Automatic detection also extends to variables with static storage duration in TR4. The examples in Figure 4 are equivalent.
OpenMP 4.5:

int x;
#pragma omp declare target to(x)

void bar() {
  #pragma omp target
  {
    x = 5;
  }
}

TR4:

int x;

void bar() {
  #pragma omp target
  {
    x = 5;
  }
}
Figure 4. Automatic detection for variables with static storage duration
OpenMP version 4.5 introduced the use_device_ptr clause. The variable in use_device_ptr must be mapped before it can be used. To achieve this, the programmer would need a separate omp target data construct, because a variable may appear in only one data clause. Thus, in OpenMP 4.5, the OpenMP directives in Figure 5 are needed.
#pragma omp target data map(buf)
#pragma omp target data use_device_ptr(buf)
Figure 5. Mapping the variable before it is used in use_device_ptr (OpenMP 4.5)
In TR4, an exception has been made so that the variable can appear in both the map and use_device_ptr clauses on a single construct, as shown in Figure 6.
#pragma omp target data map(buf) use_device_ptr(buf)
Figure 6. Variable can appear in both the map and use_device_ptr clauses
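To illustrate where use_device_ptr typically matters, here is a small sketch (not from the article): inside the target data region, buf evaluates to the device address of the mapped buffer, which can then be handed to a native device library. The routine launch_native_kernel() is a hypothetical placeholder for such a vendor API.

void launch_native_kernel(float *device_buf, int n);   // hypothetical vendor routine

void call_native_kernel(float *buf, int n) {
  #pragma omp target data map(tofrom: buf[0:n]) use_device_ptr(buf)
  {
    // Within this region, buf holds the device address of the mapped
    // buffer, using the combined TR4 form shown in Figure 6.
    launch_native_kernel(buf, n);
  }
}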
Static data members are now permitted in a class inside an omp declare target construct. Class objects with static members can also be used in a map clause (Figure 7).
#pragma omp declare target
class C {
public:
  static int x;
  int y;
};
class C myclass;
#pragma omp end declare target

void bar() {
  #pragma omp target map(myclass)
  {
    myclass.x = 10;
  }
}
Figure 7. Class objects with static members used in a map clause
In addition, virtual member functions are allowed in classes inside an omp declare target construct or in objects used in a map clause. The only caveat is that the virtual member functions can be invoked only on a device if the object is created on the same device.
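A sketch of what this permits (illustrative class names, not from the article): a class hierarchy with a virtual member function is placed in a declare target region, and the virtual call is made on the device where the object is created.

#pragma omp declare target
class Shape {
public:
  virtual double area() const { return 0.0; }
};
class Circle : public Shape {
public:
  double r;
  Circle(double radius) : r(radius) {}
  virtual double area() const { return 3.14159 * r * r; }
};
#pragma omp end declare target

double circle_area_on_device(double radius) {
  double result = 0.0;
  #pragma omp target map(from: result)
  {
    // The object is created on the device, so the virtual call is
    // legal on that same device.
    Circle c(radius);
    Shape *s = &c;
    result = s->area();
  }
  return result;
}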
In OpenMP 4.5, scalar variables used in a reduction or lastprivate clause on a combined construct whose first construct is target are treated as firstprivate for the target construct. This means, surprisingly, that the host value is never updated. To update the value on the host, the programmer had to separate the omp target directive from the combined construct and explicitly map the scalar variable. In TR4, such variables are automatically treated as if a map(tofrom:variable) clause had been applied to them.
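For example (a sketch of the pattern, not code from the article), the explicit map(tofrom: sum) below is what OpenMP 4.5 required to get the reduced value back to the host; under TR4 the map is implied and the clause could be dropped.

double dot(const double *x, const double *y, int n) {
  double sum = 0.0;
  // OpenMP 4.5: sum is firstprivate on the target construct, so an
  // explicit map(tofrom: sum) is needed to update the host value.
  // TR4: the map(tofrom: sum) is applied implicitly.
  #pragma omp target teams distribute parallel for \
      map(to: x[0:n], y[0:n]) map(tofrom: sum) reduction(+: sum)
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];
  return sum;
}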
If a section of a named array is mapped using omp target data, any nested omp target inside the omp target data construct that references the array requires an explicit mapping of either the same section or a subsection of the array used in the outer omp target data map clause. If the explicit mapping is omitted on the inner omp target region, the implicit mapping rules kick in, which under OpenMP version 4.5 implies that the entire array is mapped. This results in a runtime error from mapping a larger-sized array when a subsection of the array is already mapped. Similarly, mapping a field of a structure variable in the outer omp target data construct and using the address of the structure variable inside a nested omp target construct would result in an attempt to map the entire structure variable when part of the structure is already mapped. TR4 has fixed these cases to give the behavior that programmers typically expect (Figure 8).
struct { int x, y, z; } st;
int A[100];

#pragma omp target data map(st.x, A[10:50])
{
  #pragma omp target
  {
    A[20] = 1;      // implicit mapping now refers to the mapped subsection
    foo(&st);
  }
  #pragma omp target map(st.x, A[10:50])
  {
    A[20] = 1;      // explicit mapping of the same section
    foo(&st);
  }
}
Figure 8. Improved mapping of array sections and structure fields
The new features in TR4 improve the programmability of offloading using OpenMP, requiring fewer modifications to the application. The automatic detection of variables and functions used in target regions removes the need for explicit specification. Similarly, the elimination of the need to repeat map clauses inside nested regions, and the ability for a variable to appear in both map and use_device_ptr, reduce the number of OpenMP directives required. The changes to the behavior of reduction variables align the language with programmer expectations. Overall, the cleaner semantics make the use of offload devices within OpenMP applications simpler and more intuitive.
Efficient SIMD Programming
SIMD Loops with Cross-Iteration Dependencies
OpenMP version 4.5 extends the ordered construct by adding a new simd clause. The ordered simd construct declares that a structured block in a SIMD loop or SIMD function must be executed in iteration order or in the order of function calls, respectively. The first example in Figure 9 shows the use of ordered simd blocks to preserve read-write, write-read, and write-write ordering within each iteration and among iterations, while the entire loop can be executed concurrently using SIMD instructions. In the first ordered simd block, the index ind[i] of array a may have a write-write conflict (e.g., ind[0] = 2, ind[2] = 2), so the block needs to be serialized by the ordered simd to allow vectorization of the entire loop. In the second ordered simd block, the myLock(L) and myUnlock(L) operations must be in a single ordered simd block. Otherwise, as part of the loop vectorization (e.g., for a vector length of two), the calls to myLock(L) and myUnlock(L) would each be expanded to two calls, as follows: {myLock(L); myLock(L); …; myUnlock(L); myUnlock(L);}. Nesting the lock functions will typically result in a deadlock. The ordered simd construct shown in the example creates the proper sequence {myLock(L); …; myUnlock(L); myLock(L); …; myUnlock(L);}.
First example (correct use):

#pragma omp simd
for (i = 0; i < N; i++) {
  #pragma omp ordered simd
  {
    a[ind[i]] += b[i];
  }
  #pragma omp ordered simd
  {
    myLock(L);
    if (x > 10) x = 0;
    myUnlock(L);
  }
}

Second example (incorrect use, discussed below):

#pragma omp simd
for (i = 0; i < N; i++) {
  #pragma omp ordered simd
  {
    if (c[i] > 0) q[j++] = b[i];
  }
  #pragma omp ordered simd
  {
    if (c[i] > 0) q[j++] = d[i];
  }
}
Figure 9. Using ordered simd blocks: the first example preserves read-write, write-read, and write-write ordering within each iteration and among iterations; the second violates a write-to-read dependence (see below)
When using the simd clause on the ordered construct, caution is required not to violate inherent dependencies between two ordered simd blocks. The second example in Figure 9 shows an incorrect use of #pragma omp ordered simd, as the order of stores is changed under SIMD execution with respect to serial execution. Assume c[0] = true and c[1] = true. When the loop is executed serially, the order of stores is: q[0] = b[0], q[1] = d[0], q[2] = b[1], q[3] = d[1], and so forth. However, when the loop is executed concurrently with a vector length of two, the order of stores is: q[0] = b[0], q[1] = b[1], q[2] = d[0], q[3] = d[1], and so forth. The change in store ordering is due to a violation of the write-to-read dependency on the variable j between the two ordered simd blocks in the loop. The correct use is to merge the two ordered simd blocks into a single ordered simd block.
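A corrected version of the second example might look like the following sketch (reusing the variable names of Figure 9, here wrapped in a function so the snippet is self-contained), with both conditional stores placed in one ordered simd block:

void compact_two(const int *c, const int *b, const int *d, int *q, int n) {
  int j = 0;
  #pragma omp simd
  for (int i = 0; i < n; i++) {
    #pragma omp ordered simd
    {
      // Both stores through j now sit in the same ordered simd block,
      // so the write-to-read dependence on j is preserved and the
      // store order matches the serial execution.
      if (c[i] > 0) q[j++] = b[i];
      if (c[i] > 0) q[j++] = d[i];
    }
  }
}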
REF/UVAL/VAL Modifier Extensions to the Linear Clause
The linear clause provides a superset of the functionality provided by the private clause. When a linear clause is specified on a construct, the value of the new list item on each iteration of the associated loop(s) corresponds to the value of the original list item before entering the construct, plus the logical number of the iteration multiplied by the linear step. The value corresponding to the sequentially last iteration of the associated loop(s) is assigned to the original list item. When a linear clause is specified on a declarative directive, all list items must be formal parameters (or, in Fortran, dummy arguments) of a function that will be invoked concurrently on each SIMD lane.
The rationale behind adding the ref/uval/val modifiers to the linear clause is to give programmers a way to precisely specify the linear or uniform property of memory references with respect to both address and data value, so the compiler can leverage that information to generate efficient SIMD code using unit-stride loads/stores instead of gathers/scatters. Essentially, for implicitly referenced linear arguments, it is better to mark the reference itself as linear. The semantics of the uval/val/ref modifiers are as follows:
- linear(val(var):[step]) indicates that the value is linear even if the variable is passed by reference. For pass-by-reference arguments, a vector of addresses is passed, so the compiler must generate gathers or scatters.
- linear(uval(var):[step]) indicates that the value passed by reference is linear while the reference itself is uniform. Only the reference for the first lane is passed; the other values can be constructed using the step. The compiler can use a general-purpose register to pass the base address and compute the linear values.
- linear(ref(var):step) indicates that the parameter is passed by reference, the underlying reference is linear, and the memory access will be unit-stride or non-unit-stride depending on the step. The compiler can use a general-purpose register to pass the base address and compute the linear addresses.
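The following declarations sketch how the three modifiers might be written (illustrative function names, not from the article; C++ reference arguments are used, since the modifiers mainly matter for reference and Fortran arguments):

// val: the value of i is linear across lanes; if the argument were a
// reference, a vector of addresses would still be passed (gathers).
#pragma omp declare simd linear(val(i):1)
double load_val(const double *base, int i) { return base[i]; }

// uval: the referenced value is linear, but only the reference of the
// first lane is passed; the other values are reconstructed from the step.
#pragma omp declare simd linear(uval(i):1)
double load_uval(const double *base, const int &i) { return base[i]; }

// ref: the reference itself is linear, so consecutive lanes access
// consecutive memory locations with unit stride.
#pragma omp declare simd linear(ref(x))
double twice(const double &x) { return 2.0 * x; }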
Figure 10 shows a function FOO with arguments X and Y, which are passed by reference in Fortran. The VALUE attribute does not change this behavior; it only says that updates to the value will not be visible to the caller, per the Fortran 2008 language specification. Since the references of X and Y are not annotated as linear, the compiler must generate gather instructions to load (X0, X1, X2, X3) and (Y0, Y1, Y2, Y3), assuming a vector length of four. In Figure 11, the references to X and Y are annotated as linear, so the compiler can generate unit-stride SIMD loads for much better performance.
REAL FUNCTION FOO(X, Y)
!$omp declare simd(FOO)
  REAL, VALUE :: Y      !! pass by reference
  REAL, VALUE :: X      !! pass by reference
  FOO = X + Y           !! gathers generated based on a vector of addresses
END FUNCTION FOO

! ...
!$omp simd private(X, Y)
DO I = 0, N
  Y = B(I)
  X = A(I)
  C(I) = C(I) + FOO(X, Y)
ENDDO
Figure 10. Linearity of reference X and Y is unknown at compile time
REAL FUNCTION FOO(X, Y)
!$omp declare simd(FOO) linear(ref(X), ref(Y))
  REAL, VALUE :: Y      !! pass by reference
  REAL, VALUE :: X      !! pass by reference
  FOO = X + Y           !! unit-stride SIMD loads
END FUNCTION FOO

! ...
!$omp simd private(X, Y)
DO I = 0, N
  Y = B(I)
  X = A(I)
  C(I) = C(I) + FOO(X, Y)
ENDDO
Figure 11. References to X and Y annotated as linear
In Figure 12, the function add_one is annotated as a SIMD function. It has a C++ reference argument, const int &p. Assuming a vector length of four, if p is annotated as linear(ref(p)), the compiler can generate a unit-stride load instruction with the base address p in the rax register to load p[0], p[1], p[2], and p[3] into the xmm0 register. In that case, the add_one function requires only three instructions.
#pragma omp declare simd notinbranch // linear(ref(p))
__declspec(noinline)
int add_one(const int& p) {
  return (p + 1);
}
Figure 12. SIMD code comparison with and without linear(ref(p)) annotation
However, if p is not annotated as linear(ref(p)), the compiler has to assume that four different addresses p0, p1, p2, and p3 are passed in via two xmm registers, and the gather operation is emulated with a sequence of scalar load and packing instructions. As a result, the add_one function requires 16 instructions rather than three.
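As a usage sketch (not from the article), the pattern that linear(ref(p)) describes is a caller in which the reference handed to add_one advances by one element per SIMD lane:

// Each lane of the SIMD loop passes a reference to a consecutive
// element of in[], which is exactly the unit-stride case that
// linear(ref(p)) on add_one (Figure 12) describes.
void add_one_all(const int *in, int *out, int n) {
  #pragma omp simd
  for (int i = 0; i < n; i++)
    out[i] = add_one(in[i]);
}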
Overall, the additional SIMD features in OpenMP version 4.5 allow the user to provide more information to the compiler, which allows vectorization of more loops and the generation of better vector code in many circumstances.
Affinity: Thread Placement Made Easy
The OpenMP version 4.0 specification gave users a standard way to control thread affinity for the first time. It introduced two new concepts to the language:
- Binding policy
- Place partition
The binding policy, specified by the bind-var Internal Control Variable (ICV), determines where the threads of a team will be bound relative to the parent thread’s place. The place partition, specified by the place-partition-var ICV, is the set of places to which threads can be bound. Once a thread is bound to a place for a given team, it should not be moved from that place.
There are three binding policies defined by the specification: master, close, and spread. In describing these policies, we will consider a set of four places, each one a core with two threads. We will show examples of placing three threads and six threads on those places, and we assume that the parent thread will always be on the third place. In the master policy, the master thread is bound to the parent thread's place, and then the remaining threads in the team are assigned to the same place as the master thread (Table 1).
Table 1. Thread placement for the master policy
|               | Place 1: {0,1} | Place 2: {2,3} | Place 3: {4,5} (parent) | Place 4: {6,7} |
| Three threads |                |                | 0, 1, 2                 |                |
| Six threads   |                |                | 0, 1, 2, 3, 4, 5        |                |
The close policy starts by placing the master thread on the parent thread's place and then proceeds in a round-robin fashion with the remaining threads in the team. To place T threads on P places, the master's place gets roughly the first T/P threads, the next place in the place partition gets the next T/P threads, and so on, wrapping around in the place partition as needed, giving the distribution shown in Table 2.
Table 2. Thread placement for the close policy
|               | Place 1: {0,1} | Place 2: {2,3} | Place 3: {4,5} (parent) | Place 4: {6,7} |
| Three threads | 2              |                | 0                       | 1              |
| Six threads   | 4              | 5              | 0, 1                    | 2, 3           |
With the spread policy, things get very interesting. The threads are placed such that they are spread out over the available places. This is accomplished by forming T roughly even subpartitions of the place partition, or P subpartitions if T >= P. If T <= P, each thread gets its own subpartition, starting with the master thread, which gets the subpartition containing the place to which the parent thread is bound. Each subsequent thread is bound to the first place in each subsequent subpartition, wrapping around as needed. If T > P, sets of consecutive threads get the same subpartition, which in this case consists of a single place; thus, all the threads in the set are bound to the same place. The subpartitions formed are shown in curly braces in Table 3. They are important if nested parallelism is used, since they determine the resources available to each nested parallel region.
Table 3. Thread placement and subpartitions for the spread policy
|               | Place 1: {0,1} | Place 2: {2,3} | Place 3: {4,5} (parent)               | Place 4: {6,7} |
| Three threads | 1 {{0,1}}      | 2 {{2,3}}      | 0 {{4,5},{6,7}} (0 is bound to {4,5}) |                |
| Six threads   | 4 {{0,1}}      | 5 {{2,3}}      | 0, 1 {{4,5}}                          | 2, 3 {{6,7}}   |
OpenMP version 4.0 also provides a query function for the thread affinity binding policy: omp_proc_bind_t omp_get_proc_bind(). It returns the binding policy to be used in the next parallel region (assuming that no proc_bind clause is specified on that region).
What is really interesting about the spread policy is what happens with the subpartitions. With the master and close policies, each implicit task inherits the place partition of the parent implicit task. But with the spread policy, implicit tasks get their place-partition-var ICV set to the subpartition instead. This means that a nested parallel construct will have all of its threads placed within the subpartition of its parent.
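A small sketch of this behavior (illustrative code, assuming, for example, OMP_PLACES=cores in the environment and nested parallelism enabled): an outer spread team splits the place partition into subpartitions, and each inner close team stays within its parent's subpartition.

#include <omp.h>
#include <stdio.h>

void nested_affinity_example(void) {
  #pragma omp parallel proc_bind(spread) num_threads(4)
  {
    // Each outer thread owns a subpartition of the place partition.
    #pragma omp parallel proc_bind(close) num_threads(2)
    {
      // Inner threads are bound to places inside the parent's subpartition.
      printf("outer thread %d, inner thread %d, bound to place %d\n",
             omp_get_ancestor_thread_num(1),
             omp_get_thread_num(),
             omp_get_place_num());
    }
  }
}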
The value of bind-var can be initialized via the environment variable OMP_PROC_BIND. It can also be overridden by adding a proc_bind clause to a parallel construct. The place-partition-var is specified via the OMP_PLACES environment variable. Places can be hardware threads, cores, sockets, or specific quantities of those. They can also be explicit processor lists. More details can be found in the OpenMP API specification.
The OpenMP 4.5 specification enhanced the language’s affinity capabilities by providing a set of functions capable of querying aspects of the place partition and binding place of the current thread. These new API functions are useful for confirming the correctness of the settings to achieve the programmer’s desired thread affinity. This is particularly important when the complexity of the code is high and nested parallelism is used in conjunction with the spread binding policy to place threads in nested parallel regions such that they share lower-level caches. These API functions are:
- int omp_get_num_places(): Returns the number of places in the place-partition-var in the execution environment of the initial task.
- int omp_get_place_num_procs(int place_num): Returns the number of processors available to the execution environment in the place specified by place_num in the place partition.
- void omp_get_place_proc_ids(int place_num, int *ids): Stores the identifiers of the processors available to the execution environment in the place specified by place_num into the array ids, which the caller must provide with enough room to hold them.
- int omp_get_place_num(void): Returns the number of the place in the place partition to which the encountering thread is bound.
- int omp_get_partition_num_places(void): Returns the number of places in the place partition of the innermost implicit task. Note that this differs from omp_get_num_places() in that it shows the effects of the spread binding policy as the place partition gets broken into subpartitions, whereas omp_get_num_places() always shows the full original place partition.
- void omp_get_partition_place_nums(int *place_nums): Stores the list of place numbers corresponding to the place partition of the innermost implicit task into the caller-provided array place_nums. Note that the place numbers are the numbers of the places in the full original place partition. This function is particularly useful to see which places from the original place partition appear in a subpartition resulting from the use of the spread binding policy.
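Putting a few of these routines together, a sketch of a diagnostic routine (illustrative code, assuming no more than 64 places) that prints each thread's binding and the subpartition it sees could look like this:

#include <omp.h>
#include <stdio.h>

void report_places(void) {
  printf("full place partition: %d places\n", omp_get_num_places());
  #pragma omp parallel proc_bind(spread)
  {
    int place_nums[64];   // assumes at most 64 places
    int nplaces = omp_get_partition_num_places();
    omp_get_partition_place_nums(place_nums);
    #pragma omp critical
    {
      printf("thread %d is bound to place %d; subpartition:",
             omp_get_thread_num(), omp_get_place_num());
      for (int i = 0; i < nplaces; i++)
        printf(" %d", place_nums[i]);
      printf("\n");
    }
  }
}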
Outlook on the OpenMP API Specification, Version 5.0
The OpenMP Architecture Review Board (ARB) is discussing other features that could appear in version 5.0 of the OpenMP specification. The most likely candidates are described in this section.
Memory Management Support
How to support increasingly complex memory hierarchies within OpenMP applications is an area of active discussion. This complexity comes from multiple directions: new memories with different characteristics (such as MCDRAM on Intel® Xeon Phi™ Processors or Intel® 3D XPoint™ memory), the need to request certain characteristics of the allocated memory to ensure good performance (e.g., certain alignments or page sizes), the need for special compiler support for some memories, NUMA effects, etc. Furthermore, many new memory technologies are being investigated, so any proposal needs to be extensible to handle future technologies.
The current working direction is based on two key concepts that try to model the different technologies and operations: memory spaces and allocators. Memory spaces represent system memory with a set of traits (e.g., page size, capacity, bandwidth, etc.) that programmers can specify to find the memory that they want their program to use. Allocators are objects that allocate memory from a memory space and can also have traits that alter their behavior (e.g., the alignment of allocations).
New APIs are defined to manipulate memory spaces and allocators, and to allocate and deallocate memory. Separating the calls for creating an allocator and allocating memory allows the construction of maintainable interfaces where decisions about where memory should be allocated are taken in a common “decision” module. Figure 13 shows how the proposal can be used to select the memory with the highest bandwidth in the system that uses 2 MB pages and defines two different allocators from that memory: one that ensures allocations are 64-byte-aligned and one that does not.
omp_memtrait_set_t trait_set;
omp_memtrait_t traits[] = { {OMP_MTK_BANDWIDTH, OMP_MTK_HIGHEST},
                            {OMP_MTK_PAGESIZE,  2*1024*1024} };
omp_init_memtrait_set(&trait_set, 2, traits);
omp_memspace_t *amemspace = omp_init_memspace(&trait_set);

omp_alloctrait_t trait = { {OMP_ATK_ALIGNMENT}, {64} };
omp_alloctrait_set_t alloc_trait_set;
omp_init_alloctrait_set(&alloc_trait_set, 1, &trait);
omp_allocator_t *aligned_allocator   = omp_init_allocator(amemspace, &alloc_trait_set);
omp_allocator_t *unaligned_allocator = omp_init_allocator(amemspace, NULL);

double *a = (double *) omp_alloc(aligned_allocator,   N * sizeof(double));
double *b = (double *) omp_alloc(unaligned_allocator, N * sizeof(double));
Figure 13. Selecting the memory with the highest bandwidth
A new allocate directive is proposed to affect the underlying allocation of variables that are not allocated with an API call (e.g., automatic or static variables). A new allocate clause can be used to affect the allocations done by OpenMP directives (e.g., private copies of variables). Figure 14 shows how to use the directive to change the allocation of variables a and b to a memory with the highest bandwidth that also uses 2 MB pages. The private copies of b in the parallel region of the example are allocated on a memory with the lowest latency.
int a[N], b[M];
#pragma omp allocate(a,b) memtraits(bandwidth=highest, pagesize=2*1024*1024)

void example() {
  #pragma omp parallel private(b) allocate(memtraits(latency=lowest): b)
  {
  }
}
Figure 14. Changing the allocation of variables a and b to a memory with the highest bandwidth that also uses 2 MB pages
Improvements for Heterogeneous Programming
Several features are being considered to improve OpenMP’s device support:
- Currently, structures in a map clause are bitwise copied, including pointer fields in a structure. If the programmer requires a pointer field to point to valid device memory, that requires creating the device memory and explicitly updating the pointer field on the device. The committee is discussing extensions to the map clause that would enable the programmer to specify automatic attaching/detaching of pointer fields in a structure.
- The ARB is considering allowing function pointers to be used in a target region and also allowing a function pointer to appear in declare target.
- New device memcpy routines that can execute asynchronously.
- Support to enable an “execute on device or fail” semantics of the target construct. Currently, target regions can execute silently on the host when the device is not available.
- Support for variables and functions that only exist on devices and are not copies of host-based ones.
- Support for multiple types of devices in a single application.
Tasking Improvements
Beyond the features discussed in this article, other tasking features being considered for OpenMP 5.0 are:
- Enabling data dependencies between taskloop constructs.
- Enabling data dependencies in both the task and taskloop constructs to contain expressions that could be expanded to multiple values, generating more than one dependence from a single depend clause.
- Providing support to express task-to-thread affinity with patterns similar to those of OMP_PROC_BIND.
Other Changes
Other areas where there are active discussions of additional features for OpenMP 5.0 include:
- Upgrade the OpenMP base language specification to C11, C++11, or C++14, and Fortran 2008.
- Relax the restrictions of the collapse clause to allow nonrectangular loop shapes and allow code to appear between the loops in the nest.
- Allow reductions to happen in the middle of a parallel region without being associated with a work-sharing construct.
The OpenMP API is the gold standard of a portable and vendor-neutral parallel programming language for shared memory parallelization of C, C++, and Fortran applications in high-performance computing and beyond. And with upcoming developments in version 5.0, it promises to offer much more for developers to fully utilize the capabilities of modern processors.