Programming - SIMD Instructions in OpenMP

[Image 1]


Hey it's a me again @drifter1!

Today we continue with the Parallel Programming series about the OpenMP API. I highly suggest you to go read the previous articles of the series, that you can find by the end of this one. The topic of this article is how we specify that SIMD Instructions should be used using OpenMP!

So, without further ado, let's get straight into it!

GitHub Repository

Requirements - Prerequisites


A parallel region is defined using a parallel region construct:

#pragma omp parallel [clause ...]
    /* This code runs in parallel */
Parallel For Loops are defined by using the following syntax:
int i;
#pragma omp parallel for private(i) [clause ...]
for(i = 0; i < N; i++){
Non-iterative work-sharing is defined using a sections construct with nested section directives:
#pragma omp sections [clause ...]
    /* run in parallel by all threads of the team */
#pragma omp section { /* run once by one thread */ }
#pragma omp section { /* run once by one thread */ } ... }
When only one section inside a parallel region has to be executed by one thread then we can use the single construct:
#pragma omp single
Critical Sections are specified using the following construct:
#pragma omp critical [[(name)] [ hint (hint-expr)]] [clause ...]
    /* critical section code */
Atomic operations are defined as:
#pragma omp atomic [read | write | update | capture] [hint(hint-expr)] [clause ...]
    /* statement expression*/
OpenMP also offers Routines for Lock Management of omp_lock_t general-purpose locks:
  • omp_init_lock(): Initialize a lock associated with the lock variable
  • omp_init_lock_with_hint(): Initialize a lock associated with the lock variable using synchronization hints
  • omp_destroy_lock(): Disassociate the given lock variable from any locks
  • omp_set_lock(): Acquire the ownership of a lock, and suspend execution until the lock is set
  • omp_unset_lock(): Release a lock
  • omp_test_lock(): Attempt to set the lock, but don't block if the lock is unavailable
When more-advanced work-sharing is needed, Task constructs are the way to go:
#pragma omp task [clause ...]
    /* code block */
Implicit Barriers are placed at the end of each parallel section (when no nowait clause has been specified).

Explicit barriers can be placed by using the barrier directive:
#pragma omp barrier
In order to force memory consistency, the flush directive is used:
#pragma omp flush [(list)]

What is SIMD?

Let's start off with the basics. SIMD means Single Instruction on Multiple Data, and can be thought of as Vector Parallelism. In other words, the processors implement special instructions that perform the same calculation on multiple values at once. This is commonly used in GPUs, but can also speed up calculation on CPUs. Most modern CPUs contain AVX instruction extensions in them!


To understand it better let's get into an example.

Let's say we have two one-dimensional vectors of size N, A and B. Our Processing Unit has N different computing units for a specific operation (a vector processing unit for N-sized vectors). An one-dimensional vector of size N, C, is the result of that operation, for each corresponding Ai and Bi.

So, mathematically each element Ci is calculated as follows:

As an SIMD Instruction, this becomes:

and so all elements, Ci, are calculated in parallel.

If the vector was larger, let's say M, then it would be split into chuncks of size N, so that the operations can be applied in parallel. Each of these vector instructions is padded with No-Op's when no data is available.

SIMD in OpenMP

So, how do we specify that SIMD Instructions should be used to implement a loop?

In order to indicate to the compiler that a for loop should be transformed into SIMD Instructions, we use the - you guessed it - simd directive:

#pragma omp simd [clause]
    /* for loops */


The behavior of the simd directive can be tweaked using the following clauses:

  • private, lastprivate - Specify the data-sharing of lists of variables
  • reduction - Specify that the compiler should try to reduce the iterations using a specific operator identifier
  • collapse - Specify how many nested loops should be associated with the construct
  • safelen - Specify the greatest distance between two logical iterations's data (basically when two addresses are independent)
  • simdlen - Specify the number of variables to be passed in SIMD registers
  • linear - Specify a list of variables that are private and have linear relationship in respect to the iteration space
  • aligned - Specify that the list of variables should be aligned to a number of bytes

SIMD in Function Definitions/Declarations

There is also a directive that can be used to specify that a complete function definition (or declaration) should be implemented using SIMD Intructions, so that it can be called from an SIMD loop.

The directive for that purpose is declare simd:

#pragma omp declare simd [clause]
    /* function definition or declaration */

This directive can be tweaked using the simdlen, linear and aligned clauses as well as:

  • uniform - Specify that the one or more arguments have invariant values for all concurrent invocations of the function
  • inbranch - Specify that the function will be called inside of conditional statements of a SIMD loop
  • notinbranch - Specify that the function will never be called inside of a conditional statement of a SIMD loop

For Simd Directive

Lastly, OpenMP can also combine the for and simd directive together, using the shortcut:

#pragma omp for simd [clause]
    /* for loops */
Using this directive we specify a for loop that executes concurrently using SIMD instructions, with those iterations being executed in parallel by threads in a team.

All clauses that are accepted by simd and for directives can be used to tweak the for simd shortcut.

Example Program

Let's multiply to large vectors element-wise using:

  • serial processing
  • simd construct
  • parallel for construct
  • parallel for simd construct
The vectors will be of type unsigned short, which is of 2 Bytes, and have about 4 Billion elements (UINT_MAX from limits.h) each!

[Image 2]

The total ram usage is therefore about 26 Gigabytes!

Don't worry I have 32 gigs - he said...
*Behind the scenes*

Well, that was unexpected - he said...
Chrome is a ram-eater though, I was already at about 5 Gigabytes without running the code.

To calculate the execution time, we will use the omp function omp_get_wtime(), which gives more accurate results. :)

Vector Allocation and Initialization

The vectors can be defined as pointers to unsigned short and then allocated using calloc:

#define N UINT_MAX
unsigned short *A, *B, *C;
/* allocate memory */ A = (unsigned short *) calloc(N, sizeof(unsigned short)); B = (unsigned short *) calloc(N, sizeof(unsigned short)); C = (unsigned short *) calloc(N, sizeof(unsigned short));
The vectors A and B are filled with random values of 1 byte (0 to 255), so that the multiplication doesn't cause any overflow (255 * 255 = 65025 < 216 - 1 = 65535):
for(i = 0; i < N; i++){
    A[i] = rand() % 256;
    B[i] = rand() % 256;

Serial Processing

The element-wise multiplication of A[i] with B[i] results in C[i] and so can be described by the following for loop:

for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];

SIMD Processing

In order to use simd instructions instead of "generic" instructions, we add an simd directive before the loop:

#pragma omp simd
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];
Note that the simd is not marked as parallel and thus the execution is basically still sequential!

Parallel For Processing

This case is also perfect for parallel for loops, and so the loop can be put after a parallel for construct with the loop counter i set as private:

#pragma omp parallel for private(i)
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];

Parallel For SIMD Processing

Lastly, its also possible to define a parallel for simd construct, which optimizes the loop by executing simd instructions:

#pragma omp parallel for simd private(i)
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];

Console Output

The console outputs:


We conclude that serial processing is by far the slowest. SIMD somewhat speeds up the result, but not by much in the case of this program. And Parallel For or Parallel For SIMD is basically the same, because OpenMP automatically detects that the instructions should be SIMD instructions in this simple case. Either way, combining Parallel For with SIMD Instructions (the parallel for simd construct) is a great way of optimizing our programs, when it can be applied. It might have been some milliseconds in this program, but this could translate into minutes or even hours in complicated algorithms. And also, why should we not use the vector-specific processing units that the CPU offers?






Previous articles about the OpenMP API

Final words | Next up

And this is actually it for today's post!

I'm not sure what more to cover, so the next article about Parallel Programming might be about Nvidia's CUDA API!

See ya!

Keep on drifting!

3 columns
2 columns
1 column
Join the conversation now