Programming - SIMD Instructions in OpenMP

Introduction

Hey it's a me again @drifter1!

Today we continue with the Parallel Programming series about the OpenMP API. I highly suggest you to go read the previous articles of the series, that you can find by the end of this one. The topic of this article is how we specify that SIMD Instructions should be used using OpenMP!

So, without further ado, let's get straight into it!

GitHub Repository

https://github.com/drifter1/openmptutorials

Requirements - Prerequisites

Basic understanding of the Programming Language C, or even C++
Familiarity with Parallel Computing/Programming in general
Compiler

Linux users: GCC (GNU Compiler Collection) installed
Windows users: MinGW32/64 - To avoid unnessecary problems I suggest using a Linux VM, Cygwin or even a WSL Environment on Windows 10
MacOS users: Install GCC using brew or use the Clang compiler
For more Compilers & Tools check out: https://www.openmp.org//resources/openmp-compilers-tools/

Previous Articles of the Series

Recap

A parallel region is defined using a parallel region construct:

#pragma omp parallel [clause ...]
{
    /* This code runs in parallel */
}

Parallel For Loops are defined by using the following syntax:

int i;
#pragma omp parallel for private(i) [clause ...]
for(i = 0; i < N; i++){
    ...
}

Non-iterative work-sharing is defined using a sections construct with nested section directives:

#pragma omp sections [clause ...]
{
    /* run in parallel by all threads of the team */

    #pragma omp section
    {
        /* run once by one thread */
    }

    #pragma omp section
    {
        /* run once by one thread */
    } 
    ...
}

When only one section inside a parallel region has to be executed by one thread then we can use the single construct:

#pragma omp single

Critical Sections are specified using the following construct:

#pragma omp critical [[(name)] [ hint (hint-expr)]] [clause ...]
{
    /* critical section code */
}

Atomic operations are defined as:

#pragma omp atomic [read | write | update | capture] [hint(hint-expr)] [clause ...]
    /* statement expression*/

OpenMP also offers Routines for Lock Management of omp_lock_t general-purpose locks:

omp_init_lock(): Initialize a lock associated with the lock variable
omp_init_lock_with_hint(): Initialize a lock associated with the lock variable using synchronization hints
omp_destroy_lock(): Disassociate the given lock variable from any locks
omp_set_lock(): Acquire the ownership of a lock, and suspend execution until the lock is set
omp_unset_lock(): Release a lock
omp_test_lock(): Attempt to set the lock, but don't block if the lock is unavailable

When more-advanced work-sharing is needed, Task constructs are the way to go:

#pragma omp task [clause ...]
{
    /* code block */
}

Implicit Barriers are placed at the end of each parallel section (when no nowait clause has been specified).

Explicit barriers can be placed by using the barrier directive:

#pragma omp barrier

In order to force memory consistency, the flush directive is used:

#pragma omp flush [(list)]

What is SIMD?

Let's start off with the basics. SIMD means Single Instruction on Multiple Data, and can be thought of as Vector Parallelism. In other words, the processors implement special instructions that perform the same calculation on multiple values at once. This is commonly used in GPUs, but can also speed up calculation on CPUs. Most modern CPUs contain AVX instruction extensions in them!

Example

To understand it better let's get into an example.

Let's say we have two one-dimensional vectors of size N, A and B. Our Processing Unit has N different computing units for a specific operation (a vector processing unit for N-sized vectors). An one-dimensional vector of size N, C, is the result of that operation, for each corresponding A_i and B_i.

So, mathematically each element C_i is calculated as follows:

As an SIMD Instruction, this becomes:

and so all elements, C_i, are calculated in parallel.

If the vector was larger, let's say M, then it would be split into chuncks of size N, so that the operations can be applied in parallel. Each of these vector instructions is padded with No-Op's when no data is available.

SIMD in OpenMP

So, how do we specify that SIMD Instructions should be used to implement a loop?

In order to indicate to the compiler that a for loop should be transformed into SIMD Instructions, we use the - you guessed it - simd directive:

#pragma omp simd [clause]
    /* for loops */

Clauses

The behavior of the simd directive can be tweaked using the following clauses:

private, lastprivate - Specify the data-sharing of lists of variables
reduction - Specify that the compiler should try to reduce the iterations using a specific operator identifier
collapse - Specify how many nested loops should be associated with the construct
safelen - Specify the greatest distance between two logical iterations's data (basically when two addresses are independent)
simdlen - Specify the number of variables to be passed in SIMD registers
linear - Specify a list of variables that are private and have linear relationship in respect to the iteration space
aligned - Specify that the list of variables should be aligned to a number of bytes

SIMD in Function Definitions/Declarations

There is also a directive that can be used to specify that a complete function definition (or declaration) should be implemented using SIMD Intructions, so that it can be called from an SIMD loop.

The directive for that purpose is declare simd:

#pragma omp declare simd [clause]
    /* function definition or declaration */

This directive can be tweaked using the simdlen, linear and aligned clauses as well as:

uniform - Specify that the one or more arguments have invariant values for all concurrent invocations of the function
inbranch - Specify that the function will be called inside of conditional statements of a SIMD loop
notinbranch - Specify that the function will never be called inside of a conditional statement of a SIMD loop

For Simd Directive

Lastly, OpenMP can also combine the for and simd directive together, using the shortcut:

#pragma omp for simd [clause]
    /* for loops */

Using this directive we specify a for loop that executes concurrently using SIMD instructions, with those iterations being executed in parallel by threads in a team.

All clauses that are accepted by simd and for directives can be used to tweak the for simd shortcut.

Example Program

Let's multiply to large vectors element-wise using:

serial processing
simd construct
parallel for construct
parallel for simd construct

The vectors will be of type unsigned short, which is of 2 Bytes, and have about 4 Billion elements (UINT_MAX from limits.h) each!

[Image 2]

The total ram usage is therefore about 26 Gigabytes!

Don't worry I have 32 gigs - he said...

*Behind the scenes*

Well, that was unexpected - he said...

Chrome is a ram-eater though, I was already at about 5 Gigabytes without running the code.

To calculate the execution time, we will use the omp function omp_get_wtime(), which gives more accurate results. :)

Vector Allocation and Initialization

The vectors can be defined as pointers to unsigned short and then allocated using calloc:

#define N UINT_MAX

...

unsigned short *A, *B, *C;

/* allocate memory */
A = (unsigned short *) calloc(N, sizeof(unsigned short));
B = (unsigned short *) calloc(N, sizeof(unsigned short));
C = (unsigned short *) calloc(N, sizeof(unsigned short));

...

The vectors A and B are filled with random values of 1 byte (0 to 255), so that the multiplication doesn't cause any overflow (255 * 255 = 65025 < 2¹⁶ - 1 = 65535):

srand(time(NULL));
for(i = 0; i < N; i++){
    A[i] = rand() % 256;
    B[i] = rand() % 256;
}

Serial Processing

The element-wise multiplication of A[i] with B[i] results in C[i] and so can be described by the following for loop:

for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];
}

SIMD Processing

In order to use simd instructions instead of "generic" instructions, we add an simd directive before the loop:

#pragma omp simd
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];
}

Note that the simd is not marked as parallel and thus the execution is basically still sequential!

Parallel For Processing

This case is also perfect for parallel for loops, and so the loop can be put after a parallel for construct with the loop counter i set as private:

#pragma omp parallel for private(i)
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];
}

Parallel For SIMD Processing

Lastly, its also possible to define a parallel for simd construct, which optimizes the loop by executing simd instructions:

#pragma omp parallel for simd private(i)
for (i = 0; i < N; i++){
    C[i] = A[i] * B[i];
}

Console Output

The console outputs:

Conclusion

We conclude that serial processing is by far the slowest. SIMD somewhat speeds up the result, but not by much in the case of this program. And Parallel For or Parallel For SIMD is basically the same, because OpenMP automatically detects that the instructions should be SIMD instructions in this simple case. Either way, combining Parallel For with SIMD Instructions (the parallel for simd construct) is a great way of optimizing our programs, when it can be applied. It might have been some milliseconds in this program, but this could translate into minutes or even hours in complicated algorithms. And also, why should we not use the vector-specific processing units that the CPU offers?

RESOURCES:

References

Images

https://makeameme.org/meme/1-billion

Previous articles about the OpenMP API

OpenMP API Introduction → OpenMP API, Abstraction Benefits, Fork-Join Model, General Directive Format, Compilation
Parallel Regions in OpenMP → Parallel construct, Thread management, Basic Clauses, Example programs
Parallel For Loops in OpenMP → Parallel For Construct, Iteration Scheduling, Additional Clauses, Example programs
Parallel Sections in OpenMP → Parallel Sections Construct, Serial Section, Example Program
Atomic Operations and Critical Sections → Process Synchronization Theory, Critical Section Construct, Atomic Operations, Example Program
Tasks in OpenMP → Task Construct, Synchronization, Loops and Groups, Example Programs
Locks and Barriers in OpenMP → Lock and Barrier Synchronization, Flush for Memory Consistency

Final words | Next up

And this is actually it for today's post!

I'm not sure what more to cover, so the next article about Parallel Programming might be about Nvidia's CUDA API!

See ya!

Keep on drifting!