
Max Out Your CPU with OpenMP

What is multithreading? See the Wikipedia definition here: https://en.wikipedia.org/wiki/Multithreading_(computer_architecture). Simply put, multithreading is the ability of an application or a process to run many tasks in parallel on the CPU (the processor, usually a multi-core processor). Each of those tasks is called a thread.

The main benefit is obviously the gain in speed when executing a process, along with the joy that comes from making good use of your machine's CPU resources! (After all, you or your company paid for it!)

In this article, I will demonstrate how easy it is to multi-thread a program with the OpenMP API (https://www.openmp.org/).

1 – Hello World

Consider the following code:

#include <stdio.h>
int main()
{
  printf("Hello World\n");
  return 0;
}

Compile and execute:

gcc -o helloworld helloworld.c
./helloworld

The mono-threaded program prints a single "Hello World".

Now let's slightly modify our program by adding OpenMP directives:

#include <stdio.h>
#include <omp.h>

int main()
{
  #pragma omp parallel
  printf("Hello World\n");
  return 0;
}

Compile & execute:

gcc -o helloworld helloworld.c -fopenmp
./helloworld

I now get 12 "Hello World" lines printed on the screen, because my ThinkPad laptop has a total of 12 cores. The parallel region is the part of the code we want to execute in parallel. To define such a region, we write the directive #pragma omp parallel just before the block to run in parallel.

We can specify the number of threads to use. For example, to use 4 threads, the directive is:
#pragma omp parallel num_threads(4)
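
As a minimal sketch (my own variation on the hello-world above), here is a parallel region limited to 4 threads, where each thread also reports its id:

#include <stdio.h>
#include <omp.h>

int main()
{
  // request 4 threads for this parallel region
  #pragma omp parallel num_threads(4)
  {
    // omp_get_thread_num(): id of the current thread (0 to 3 here)
    // omp_get_num_threads(): number of threads in the region
    printf("Hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}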

2 – Barriers

Some threads are faster than others, and there are situations where we need all the tasks of a process to reach the same point. Let's write a multi-threaded program that displays three colors, GREEN, ORANGE and RED, in this order: we want all the threads to print the GREEN color before they all print the ORANGE color, and finally the RED color.

#include <stdio.h>
#include <omp.h>

int main()
{
  #pragma omp parallel num_threads(4)
  {
    printf("Thread %d: \033[32mX\033[0m\n",omp_get_thread_num());
    printf("Thread %d: \033[33mX\033[0m\n",omp_get_thread_num());
    printf("Thread %d: \033[31mX\033[0m\n",omp_get_thread_num());
  }
  return 0;
}

The output shows that in our code, the threads are not waiting for each other.
Let's change this behavior by adding the directive #pragma omp barrier at the end of each step:

#include <stdio.h>
#include <omp.h>

int main()
{
  #pragma omp parallel num_threads(4)
  {
    printf("Thread %d: \033[32mX\033[0m\n",omp_get_thread_num());
    #pragma omp barrier
    printf("Thread %d: \033[33mX\033[0m\n",omp_get_thread_num());
    #pragma omp barrier
    printf("Thread %d: \033[31mX\033[0m\n",omp_get_thread_num());
    #pragma omp barrier
  }
  return 0;
}

3 – Single Directive

The single directive is useful to run part of a multi-threaded program only once, by only one thread. Suppose, for example, that we have a classroom of N students and we want to:
– store N random marks for the N students;
– compute and display the cumulative sum of the N marks stored.
With a program using 2 threads, each of the threads will perform both tasks.

#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

#define N 1000

void store_marks(int *tab,int *call)
{
   // fill the marks array with random numbers in the range 0 - 19
   for (int i=0;i<N;i++) tab[i]=(rand()%20);
   (*call)++;   // note: when called by several threads, this unprotected increment is itself a small data race
}

int main( )
{
  int tab[N];
  int call=0;
  long sum=0;
  srand(time(0));
  #pragma omp parallel num_threads(2)
  {
    store_marks(tab,&call);
    if (omp_get_thread_num()==1)
      {
        for (int i=0;i<N;i++) sum+=tab[i];
      }
   }
  printf("The sum is %ld and the store_marks function has been called %d times\n",sum,call);
  return 0;
}

The result shows that the N marks have been stored as many times as we have threads. To change this behavior, let's use the single directive of OpenMP (the single construct also has an implicit barrier at its end, so the other thread waits until the marks are stored before summing them):

#include <stdio.h>
...

int main( )
{
  ...
  #pragma omp parallel num_threads(2)
  {
    #pragma omp single
    store_marks(tab,&call);
    if (omp_get_thread_num()==1)
      {
        for (int i=0;i<N;i++) sum+=tab[i];
      }
  }
...
}

4 – Critical directive

To illustrate the critical directive, let's imagine a game with cars. We'll write a program where each car is represented by a thread and has a color attached to it. The program will mimic the movements of the cars: every car will execute 3 moves in a row.

#include <stdio.h> 
#include <omp.h>

// set the terminal text color according to the thread number
void colorcar(int t)
{
 switch(t)
      {
        case 0: printf("\033[0m");
                break;
        case 1: printf("\033[31m");
                break;
        case 2: printf("\033[32m");
                break;
        case 3: printf("\033[33m");
                break;
        case 4: printf("\033[34m");
                break;
        case 5: printf("\033[35m");
                break;
        case 6: printf("\033[36m");
                break;
        case 7: printf("\033[37m");
                break;
      }
}

int main()
{
  #pragma omp parallel num_threads(8)
  {
    #pragma omp critical
    {
      colorcar(omp_get_thread_num());
      printf("Car%2d --> move ... ",omp_get_thread_num());
      colorcar(omp_get_thread_num());
      printf("Car%2d --> move ... ",omp_get_thread_num());
      colorcar(omp_get_thread_num());
      printf("Car%2d --> move ... \n",omp_get_thread_num());
    }
  }
  printf("\n\033[0m\n");
  return 0;
}

The code runs as expected: each line is a thread representing a car, and there is no mixing: every car does its 3 moves uninterruptedly. The critical directive allows a block of a multi-threaded program to be executed by only one thread at a time, without interruption.

To understand what can go wrong, we need to remember that in the world of multi-threaded computing there is one king: the scheduler. This component of the operating system can stop a thread at any time and let another one take over, interrupt it, give priority to another one, etc. The critical directive allows us to "control" that behavior.
Now let’s see what happens if we remove the directive #pragma omp critical:

Here is the disorder! The threads no longer wait to perform all their 3 moves: they interrupt each other, and even the colors of the cars get mixed.

Check this link to visualize the differences better: https://asciinema.org/a/362969
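
To make the role of critical even more concrete, here is a minimal sketch of my own (a toy counter, not part of the car game): several threads increment a shared variable, and the critical section is what keeps the updates from stepping on each other.

#include <stdio.h>
#include <omp.h>

int main()
{
  long counter = 0;

  #pragma omp parallel num_threads(8)
  {
    for (int i = 0; i < 100000; i++)
    {
      // counter++ is a read-modify-write; without critical, threads can
      // interleave these steps and increments get lost
      #pragma omp critical
      counter++;
    }
  }

  // with the critical directive the result is always 8 * 100000 = 800000
  printf("counter = %ld\n", counter);
  return 0;
}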

5 – Section directive

The sections directive is a very simple way to assign tasks to threads. Note, however, that by default each thread that is done with its section waits for the others to finish; this can be changed with the nowait clause.
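
As a quick illustration (a sketch of my own, separate from the prime-counting example below), nowait is written as a clause on the sections construct:

#include <stdio.h>
#include <omp.h>

int main()
{
  #pragma omp parallel num_threads(2)
  {
    // nowait removes the implicit barrier at the end of the sections construct:
    // a thread that finishes its section can move on immediately
    #pragma omp sections nowait
    {
      #pragma omp section
      printf("Thread %d runs the first task\n", omp_get_thread_num());
      #pragma omp section
      printf("Thread %d runs the second task\n", omp_get_thread_num());
    }
    printf("Thread %d went past the sections\n", omp_get_thread_num());
  }
  return 0;
}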

Suppose, for example, that we want to count the number of prime numbers below 100000.
Let's intentionally use a slow algorithm and repeat the task 3 times. We'll then compare the time it takes between a multi-threaded program using the sections directive and a single-threaded program:

#include <stdio.h>
#include <omp.h>

#define N 100000

// intentionally slow trial-division prime count (p2 and p3 below are identical copies)
int p1()
{
  int count=0,prime=1;
  for (int i=1;i<N;i++)
  {
    for (int j=2;j<i;j++)
    {
      if (!(i%j)) prime=0;
    }
    if (prime) count+=1;
    else prime=1;
  }
  printf("\n");
  return count;  
}

int p2()
{
  int count=0,prime=1;
  
  for (int i=1;i<N;i++)
  {
    for (int j=2;j<i;j++)
    {
      if (!(i%j)) prime=0;
    }
    if (prime) count+=1;
    else prime=1;
  }
  printf("\n");
  return count;  
}

int p3()
{
  int count=0,prime=1;
  for (int i=1;i<N;i++)
  {
    for (int j=2;j<i;j++)
    {
      if (!(i%j)) prime=0;
    }
    
    if (prime) count+=1;
    else prime=1; 
  }
  printf("\n");
  return count;
}


int main()
{
      printf("\nCount by Thread %d = %d\n",omp_get_thread_num(),p1());
      printf("\nCount by Thread %d = %d\n",omp_get_thread_num(),p2());
      printf("\nCount by Thread %d = %d\n",omp_get_thread_num(),p3());
  return 0;

}

The same code, modified with OpenMP directives:

#include <stdio.h>
#include <omp.h>
#define N 100000

int p1()
{
  int count=0,prime=1;
  for (int i=1;i<N;i++)
  {
   for (int j=2;j<i;j++)
   {
    if (!(i%j)) prime=0;
   }
   if (prime) count+=1;
   else prime=1;
   }
 printf("\n");
 return count;
}

int p2()
{
 int count=0,prime=1;
 for (int i=1;i<N;i++)
 {
  for (int j=2;j<i;j++)
  {
   if (!(i%j)) prime=0;
  }
  if (prime) count+=1;
  else prime=1;
 }
 printf("\n");
 return count;
}

int p3()
{
 int count=0,prime=1;
 for (int i=1;i<N;i++)
 {
  for (int j=2;j<i;j++)
  {
   if (!(i%j)) prime=0;
  }
  if (prime) count+=1; else prime=1;
 }
 printf("\n");
 return count;
}

int main()
{
 #pragma omp parallel
 {
  #pragma omp sections
  {
    #pragma omp section
    {
      printf("\nCount by Thread %d =   %d\n",omp_get_thread_num(),p1());
    }
    #pragma omp section
    {
      printf("\nCount by Thread %d = %d\n",omp_get_thread_num(),p2());
    }
    #pragma omp section
    {
      printf("\nCount by Thread %d = %d\n",omp_get_thread_num(),p3());
    } 
   }
  }
return 0;
}

We record a clear difference: the multi-threaded process runs more than twice as fast (it is on the left in the recording below):

Visualize it here ==> https://asciinema.org/a/362985
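
If you want to measure the difference yourself instead of eyeballing the terminal, OpenMP provides omp_get_wtime(), which returns a wall-clock time in seconds. A minimal sketch of the pattern (with the actual work elided) could look like this:

#include <stdio.h>
#include <omp.h>

int main()
{
  double start = omp_get_wtime();

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { /* first task, e.g. p1() */ }
      #pragma omp section
      { /* second task, e.g. p2() */ }
    }
  }

  // elapsed wall-clock time for the whole parallel region
  printf("Elapsed: %f seconds\n", omp_get_wtime() - start);
  return 0;
}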

6 – Loops

OpenMP gives a very practical way to handle loops with the directive #pragma omp for.
For N iterations and x threads, OpenMP "shares" the loop among the x threads: each thread executes a chunk of roughly N/x iterations, as the small sketch below illustrates.
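
Here is a minimal sketch (a toy example of mine) that makes the sharing visible: each iteration prints which thread executed it.

#include <stdio.h>
#include <omp.h>

#define N 12

int main()
{
  // the N iterations are split among the threads of the parallel region;
  // each thread executes its own chunk of roughly N/x iterations
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    printf("Iteration %2d handled by thread %d\n", i, omp_get_thread_num());
  return 0;
}
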
As a bigger example, let's fill the screen with a pattern of characters by using printf to draw a line a certain number of times.
Code 1 (single-threaded):

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main()
{
 int i,j;
 for (j=0;j<10;j++)
 {
  for (i=0;i<50;i++){fflush(stdout);printf("\033[31mo\033[0m="); usleep(60000);}
 }
 printf("\033[0m]");
 return 0;
}

Code 2 (multi-threaded):

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main()
{
 int i,j;
 // the parallel for directive must be immediately followed by the loop nest
 #pragma omp parallel for collapse(2)
 for (j=0;j<10;j++)
 {
  for (i=0;i<50;i++){fflush(stdout);printf("\033[32m>\033[33m+");usleep(60000);}
 }
 printf("\033[0m]");
 return 0;
}

The collapse clause tells OpenMP to merge the nested loops into a single iteration space and share it among the threads; in this case we have 2 nested loops. Beware, however, that the loops must be perfectly nested: no line of code is allowed between them.

The difference in performance between the two programs is evident: https://asciinema.org/a/362992.

Conclusion

This was quite a long post, as there are a lot of options with OpenMP and with multi-threading in general. I've introduced a few of them; more tutorials are available on the official website: https://www.openmp.org/resources/tutorials-articles/.
If, like me, you are fascinated by getting all the moving parts of your CPU working, then you should dive deep into OpenMP!
