Loop-based Parallelism

OKL is based on the parallel loop structures found in OpenMP, combined with the block/group decomposition of work found in CUDA and OpenCL.

For example, the most basic parallel loop in OpenMP looks similar to

#pragma omp parallel for
for (int i = 0; i < N; ++i) {
  // Parallel work
}

Basic CUDA code (with OpenCL being analogous) would look similar to

__global__ void loop(const int N, ...) {
  const int id = threadIdx.x + (blockIdx.x * blockDim.x);
  if (id < N) {
    // Parallel work
  }
}

loop<<<gridSize, blockSize>>>(N, ...);

The analogous OKL code for both cases would look like

@kernel void loop(const int N, ...) {
  for (int group = 0; group < N; group += blockSize; outer) {
    for (int id = group; id < (group + blockSize); ++id; inner) {
      if(id < N) {
        // Parallel work
      }
    }
  }
}

Two things to note in the above code snippet:

  1. The use of @kernel to state that the function is a kernel
  2. The for-loops contain a 4th statement with the outer and inner labels

We use this notation to expose both loop-parallelism and two levels of parallel granularity (outer/inner, analogous to block/thread in CUDA or workgroup/workitem in OpenCL). We’ll start by describing outer loops (for-loops with the outer label) and inner loops (for-loops with the inner label).
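
Before describing each loop level in detail, here is a minimal sketch of what a complete kernel in this style might look like: a hypothetical addVectors kernel that sums two arrays, with blockSize standing in for a block size defined elsewhere (e.g., through a #define).

@kernel void addVectors(const int N,
                        const float *a,
                        const float *b,
                        float *ab) {
  // blockSize is a placeholder for a block size defined elsewhere
  for (int group = 0; group < N; group += blockSize; outer) {
    for (int id = group; id < (group + blockSize); ++id; inner) {
      if (id < N) {
        ab[id] = a[id] + b[id];  // Guard against the last, partially filled block
      }
    }
  }
}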

Outer Loops

Outer loops should contain work that is completely parallel and independent of iteration order. For example

for (int i = 0; i < 5; ++i; outer) {
  work(i);
}

could be executed in the expected order

work(0);
work(1);
work(2);
work(3);
work(4);

or in a totally random order

work(0);
work(1);
work(3);
work(4);
work(2);

Additionally, since the iterations can run in parallel, the work(...) calls may start and finish in an unpredictable order due to latency and throughput differences in memory accesses and workloads. However, this does not mean that work(0), work(1), ..., work(i) must actually execute in parallel; the iteration order and degree of parallelism are simply unspecified.

The second level of parallelism is introduced by inner loops, which also parallelize the work inside each work(i), as sketched below.
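
As a rough sketch, splitting a hypothetical work(i) into M independent sub-tasks subWork(i, j) could look like

for (int i = 0; i < N; ++i; outer) {
  // Each outer iteration's work(i) is itself spread across inner iterations
  for (int j = 0; j < M; ++j; inner) {
    subWork(i, j);
  }
}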

Inner Loops

As with outer loops, the order of inner-loop iterations is not guaranteed. The one big difference between outer and inner loops, however, is the ability to synchronize between consecutive inner loops.

For example

for (...; outer) {
  for (int i = 0; i < 5; ++i; inner) {
    work(i);
  }
  // Wait until work(0), work(1), ..., work(4) are finished executing
  for (int i = 0; i < 5; ++i; inner) {
    // More work
  }
}

The importance of this difference might not be clear until we expand on different memory spaces inside OKL.
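
Still, as a rough sketch of why this matters, consider a two-stage pattern that relies on the synchronization between the two inner loops. Here partial, out, work, and moreWork are hypothetical placeholders; the second stage may read any first-stage result produced within the same outer iteration.

for (int group = 0; group < N; group += 16; outer) {
  for (int i = group; i < (group + 16); ++i; inner) {
    if (i < N) {
      partial[i] = work(i);           // First stage
    }
  }
  // All first-stage results for this group are complete before the loop below starts
  for (int i = group; i < (group + 16); ++i; inner) {
    if (i < N) {
      out[i] = moreWork(partial, i);  // Second stage reads first-stage results
    }
  }
}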

Outer and Inner Loops

The question now is

Why have two levels of parallel granularity?

We would like to expose the many levels of parallelism in current many-core and multi-core architectures such as

  • CPU: Cores/threads and vector instructions
  • GPU: Multiprocessors, warps, and execution units

Outer loops represent parallelism at a higher level than inner loops

  • CPU: Threads
  • GPU: Multiprocessors

while inner loops act at a lower level

  • CPU: Vector instructions
  • GPU: Execution units

Knowing the distinct parallel levels helps when tuning the number of iterations in outer and inner loops. A loop such as

for (int o = 0; o < N; ++o; outer) {
  for (int i = o; i < (o + 1); ++i; inner) {
  }
}

achieves the same work as

for (int o = 0; o < N; o += 16; outer) {
  for (int i = o; i < (o + 16); ++i; inner) {
  }
}

However, the latter makes better use of lower-level parallelism than the former, potentially decreasing computation time.
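
Note that the 16-wide version assumes N is a multiple of 16; if it is not, the extra inner iterations can be masked off with the same guard used in the earlier kernel snippet:

for (int o = 0; o < N; o += 16; outer) {
  for (int i = o; i < (o + 16); ++i; inner) {
    if (i < N) {
      // Work only on valid indices; the remaining inner iterations simply idle
    }
  }
}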

Picking a balance between outer and inner loop sizes depends heavily on the device architecture. A good rule of thumb is to use a power of 2 for the inner loop size to make better use of vectorization.
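
One way to keep the inner loop size easy to tune is to pull it into a single compile-time constant. The sketch below assumes the usual C preprocessor is available in kernel sources; BLOCK_SIZE and scaleVector are hypothetical names.

#define BLOCK_SIZE 32  // A power of 2; tune per device architecture

@kernel void scaleVector(const int N, const float alpha, float *x) {
  for (int group = 0; group < N; group += BLOCK_SIZE; outer) {
    for (int id = group; id < (group + BLOCK_SIZE); ++id; inner) {
      if (id < N) {
        x[id] *= alpha;
      }
    }
  }
}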