Memory Spaces

There are two special memory spaces exposed in OKL through the qualifiers

  • shared
  • exclusive

Shared Memory

The concept of shared memory is taken from the GPU programming model, where parallel threads/workitems can share data. Adding the shared qualifier to a variable declaration allows the data to be shared across inner loop iterations.

Note

Shared memory array sizes must be known at kernel compile time. Because OCCA compiles kernels just-in-time, kernel compile time may still fall within the application’s runtime.

As an example, we could do a local reduction

for (...; outer) {
  shared int values[4];
  // values = [?, ?, ?, ?]
  for (int i = 0; i < 4; ++i; inner) {
    values[i] = i;
  }
  // values = [0, 1, 2, 3]
  for (int i = 0; i < 4; ++i; inner) {
    if (i < 2) {
      values[i] += values[i + 2];
    }
  }
  // values = [2, 4, 2, 3]
  for (int i = 0; i < 4; ++i; inner) {
    if (i < 1) {
      values[i] += values[i + 1];
    }
  }
  // values = [6, 4, 2, 3] where 6 = (0 + 1 + 2 + 3)
}
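
To check the intermediate values by hand, the reduction can be mimicked in plain C++. This is only a sketch of the semantics, not how OCCA expands the kernel. The function name sharedReduction is hypothetical, the shared array becomes an ordinary local array, and each inner loop runs serially to completion before the next one starts, which stands in for the synchronization between inner loops.

```cpp
#include <array>
#include <cassert>

// Hypothetical serial emulation of one outer iteration of the example.
// 'values' plays the role of the shared array: a single copy that every
// inner-loop iteration can read and write.
std::array<int, 4> sharedReduction() {
  std::array<int, 4> values{};  // zero-initialized here; uninitialized in OKL

  // First inner loop: each iteration stores its own index.
  for (int i = 0; i < 4; ++i) {
    values[i] = i;
  }
  // values = [0, 1, 2, 3]

  // Second inner loop: the lower half accumulates the upper half.
  for (int i = 0; i < 4; ++i) {
    if (i < 2) {
      values[i] += values[i + 2];
    }
  }
  // values = [2, 4, 2, 3]

  // Third inner loop: only iteration 0 combines the two partial sums.
  for (int i = 0; i < 4; ++i) {
    if (i < 1) {
      values[i] += values[i + 1];
    }
  }
  // values = [6, 4, 2, 3]

  assert(values[0] == 0 + 1 + 2 + 3);  // sanity check on the final sum
  return values;
}
```

Each step halves the number of active iterations, so a reduction over 4 entries needs 2 combining steps after the initial fill.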

With performance in mind, the ability to share data could allow for

  • Parallel fetching of data
  • Reuse of fetched data
  • Faster memory access due to CPU cache and GPU on-chip memory
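
As an illustration of fetching once and reusing the result, here is a minimal C++ sketch of a tiled 3-point sum. Everything in it is an assumption made for the example: the names threePointSum and TILE are hypothetical, the local array tile stands in for a shared array, and the two inner loops run serially rather than in parallel.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE = 4;  // fixed at compile time, like a shared array size

// Hypothetical 3-point sum: out[g] = in[g-1] + in[g] + in[g+1],
// with out-of-range neighbors treated as 0.
std::vector<int> threePointSum(const std::vector<int>& in) {
  const int n = static_cast<int>(in.size());
  assert(n % TILE == 0);  // sketch assumes whole tiles only
  std::vector<int> out(n, 0);

  for (int base = 0; base < n; base += TILE) {  // stands in for the outer loop
    int tile[TILE + 2];                         // stands in for a shared array

    // Inner loop 1: every iteration fetches one element; the two edge
    // iterations also fetch the halo values next to the tile.
    for (int i = 0; i < TILE; ++i) {
      tile[i + 1] = in[base + i];
      if (i == 0) {
        tile[0] = (base > 0) ? in[base - 1] : 0;
      }
      if (i == TILE - 1) {
        tile[TILE + 1] = (base + TILE < n) ? in[base + TILE] : 0;
      }
    }

    // Inner loop 2: each fetched value is reused by up to three iterations
    // without going back to 'in'.
    for (int i = 0; i < TILE; ++i) {
      out[base + i] = tile[i] + tile[i + 1] + tile[i + 2];
    }
  }
  return out;
}
```

Splitting the work into a fetch loop and a compute loop mirrors the pattern of closing one inner loop before opening the next, so all loads into the tile are complete before any iteration reads a neighbor's entry.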

Exclusive Memory

Having the option of opening and closing inner loops to synchronize work at the inner loop level raises the question

What happens if we want data to persist across inner loops?

Here’s an example depicting this problem

int id;
for (int i = 0; i < 4; ++i; inner) {
  id = i;
}
for (int i = 0; i < 4; ++i; inner) {
  // id = ???
}

Since the order of execution is unknown at compile time, we cannot assert the value of id in the second inner loop. We introduce the exclusive qualifier to explicitly declare a variable that is unique to each inner loop iteration.

An analogy that might help understand how exclusive works is the use of thread-local storage. When you define a variable to live in thread-local storage, each thread in the application has its own unique instance of it

// Each thread has its own independent variable 'x'
thread_local int x;

Rather than each thread having its own instance of x in the above example, each loop iteration has its own version of an exclusive variable.

exclusive int id;
for (int i = 0; i < 4; ++i; inner) {
  id = i;
}
for (int i = 0; i < 4; ++i; inner) {
  // id = i (dependent on the loop iteration)
}
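
One way to picture the difference is to emulate both variants serially in C++. This is a sketch under stated assumptions rather than a description of what any OCCA backend generates: the names scalarIds, exclusiveIds, and kInner are hypothetical, and the exclusive variable is modeled as an array with one slot per inner loop iteration.

```cpp
#include <array>
#include <cassert>

constexpr int kInner = 4;  // number of inner-loop iterations in the example

// Plain 'int id': a single scalar that every iteration overwrites.
// Run serially, the second loop only ever sees the last value written.
std::array<int, kInner> scalarIds() {
  int id = -1;
  for (int i = 0; i < kInner; ++i) {
    id = i;
  }
  std::array<int, kInner> seen{};
  for (int i = 0; i < kInner; ++i) {
    seen[i] = id;  // always kInner - 1 in this serial emulation
  }
  return seen;
}

// 'exclusive int id' pictured as one slot per inner-loop iteration.
std::array<int, kInner> exclusiveIds() {
  std::array<int, kInner> id{};  // stands in for: exclusive int id;
  for (int i = 0; i < kInner; ++i) {
    id[i] = i;
  }
  std::array<int, kInner> seen{};
  for (int i = 0; i < kInner; ++i) {
    assert(id[i] == i);  // each iteration reads back its own value
    seen[i] = id[i];
  }
  return seen;
}
```

In the serial emulation the scalar version happens to yield the last written value everywhere; under parallel execution even that is not guaranteed, which is why the per-iteration copy is needed.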

In programming models where additional memory spaces exist, such as

  • CPU: RAM and on-chip memory (cache and registers)
  • GPU: global, shared, and register memory

prefetching data into the faster memory spaces can be viewed as an optimization strategy.