CUDA tricks

Takeaways from a few months of CUDA C++

Getting into CUDA, I found that the general architecture is well explained, but did not find much in the form of best practices for CUDA C++. Below are some tips and tricks that have been of use to me.

Feedback most welcome!

0. Check for kernel errors

You might expect errors during kernel execution to crash your program. They don't. Use some error checking wrapper as described here.

1. Use --expt-relaxed-constexpr (and secretly use some STL)

The --expt-relaxed-constexpr flag for nvcc allows the usage of unannotated constexpr functions in device code. Many functions in the standard library are constexpr, and can thus be used relatively safely from device code. You don't have to write your own std::tuple or std::bit_width. The STL is officially unsupported in device code, so formally you have no guarantee that this works. But it does.

2. Use Thrust (for tests)

Thrust comes with the CUDA toolkit and can be very useful for writing some quick tests.

3. Use --extended-lambda

The extended lambdas extension allows you to create device code lambdas. This can for example be handy in combination with the Thrust map, sort, and reduce utilities.

4. __global__ member functions

Member functions cannot be global. This can be annoying if you have a class and want to write a kernel that operates on (a copy of) it. You could write a __global__ function outside the class that takes (a copy of) the class as an argument foo, and then write foo.member every time you have to access a member—but this gets old fast.

An alternative is to do something like:

class C;
__global__ call_f(C);

class C {
    __device__ f_kernel() { ... }

    void f() { call_f<<<...>>>(*this); }

__global__ call_f(C c) { c.f_kernel(); }

...which is not pretty, especially if you need many of these kernels. Moreover, if C were to depend on template arguments, call_f would have to depend on them as well—or you give up on the type system and use template <class T> call_f(T c) instead.

My solution: use --expt-relaxed-constexpr and std::invoke (constexpr!) to define the following:

template <auto F, class... Args>
__global__ void invoke_device(Args... args) {
    std::invoke(F, args...);

The code above can then be rewritten as:

class C {
    __device__ f_kernel() { ... }

    void f() { invoke_device<&C::f_kernel><<<...>>>(*this); }

Much better!

The same code can be used for classes that depend on template arguments (no need to restate them in the template argument to invoke_device), and the parameter pack allows invoke_device to be used for __device__ functions with arguments as well.

Bonus: class-of-pointers and std::shared_ptr

A pattern that I have found to be useful is to have a relatively lightweight class containing pointers to device (or managed) memory, which is then passed to kernels by value.

It can be useful for this class to have a destructor deallocating the pointed-to device memory. However, as it is passed to kernels by value, this destructor will be called after each kernel call. A trick I took from BGHT, is to use std::shared_ptr with cudaFree as a custom deleter instead:

class C {
    std::shared_ptr<Foo> foo;

    C() {
        Foo *_foo;
        cudaMalloc(&_foo, ...);
        foo = std::shared_ptr(_foo, cudaFree);