SHTns 3.4.6
Using SHTns with GPU (CUDA)

SHTns supports NVIDIA Kepler and Pascal GPUs, using optimized CUDA transforms.

There are two ways to use GPUs with SHTns: automatic off-loading, where the GPU is used transparently (not much to change in your code) and SHTns handles all data transfers between GPU and CPU memory; and on-device transforms, provided for data already residing in GPU memory.

In all cases, you must first configure SHTns to use CUDA:

./configure --enable-cuda
make install

Automatic off-loading

To enable automatic off-loading (if possible and if faster), simply add SHT_ALLOW_GPU to the shtns_type argument of shtns_set_grid_auto.

sht_mode |= SHT_ALLOW_GPU;
shtns_set_grid_auto(shtns, sht_mode, polaropt, nlorder, &NLAT, &NPHI);
With SHT_ALLOW_GPU set, the transforms are no longer guaranteed to be thread-safe, because the GPU transforms are not. Make sure to use different shtns configs (or plans) if you want to call transform functions from multiple threads. See also Cloning shtns configs below.
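As a sketch, enabling automatic off-loading looks like the following (the truncation degree, polar optimization threshold and non-linear order are illustrative values, not recommendations):

```c
#include <shtns.h>

int main(void) {
    const int lmax = 255, mmax = 255, mres = 1;
    int nlat = 0, nphi = 0;   /* 0 lets shtns_set_grid_auto pick optimal sizes */

    shtns_cfg shtns = shtns_create(lmax, mmax, mres, sht_orthonormal);

    /* request a Gauss grid and allow transparent GPU off-loading */
    enum shtns_type sht_mode = sht_gauss | SHT_ALLOW_GPU;
    shtns_set_grid_auto(shtns, sht_mode, 1.e-10, 2, &nlat, &nphi);

    /* ... call transforms as usual: SHTns off-loads to the GPU when beneficial ... */

    shtns_destroy(shtns);
    return 0;
}
```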

For best performance, you must also allocate memory for your spatial and spectral arrays using shtns_malloc and subsequently free it with shtns_free. When CUDA is enabled, these functions use so-called "pinned" memory, which allows faster data transfers between host and device.
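For example, a scalar synthesis with pinned host buffers might look like this (sizes taken from the config through the nlm field and the NSPAT_ALLOC macro):

```c
/* assuming an initialized shtns_cfg 'shtns' (see above) */
cplx   *Slm = (cplx*)   shtns_malloc(shtns->nlm * sizeof(cplx));          /* spectral coefficients */
double *Sr  = (double*) shtns_malloc(NSPAT_ALLOC(shtns) * sizeof(double)); /* spatial field */

SH_to_spat(shtns, Slm, Sr);   /* may be transparently off-loaded to the GPU */

shtns_free(Sr);
shtns_free(Slm);
```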

If you are not calling transform functions from multiple threads, you have nothing more to do.

Here is a table giving an idea of the performance, comparing Intel Broadwell, Intel KNL and NVIDIA Pascal:

machine                               SH_to_spat   spat_to_SH
Broadwell, 20 cores                   7.80 ms      8.41 ms
KNL 7250, 68 cores                    3.65 ms      3.61 ms
Pascal P100 GPU (including transfer)  3.46 ms      4.24 ms

Average wall time for forward or backward scalar transform on various machines. The transforms were performed using time_SHT with lmax=1023 and -nlorder=2 (dealiasing for non-linear terms). Note that the GPU timings include the time to transfer data (auto-offload mode) from and to the IBM Power8 host through NVLink. All timings include the FFT.

On-device transforms

On-device transforms are declared in shtns_cuda.h. Basically, the regular transform functions have been prefixed by cu_ to indicate that they operate on device memory. See GPU transforms. In principle, it is also possible to use CUDA unified memory allocated by cudaMallocManaged(), but this has not been tested.
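A minimal sketch of an on-device scalar synthesis, assuming the cu_ variants mirror the CPU signatures (as the prefix rule above suggests) and that d_Qlm and d_Vr are device pointers allocated with cudaMalloc:

```c
#include <shtns_cuda.h>
#include <cuda_runtime.h>

/* assuming an initialized, GPU-enabled shtns_cfg 'shtns' */
void synthesis_on_device(shtns_cfg shtns, cplx *d_Qlm, double *d_Vr)
{
    /* d_Qlm and d_Vr live in device memory: no host<->device copy is performed */
    cu_SH_to_spat(shtns, d_Qlm, d_Vr);   /* scalar synthesis on the GPU */
    cudaDeviceSynchronize();             /* wait for completion before using d_Vr */
}
```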

As with automatic off-loading, the GPU transforms are NOT thread-safe, and the same shtns config should never be used from simultaneous threads. To help the user create clones of the same shtns config, cloning functions for GPU transforms have been added.

Cloning shtns configs

The function cushtns_clone creates a new, identical configuration with its own GPU resources (temporary buffers, streams, ...). The user can supply custom CUDA streams or let SHTns create its own (by passing 0 for each CUDA stream).

The resulting shtns config can be used concurrently with the original one. Repeat the cloning for each concurrent thread.
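A sketch of per-thread cloning; the exact cushtns_clone argument list shown here (two CUDA stream arguments, 0 meaning "let SHTns create its own") is an assumption based on the description above:

```c
#include <shtns_cuda.h>

/* assuming 'shtns' is a GPU-enabled config created as shown earlier */
shtns_cfg clone_for_thread(shtns_cfg shtns)
{
    /* passing 0 for each stream lets SHTns allocate its own CUDA streams */
    return cushtns_clone(shtns, 0, 0);
}
```

Each thread should then call transform functions only on its own clone, never on a config shared with another thread.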