SHTns 3.7
SHTns can efficiently use a GPU to perform spherical harmonic transforms. Supported GPUs include the nvidia Kepler, Pascal, Volta and Ampere generations, as well as the AMD MI100 and MI200 series.
There are two ways to use GPUs with SHTns: automatic off-loading of large transforms to the GPU (transparent to the user), and on-device transforms operating directly on GPU memory (described further below).
In all cases, you must first configure SHTns to use cuda (for nvidia devices) or hip (for AMD devices):

    ./configure --enable-cuda

For AMD GPUs, replace --enable-cuda by --enable-hip above. Optionally, you can specify which GPU architecture you are targeting, with

    ./configure --enable-cuda=ampere

for instance (run ./configure --help to list the accepted values).
Auto-offloading is simple to use, but only benefits transforms large enough to amortize the cost of memory transfers between CPU and GPU. What "large enough" means depends on the details of your system, but gains can reasonably be expected for Lmax > 1000. To enable automatic off-loading (if possible and if faster), simply add SHT_ALLOW_GPU to the shtns_type argument of shtns_set_grid_auto.
If no GPU is found, the transforms will be performed on the CPU instead.
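For illustration, here is a minimal sketch of such a setup; apart from the SHT_ALLOW_GPU flag, the resolution, grid type and accuracy parameters are arbitrary example values:

    #include <complex.h>
    #include <shtns.h>

    int main() {
        const int lmax = 1023, mmax = 1023, mres = 1;  /* large enough to benefit from a GPU */
        int nlat = 0, nphi = 0;                        /* 0 lets shtns choose the grid sizes */

        shtns_cfg sh = shtns_create(lmax, mmax, mres, sht_orthonormal);
        /* OR the SHT_ALLOW_GPU flag into the shtns_type argument: transforms are
           off-loaded to the GPU when one is found and when it is expected to be faster. */
        shtns_set_grid_auto(sh, sht_gauss | SHT_ALLOW_GPU, 1.e-10, 2, &nlat, &nphi);
        /* ... call the regular transform functions as usual ... */
        shtns_destroy(sh);
        return 0;
    }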
For best performance, you must also allocate memory for your spatial and spectral arrays using shtns_malloc and subsequently free it with shtns_free. This way, so-called "pinned" memory is used when cuda is enabled, allowing faster data transfer between host and device.
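Continuing the sketch above, the arrays could then be allocated as follows; sh->nlm and NSPAT_ALLOC(sh) are assumed here as the usual spectral and spatial sizes of the SHTns C API:

    /* spectral coefficients (complex) and spatial field (real), in pinned memory */
    cplx   *Qlm = (cplx*)   shtns_malloc(sh->nlm * sizeof(cplx));
    double *Vr  = (double*) shtns_malloc(NSPAT_ALLOC(sh) * sizeof(double));

    SH_to_spat(sh, Qlm, Vr);     /* may be off-loaded to the GPU automatically */
    spat_to_SH(sh, Vr, Qlm);

    shtns_free(Vr);              /* memory from shtns_malloc must be released with shtns_free */
    shtns_free(Qlm);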
If you are not calling transform functions from multiple threads, you have nothing more to do.
The function cushtns_clone creates a new identical configuration, but with its own GPU resources (temporary buffers, streams, ...). The user can give custom cuda streams or let shtns create its own (by passing 0 for each cuda stream).
The resulting shtns config can be used concurrently with the original one. Repeat the cloning for each concurrent thread.
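As an illustration, a clone per OpenMP thread could be created as sketched below; the exact prototype of cushtns_clone (here assumed to take two cuda streams, with 0 requesting default ones) should be checked in shtns_cuda.h:

    #include <complex.h>
    #include <shtns.h>
    #include <shtns_cuda.h>
    #include <omp.h>

    void transform_fields(shtns_cfg sh, cplx **Qlm, double **Vr, int nfields)
    {
        #pragma omp parallel
        {
            /* thread 0 keeps the original config, the other threads get clones;
               passing 0 for the streams lets shtns create its own (prototype assumed). */
            shtns_cfg my_sh = (omp_get_thread_num() == 0) ? sh : cushtns_clone(sh, 0, 0);

            #pragma omp for
            for (int i = 0; i < nfields; i++)
                SH_to_spat(my_sh, Qlm[i], Vr[i]);

            if (my_sh != sh) shtns_destroy(my_sh);   /* release the clone's resources */
        }
    }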
On-device transforms are declared in shtns_cuda.h (for both nvidia and AMD GPUs). The regular transform functions have been prefixed by cu_ to indicate that they work on device memory. See GPU transforms.
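For instance, the scalar synthesis SH_to_spat has an on-device counterpart cu_SH_to_spat; the sketch below assumes it takes the same arguments as SH_to_spat but with device pointers, and that the config has already been prepared with cushtns_init_gpu as described further below:

    #include <complex.h>
    #include <cuda_runtime.h>
    #include <shtns.h>
    #include <shtns_cuda.h>

    void synth_on_device(shtns_cfg sh, size_t nlm, size_t nspat)
    {
        cplx *d_Qlm;  double *d_Vr;                      /* device memory */
        cudaMalloc((void**)&d_Qlm, nlm * sizeof(cplx));
        cudaMalloc((void**)&d_Vr,  nspat * sizeof(double));

        /* ... fill d_Qlm on the device, or upload it with cudaMemcpy ... */

        cu_SH_to_spat(sh, d_Qlm, d_Vr);    /* same arguments as SH_to_spat, but device pointers */
        cudaDeviceSynchronize();           /* make sure the transform has completed */

        cudaFree(d_Vr);  cudaFree(d_Qlm);
    }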
These on-device transforms are usually much faster than CPU transforms (typically one GPU performs like 100 CPU cores), provided enough parallelism can be exposed. For a single transform this means Lmax > 200, but if many transforms can be grouped together (see shtns_set_batch), Lmax > 63 is enough.
It is also possible to use cuda unified memory allocated by cudaMallocManaged(), but this has not been tested extensively.
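If you want to try it anyway, a hypothetical helper such as the one below allocates a spatial array in managed memory (cudaMallocManaged is the standard CUDA runtime call; nothing SHTns-specific is involved):

    /* hypothetical helper: allocate a spatial array in cuda unified memory, so the same
       pointer can be passed to cu_ transforms and read from the host after a sync. */
    double* alloc_spatial_managed(size_t nspat)
    {
        double *Vr = NULL;
        cudaMallocManaged((void**)&Vr, nspat * sizeof(double), cudaMemAttachGlobal);
        return Vr;    /* release with cudaFree() */
    }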
As for the automatic off-loading, the GPU transforms are NOT thread-safe, and the same shtns config should never be used from simultaneous CPU threads. To help the user create clones of the same shtns config, cloning functions for GPU transforms have been added.
Once a shtns config has been set up normally by calling shtns_create followed by shtns_set_grid or shtns_set_grid_auto, one simply calls cushtns_init_gpu to prepare the GPU for on-device transforms.
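Putting the steps together, the preparation could look like the sketch below; the return convention of cushtns_init_gpu (assumed here to be negative on failure) should be checked against its documentation:

    #include <complex.h>
    #include <shtns.h>
    #include <shtns_cuda.h>

    shtns_cfg setup_gpu_config(int lmax)
    {
        int nlat = 0, nphi = 0;
        shtns_cfg sh = shtns_create(lmax, lmax, 1, sht_orthonormal);
        shtns_set_grid_auto(sh, sht_gauss, 1.e-10, 2, &nlat, &nphi);

        /* prepare the GPU so that the cu_* on-device transforms can be used */
        if (cushtns_init_gpu(sh) < 0) {
            /* no usable GPU: only the CPU transforms are available */
        }
        return sh;
    }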
Performance of scalar transforms on GPU (nvidia V100) and CPU (intel SKL) is given in the following plot (excluding memory transfers). The benefit of batching is evident. See also the conference poster Efficient spherical harmonic transforms on GPU.