SHTns
3.4.6
SHTns supports NVIDIA Kepler and Pascal GPUs, using optimized CUDA transforms.
There are two ways to use GPUs with SHTns. With automatic off-loading, the GPU is used transparently: SHTns handles all the data transfers between GPU and CPU memory, so there is not much to change in your code. In addition, routines are provided for transforming data that already resides on the GPU.
In all cases, SHTns must first be configured and built with CUDA support.
To enable automatic off-loading (used only if possible and if faster), simply add SHT_ALLOW_GPU to the shtns_type argument of shtns_set_grid_auto.
For best performance, you should also allocate memory for your spatial and spectral arrays using shtns_malloc and subsequently free it with shtns_free. This way, so-called "pinned" memory is used when CUDA is enabled, allowing faster data transfer between host and device.
If you are not calling transform functions from multiple threads, you have nothing more to do.
The following table gives an idea of the performance, comparing Intel Broadwell, Intel KNL, and NVIDIA Pascal:
machine | SH_to_spat | spat_to_SH |
Broadwell, 20 cores | 7.80 ms | 8.41 ms |
KNL 7250, 68 cores | 3.65 ms | 3.61 ms |
Pascal P100 GPU (including transfer) | 3.46 ms | 4.24 ms |
Timings were obtained with time_SHT with lmax=1023 and -nlorder=2 (dealiasing for non-linear terms). Note that the GPU timings include the time to transfer data (auto-offload mode) from and to the IBM Power8 host through NVLink. All timings include the FFT.
On-device transforms are declared in shtns_cuda.h. Basically, the regular transform functions have been prefixed by cu_ to indicate that they work on device memory. See GPU transforms. In principle, it is also possible to use CUDA unified memory allocated by cudaMallocManaged(), but this has not been tested.
As with automatic off-loading, the GPU transforms are NOT thread-safe, and the same shtns config should never be used from simultaneous threads. To help the user create clones of the same shtns config, cloning functions for GPU transforms have been added.
The function cushtns_clone creates a new identical configuration, but with new GPU resources (temporary buffers, streams, ...). The user can give custom CUDA streams or let shtns set its own (by passing 0 for each CUDA stream).
The resulting shtns config can be used concurrently with the original one. Repeat the cloning for each concurrent thread.