Speed

SHTns does not implement any "fast" algorithm. Nevertheless, timings against other spherical harmonic transform tools (including one that implements a fast algorithm) show that SHTns is considerably faster than all of them. Even at large sizes, the fast algorithm we tested does not take the lead.
Since v3.2, SHTns implements the recurrence relation of Ishioka (2018), which speeds up transforms further, especially large ones.

$\ell_{max}$	shtools 2.8 (Gauss)	libpsht (1 thread)	SpharmonicKit2 2.7 (fast)	SHTns 2.1 (1 thread, Gauss)	SpharmonicKit2/SHTns
63	1.14 ms	1.05 ms	1.1 ms	0.09 ms	12.2
127	3.5 ms	4.7 ms	5.5 ms	0.60 ms	9.2
255	28 ms	27 ms	21 ms	4.2 ms	5.0
511	200 ms	162 ms	110 ms	28 ms	3.9
1023	1.8 s	850 ms	600 ms	216 ms	2.8
2047	13.0 s	4.4 s	NA (out of memory)	1.6 s	NA
4095	NA (seg fault)	30.5 s		11.8 s	NA

Average times for forward or backward scalar transform on an Intel Xeon X5650 (2.67GHz), with gcc 4.4.5 and "-O3 -march=native -ffast-math" compilation options.

Parallel speed

SHTns has had parallel algorithms since version 2.2. Compared to libpsht (also parallelized with OpenMP), SHTns is faster, especially at relatively small sizes.

$\ell_{max}$	libpsht 20110131	SHTns 2.2.1	libpsht/SHTns
63	5.0 ms	0.05 ms	100
127	5.4 ms	0.22 ms	24.5
255	8.5 ms	1.4 ms	6.1
511	23.5 ms	6.5 ms	3.6
1023	125 ms	43 ms	2.9
2047	700 ms	331 ms	2.1
4095	3.0 s	2.0 s	1.5

Average wall time for forward or backward scalar transform using 12 parallel threads on an Intel Xeon X5650 (2.67GHz), with gcc 4.4.5 and "-O3 -march=native -ffast-math -fopenmp" compilation options.
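The last column of the table above is simply the libpsht wall time divided by the SHTns wall time. As a small sketch of that arithmetic, the Python snippet below recomputes the ratios from the rounded timings printed in the table. The names `TIMINGS_MS` and `speed_ratios` are illustrative and not part of SHTns or libpsht.

```python
# Timings copied from the table above, in milliseconds:
# (lmax, libpsht 20110131, SHTns 2.2.1), 12 threads each.
TIMINGS_MS = [
    (63, 5.0, 0.05),
    (127, 5.4, 0.22),
    (255, 8.5, 1.4),
    (511, 23.5, 6.5),
    (1023, 125.0, 43.0),
    (2047, 700.0, 331.0),
    (4095, 3000.0, 2000.0),
]


def speed_ratios(timings):
    """Return (lmax, libpsht time / SHTns time) pairs, rounded to one decimal."""
    return [(lmax, round(t_psht / t_shtns, 1)) for lmax, t_psht, t_shtns in timings]


for lmax, ratio in speed_ratios(TIMINGS_MS):
    print(f"lmax={lmax:5d}  libpsht/SHTns = {ratio}")
```

Read this way, the ratio shrinks steadily as the truncation grows, which is what "faster, especially at relatively small sizes" refers to.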

Accuracy

We claim that the accuracy of SHTns is as good as it can be with double precision floating point math. Rescaling is performed for large transforms, where the recurrence relation would otherwise underflow double precision numbers. SHTns has been tested on x86 architecture with SSE2 double precision floating point math (64 bit) and found accurate up to at least $\ell = 16383$. The measured error for a back and forth scalar transform using a Gauss-Legendre algorithm for various truncation levels $\ell_{max}$ is plotted below.
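To see why rescaling matters, recall that the sectoral Legendre function $P_m^m(\cos\theta)$ is proportional to $\sin^m\theta$, which drops below the smallest double precision number near the poles when $m$ is large. The Python sketch below illustrates the general idea of carrying an explicit scale counter alongside the value. It is not SHTns's actual implementation: the factor `SCALE`, the threshold `TINY` and all function names are assumptions chosen for illustration.

```python
import math

SCALE = 2.0 ** 512   # illustrative rescaling factor (assumption, not SHTns's value)
TINY = 2.0 ** -512   # rescale whenever the running value drops below this


def sin_power_naive(theta, m):
    """sin(theta)**m by repeated multiplication; silently underflows to 0.0."""
    s = math.sin(theta)
    v = 1.0
    for _ in range(m):
        v *= s
    return v


def sin_power_rescaled(theta, m):
    """Same product, returned as (v, k) with true value = v * SCALE**(-k)."""
    s = math.sin(theta)
    v, k = 1.0, 0
    for _ in range(m):
        v *= s
        if abs(v) < TINY:
            v *= SCALE   # bring the value back into the normal range
            k += 1       # and remember how many times we did so
    return v, k


def log10_of(v, k):
    """log10 of the true value represented by the pair (v, k)."""
    return math.log10(v) - k * 512 * math.log10(2.0)


theta, m = 0.1, 16383
print(sin_power_naive(theta, m))          # underflows to 0.0
v, k = sin_power_rescaled(theta, m)
print(v, k, log10_of(v, k))               # non-zero value plus its scale count
```

With the naive product all information is lost once the value reaches zero, whereas the pair `(v, k)` keeps full relative precision and can be scaled back once the recurrence has grown the terms again.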

Don't take our word for it: the results reported here are obtained by running the time_SHT program shipped with SHTns. For example:

make time_SHT

./time_SHT 511 -iter=1 -quickinit
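The notion of a "back and forth" error can also be illustrated without SHTns. The Python sketch below is restricted to the zonal ($m = 0$) case: it synthesizes a function from random Legendre coefficients on Gauss-Legendre nodes, analyses it back by quadrature, and reports the largest coefficient error. It is a plain, unoptimized illustration using only the standard library. It does not reproduce what time_SHT does internally, and the function names are made up for this example.

```python
import math
import random


def legendre_upto(n, x):
    """Return [P_0(x), ..., P_n(x)] from the standard three-term recurrence."""
    p = [1.0, x]
    for l in range(1, n):
        p.append(((2 * l + 1) * x * p[l] - l * p[l - 1]) / (l + 1))
    return p[: n + 1]


def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(n):
        x = math.cos(math.pi * (i + 0.75) / (n + 0.5))  # classical initial guess
        for _ in range(50):  # Newton iteration on P_n(x) = 0
            p = legendre_upto(n, x)
            dp = n * (x * p[n] - p[n - 1]) / (x * x - 1.0)
            dx = p[n] / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        p = legendre_upto(n, x)
        dp = n * (x * p[n] - p[n - 1]) / (x * x - 1.0)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def roundtrip_error(lmax, seed=0):
    """Max coefficient error after synthesis followed by analysis (m = 0 only)."""
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(lmax + 1)]
    nodes, weights = gauss_legendre(lmax + 1)  # exact up to degree 2*lmax + 1
    basis = [legendre_upto(lmax, x) for x in nodes]
    # Synthesis: f(x_j) = sum_l c_l P_l(x_j)
    f = [sum(c * p for c, p in zip(coeffs, row)) for row in basis]
    # Analysis: c_l = (2l + 1)/2 * sum_j w_j f(x_j) P_l(x_j)
    back = [
        (2 * l + 1) / 2.0 * sum(w * fj * row[l] for w, fj, row in zip(weights, f, basis))
        for l in range(lmax + 1)
    ]
    return max(abs(a - b) for a, b in zip(coeffs, back))


print(roundtrip_error(31))
```

Because an $(\ell_{max}+1)$-point Gauss-Legendre rule integrates polynomials up to degree $2\ell_{max}+1$ exactly, any residual error in this sketch comes from floating point rounding alone.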