Fortunately, there are only two performance critical operations that the linear system solver and the iterative eigen solver require:
with the
matrix H:
Due to the special representation of H, this operation involves matrix-matrix multiplies and a FFT, as explained below.
:
These operations are only required for the iterative eigensolver, for instance to achieve orthogonality between the different eigenvectors. By computing (9) for all i and j simultaneously, this turns into matrix-matrix multiplies, so we can implement (9) efficiently as level 3 BLAS calls.
The matrix-vector product
is the more difficult
part of the two kernels. We can cut the computational cost and avoid
the storage of H by exploiting its special structure. As an
operator, it can be
written as:
Since we keep the vectors in Fourier space, the
term is a diagonal matrix, and easy to apply. The second term
(``nonlocal potential'') is a sum of
rank one
matrices, which is also simply applied in Fourier space as a
matrix-matrix multiplication. The local potential term
in (10) is diagonal in a realspace basis.
Therefore, it is applied to a wave function
by first
inverse Fourier transforming
(operation count
), then multiplying with a
diagonal operator in realspace (operation count
), and
finally Fourier transforming back to the plane wave basis. Since
, and m and N are proportional to the number of
atoms
, the second term applied to a set of wave functions
requires
operations. The local term only
scales like
, but has a large
prefactor in front, such that it dominates the computational cost for
systems with less than 50 atoms.
Optimizing the kernel separately from the algorithm can lead to poor
performance. We realized that the Fourier transform involved in
has a large startup cost for bringing the real
space work arrays and the local potential operator
into cache. We gained a factor of 4 improvement by writing a blocked
version of the conjugate gradient linear solver that solves all m
equations in (7) simultaneously, and therefore applies
to the full set of wave
functions. This results in better cache utilization and avoids the low
memory bandwidth of RISC-based architectures. Such an algorithmic
change was not necessary for the conjugate gradient eigen
solver, since all operations there happen naturally in a blocked
fashion.