While the previous sections focussed on algorithmic aspects, in this section we report preliminary performance numbers regarding the actual implementation of the conjugate-gradient scheme. We gauge the speed on two different platforms: The SGI PowerChallenge and the Cray T3E. Both are based on fast, super-scalar RISC processors: The 16 nodes of the PowerChallenge (at NCSA in Illinois) are equipped with 195 MHz MIPS R10000 processors, whereas the 128 nodes of the T3E have a 300 MHz Alpha EV5 Chip inside. However, in contrast to the PowerChallenge, the T3E is a scalable parallel computer.
As far as the memory hierarchies are concerned, the SGI Power Challenge used for our tests has a unified 2 MB floating point data cache per node, which is also used as a secondary cache for instructions and integer data. The nodes go through a wide, fast, shared snoopy bus to access the global shared memory. The T3E on the other hand is a distributed memory machine, where the individual nodes have 256 MB of memory, and the EV5 Chip has a 8 KB direct-mapped L1 and a 96 KB 3-way set associative L2 cache. An additional streaming buffer between L2 cache and memory is designed to boost performance in case of sequential cache misses, but is not functional at the moment and was therefore switched off for our performance tests. The T3E nodes communicate through a scalable high-bandwidth low-latency interconnect.
The application for which the timings were performed is the conjugate
gradient algorithm as outlined in the previous section. The test problem
is a sample of amorphous hydrogenated carbon with
64 carbon and 12 hydrogen atoms. We use a plane-wave cutoff of 90
Rydbergs which leads to a matrix size of N=57,000. We keep the
Hamiltonian H fixed for this test, i.e. we use the
conjugate-gradient scheme to find the 21 smallest eigenvalues and
the corresponding eigenvectors
. Our code is written in
Fortran 90 and uses the MPI library for explicit message passing.
Since we parallelize with respect to plane-wave components, each
processor holds
components of each of the m wave functions,
where
is the number of processors put to work. The
matrix-matrix multiplies we timed are the ones for the second
(nonlocal) term in (3), and are consequently between
rectangular complex matrices of size
, with
. After the local
matrix-matrix multiplies have completed, a global summation is carried
out with the appropriate MPI reduction function.
The three-dimensional Fast Fourier Transform (3d-FFT) and inverse
transform involved in applying
to a trial wave function
requires a grid of
. The 3d-FFT is performed by
means of a series of 1d-FFTs, which are done by calls to optimized
library routines (the COMPLIB on the SGI and the LIBSCI on the
Cray). It requires a series of 1d-FFTs say first in the z direction,
then in the y direction, and finally in the x direction to
complete the 3d-FFT. By partitioning the Fourier space so that the
processors initially hold only complete lines along the z-direction,
we can do the first 1d-FFT
along z entirely locally. Before the 1d-FFTs in the y direction, we
transpose the grid such that every processor holds a set of lines
along the y direction. An analogous transpose is required before the
1d-FFT in the x direction is carried out. The transposition steps
are communication intensive, because every processor sends data to all
others. We use non-blocking MPI library routines to pass the data
between the nodes.
Figure 2 shows the performance per node of the SGI
PowerChallenge and the Cray T3E for a varying number of processors. On
the T3E we had to run on 4 or more processors in order to fit the data
into memory. Although the T3E at NERSC has currently 128 nodes, only a maximum
of 64 are available at the moment. To compute the MFLOPS, the
operation count for a 1d-FFT of length p was assumed to be
.
Figure 2: Floating point performance in MFLOPS/node on
the Cray T3E and the SGI PowerChallenge for varying number of
processors. The 3d-FFT has a grid size of
. The
matrix-matrix multiplies are of size
.
Both machines fall short of their peak performances of 390 MFLOPS/node
(SGI PowerChallenge) and 600 MFLOPS/node (Cray T3E). The FFTs are very
memory-access intensive, and are not expected to perform too well on
RISC-based machines where the memory access times are much longer than
on vector architectures. Indeed we find them to run at about 53
MFLOPS/node on 4 T3E processors, decreasing gradually due to
communication to about 36
MFLOPS/node on 64 processors. On the PowerChallenge, they run at about
48 MFLOPS/node on a single processor. When running on more processors,
the required communication first reduces the performance on the
PowerChallenge. On 8 or 16
processors though, one can see the effects of the increased
aggregate cache, which leads to super-linear speedup when running on 16
nodes (54 MFLOPS/node). Notice that the absolute performance numbers one
would get from a simple benchmark test on a 1d-FFT are much higher,
since our number includes not only the communication overhead, but
also time-consuming memory copies for the transpose operation and the
time to apply the operator
between the inverse
and the forward FFT.
On the matrix-matrix multiplies, the Cray excels with a performance
between 237 MFLOPS/node (on 4 processors) and 221 MFLOPS/node (on 64
processors). The performance curve of the PowerChallenge has more
structure.
Like in the case of the 3d-FFT, we see a super-linear speedup for the
matrix-matrix multiplies, but this time much more pronounced. This is
most likely also cache related. A simple matrix-matrix multiply
benchmark reveals that the performance of the PowerChallenge degrades
substantially when the shape of the rectangular matrices deviates
strongly from the square. This is the case for the matrices multiplied
here, because
and
if
is small. Using more processors thus brings the matrices into a more
convenient shape. The fact that the Cray T3E is
not showing such sensitivity to the shape of the matrix might indicate
a superior cache blocking technique of the ZGEMM routine in the
LIBSCI.
In summary, we find that on both machines, our implementation shows good parallel performance and demonstrates that plane-wave calculations can make efficient use of parallel platforms with a fast communication system. Further optimization is needed though to increase the single-node performance by avoiding time-consuming memory accesses.