In this section, we present performance numbers for the computational kernels, that is, the 3d-FFT and the matrix-matrix multiplies. For any system larger than about 5 atoms, more than 95% of the time is spent inside these kernels. We gauge their speed on our three production platforms: the SGI PowerChallenge, the Cray T3E, and the IBM SP2. All are based on fast, superscalar RISC processors: the 16 nodes of the PowerChallenge (at NCSA in Illinois) are equipped with 195 MHz MIPS R10000 processors, whereas the 128 nodes of the T3E each have a 300 MHz Alpha EV5 chip. On the IBM SP2, we use the ``thin'' nodes with the 66.7 MHz Power2 chip. In contrast to the PowerChallenge, the T3E and SP2 are scalable parallel computers.
As far as the memory hierarchies are concerned, the SGI PowerChallenge has a unified 2 MB floating point data cache per node, which is also used as a secondary cache for instructions and integer data. The nodes access the global shared memory through a wide, fast, shared snoopy bus. The T3E, on the other hand, is a distributed memory machine in which the individual nodes have 256 MB of memory, and the EV5 chip has an 8 KB direct-mapped L1 cache and a 96 KB 3-way set associative L2 cache. The T3E nodes communicate through a scalable high-bandwidth, low-latency interconnect. The IBM SP2 is also a distributed memory machine, but has a fat-tree high-performance network connecting its (``thin'') nodes. Those have a Power2 chip with a 64 KB L1 cache and no secondary cache. In contrast to the ``thick'' nodes, the bus to local memory is only 64 bits wide.
As a test problem, we took a sample of amorphous hydrogenated carbon
with 64 carbon and 12 hydrogen atoms. We use a plane-wave cutoff of 90
Rydbergs which leads to a matrix size of N=57000, and the number
of eigenvectors
is m=21.
The inverse and forward 3d-FFT involved in applying
to a
wave function requires a grid of
. The 1d-FFTs
are performed by calls to the LIBSCI (Cray), COMPLIB (SGI), and ESSL
(IBM) library routines. On all three machines, the native MPI
implementations are used.
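The way a 3d-FFT decomposes into 1d-FFTs can be sketched as follows. This is a minimal serial illustration in Python, not our production code: the radix-2 routine `fft1d` merely stands in for the vendor library calls, and all function names are hypothetical. The gather and scatter of strided lines corresponds loosely to the copy and transpose steps that a parallel implementation needs.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT (stand-in for a library call); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft3d(a):
    """3d-FFT of a nested list a[i][j][k], built from 1d-FFTs along each axis."""
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])
    b = [[[complex(a[i][j][k]) for k in range(n3)]
          for j in range(n2)] for i in range(n1)]
    # Last axis: the lines are contiguous, so they can be transformed directly.
    for i in range(n1):
        for j in range(n2):
            b[i][j] = fft1d(b[i][j])
    # Middle axis: gather each strided line, transform it, scatter it back.
    for i in range(n1):
        for k in range(n3):
            line = fft1d([b[i][j][k] for j in range(n2)])
            for j in range(n2):
                b[i][j][k] = line[j]
    # First axis: same gather/transform/scatter pattern.
    for j in range(n2):
        for k in range(n3):
            line = fft1d([b[i][j][k] for i in range(n1)])
            for i in range(n1):
                b[i][j][k] = line[i]
    return b
```

In a distributed setting, at least one of the gather steps crosses processor boundaries, which is where the communication cost of the 3d-FFT arises.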
Figure 4 shows the performance per node for a varying
number of processors. On the T3E we had to run on 4 or more processors
in order to fit the data into memory. Although the T3E at NERSC
currently has 128 nodes, at most 64 are available at the moment,
and only the 32-node queues are conveniently accessible. To compute
the MFLOPS, the operation count for a 1d-FFT of length p was assumed
to be
.
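As an illustration of how such a MFLOPS figure can be derived, the following sketch assumes the conventional count of 5 p log2(p) real floating-point operations for a complex 1d-FFT of length p. This count, the function names, and the example grid and timing are assumptions made for illustration only; they are not values from our measurements.

```python
import math

def fft1d_flops(p):
    """Assumed operation count for one complex 1d-FFT of length p: 5 p log2(p)."""
    return 5.0 * p * math.log2(p)

def fft3d_flops(n1, n2, n3):
    """A 3d-FFT on an n1 x n2 x n3 grid consists of n2*n3 transforms of
    length n1, n1*n3 transforms of length n2, and n1*n2 of length n3."""
    return (n2 * n3 * fft1d_flops(n1)
            + n1 * n3 * fft1d_flops(n2)
            + n1 * n2 * fft1d_flops(n3))

def mflops_per_node(flops, seconds, nodes):
    """Convert an operation count and a wall-clock time into MFLOPS per node."""
    return flops / seconds / nodes / 1.0e6

# Hypothetical example: a 64^3 grid transformed in 0.05 s on 4 nodes.
example_rate = mflops_per_node(fft3d_flops(64, 64, 64), 0.05, 4)
print(f"{example_rate:.1f} MFLOPS/node")
```

Because the wall-clock time in the denominator includes communication and memory copies while the operation count does not, a rate computed this way is lower than that of an isolated 1d-FFT benchmark.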
Figure 4: Floating point performance in MFLOPS/node on
the Cray T3E, the SGI PowerChallenge, and the IBM SP2 for a varying number of
processors. The 3d-FFT has a grid size of
. The
matrix-matrix multiplies are of size
.
All machines fall short of their peak performances of 390 MFLOPS/node
(SGI PowerChallenge), 600 MFLOPS/node (Cray T3E), and 266 MFLOPS/node
(IBM SP2). The FFTs are very memory-access intensive, and are not
expected to perform too well on RISC-based machines where the memory
access times are much longer than on vector architectures. Indeed we
find them to run at about 76 MFLOPS/node on 4 T3E processors,
decreasing gradually due to communication to about 62 MFLOPS/node on
32 processors. On the PowerChallenge, they run at about 48 MFLOPS/node
on a single processor. When running on more processors, the required
communication first reduces the performance on the PowerChallenge. On
8 or 16 processors though, one can see the effects of the increased
aggregate cache, which leads to super-linear speedup when running on
16 nodes (54 MFLOPS/node). The performance curve of the SP2 resembles
that of the T3E: it starts out at about 45 MFLOPS/node on two nodes
and decreases steadily to about 26 MFLOPS/node on 32 processors. Note
that the absolute performance numbers one would
get from a simple benchmark test on a 1d-FFT are much higher, since
our numbers include not only the communication overhead, but also
time-consuming memory copies for the transpose operation and the time
to apply the operator
between the inverse and the
forward FFT. The 3d-FFT intrinsically implies a large amount of
communication, and requires a parallel supercomputer as opposed to a
cluster of workstations.
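To put the FFT rates in perspective, the following sketch relates them to the peak rates. The clock frequencies, peak rates, and measured rates are those quoted above. The floating-point operations per cycle (2, 2, and 4) are inferred from the ratio of peak rate to clock frequency and should be read as an assumption, as should all names used here.

```python
# name: (clock in MHz, assumed flops per cycle, quoted peak in MFLOPS/node)
machines = {
    "SGI PowerChallenge": (195.0, 2, 390.0),
    "Cray T3E": (300.0, 2, 600.0),
    "IBM SP2": (66.7, 4, 266.0),
}

# Best 3d-FFT rates quoted in the text, in MFLOPS/node.
fft_rates = {
    "SGI PowerChallenge": 54.0,  # 16 processors
    "Cray T3E": 76.0,            # 4 processors
    "IBM SP2": 45.0,             # 2 processors
}

def fraction_of_peak(measured, peak):
    """Measured rate as a fraction of the peak rate."""
    return measured / peak

for name, (clock_mhz, flops_per_cycle, peak) in machines.items():
    frac = fraction_of_peak(fft_rates[name], peak)
    print(f"{name}: clock x flops/cycle = {clock_mhz * flops_per_cycle:.1f}, "
          f"3d-FFT at {100.0 * frac:.1f}% of peak")
```

Under these assumptions, the 3d-FFT reaches only between roughly one eighth and one sixth of peak on all three machines.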
On the matrix-matrix multiplies, the Cray excels with a performance
between 242 MFLOPS/node (on 4 processors) and 236 MFLOPS/node (on 32
processors). The IBM slows down from 124 MFLOPS/node (2 processors) to
113 MFLOPS/node (32 processors) in a similar fashion. The performance
curve of the PowerChallenge has more structure. As in the case of
the 3d-FFT, we see a super-linear speedup for the matrix-matrix
multiplies, but this time much more pronounced. This is most likely
also cache related. A simple matrix-matrix multiply benchmark reveals
that the performance of the PowerChallenge degrades substantially when
the shape of the rectangular matrices deviates strongly from the
square. This is the case for the matrices multiplied here, because
and
if
is
small. Using more processors thus brings the matrices into a more
convenient shape. The fact that the Cray T3E and the IBM SP2 do not
show such sensitivity to the shape of the matrices might indicate a
superior cache blocking technique in the ZGEMM routines of LIBSCI and
ESSL.
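The effect of the processor count on the shape and size of the local matrices can be estimated with a short calculation. The sketch below assumes that the long dimension N is divided evenly among the processors and that each element is a double-precision complex number of 16 bytes, as handled by ZGEMM. The distribution scheme and all names are assumptions, and the estimate covers a single operand block only.

```python
import math

N = 57000                      # matrix size of the test problem
M = 21                         # number of eigenvectors
BYTES_PER_ELEMENT = 16         # double-precision complex
CACHE_BYTES = 2 * 1024 * 1024  # 2 MB cache per PowerChallenge node

def local_block(n, m, nprocs):
    """Rows, columns, and size in bytes of one processor's block, assuming
    the n rows are divided evenly among nprocs processors."""
    rows = math.ceil(n / nprocs)
    return rows, m, rows * m * BYTES_PER_ELEMENT

for nprocs in (1, 2, 4, 8, 16):
    rows, cols, nbytes = local_block(N, M, nprocs)
    fits = "fits in" if nbytes <= CACHE_BYTES else "exceeds"
    print(f"{nprocs:2d} processors: {rows} x {cols} block, "
          f"aspect ratio {rows / cols:.0f}:1, {nbytes / 1e6:.2f} MB, {fits} cache")
```

Under these assumptions, the local block becomes less elongated as processors are added and drops below the 2 MB cache size only at 16 processors. This is consistent with, though not proof of, a cache-related origin of the super-linear speedup on the PowerChallenge.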