next up previous
Next: References Up: Conjugate-Gradient Based Electronic Structure Previous: A numerical example

Parallel Performance on the Cray T3E and SGI PowerChallenge

While the previous sections focussed on algorithmic aspects, in this section we report preliminary performance numbers regarding the actual implementation of the conjugate-gradient scheme. We gauge the speed on two different platforms: The SGI PowerChallenge and the Cray T3E. Both are based on fast, super-scalar RISC processors: The 16 nodes of the PowerChallenge (at NCSA in Illinois) are equipped with 195 MHz MIPS R10000 processors, whereas the 128 nodes of the T3E have a 300 MHz Alpha EV5 Chip inside. However, in contrast to the PowerChallenge, the T3E is a scalable parallel computer.

As far as the memory hierarchies are concerned, the SGI Power Challenge used for our tests has a unified 2 MB floating point data cache per node, which is also used as a secondary cache for instructions and integer data. The nodes go through a wide, fast, shared snoopy bus to access the global shared memory. The T3E on the other hand is a distributed memory machine, where the individual nodes have 256 MB of memory, and the EV5 Chip has a 8 KB direct-mapped L1 and a 96 KB 3-way set associative L2 cache. An additional streaming buffer between L2 cache and memory is designed to boost performance in case of sequential cache misses, but is not functional at the moment and was therefore switched off for our performance tests. The T3E nodes communicate through a scalable high-bandwidth low-latency interconnect.

The application for which the timings were performed is the conjugate gradient algorithm as outlined in the previous section. The test problem is a sample of amorphous hydrogenated carbon with 64 carbon and 12 hydrogen atoms. We use a plane-wave cutoff of 90 Rydbergs which leads to a matrix size of N=57,000. We keep the Hamiltonian H fixed for this test, i.e. we use the conjugate-gradient scheme to find the 21 smallest eigenvalues and the corresponding eigenvectorsgif. Our code is written in Fortran 90 and uses the MPI library for explicit message passing.

Since we parallelize with respect to plane-wave components, each processor holds components of each of the m wave functions, where is the number of processors put to work. The matrix-matrix multiplies we timed are the ones for the second (nonlocal) term in (3), and are consequently between rectangular complex matrices of size , with . After the local matrix-matrix multiplies have completed, a global summation is carried out with the appropriate MPI reduction function.

The three-dimensional Fast Fourier Transform (3d-FFT) and inverse transform involved in applying to a trial wave function requires a grid of . The 3d-FFT is performed by means of a series of 1d-FFTs, which are done by calls to optimized library routines (the COMPLIB on the SGI and the LIBSCI on the Cray). It requires a series of 1d-FFTs say first in the z direction, then in the y direction, and finally in the x direction to complete the 3d-FFT. By partitioning the Fourier space so that the processors initially hold only complete lines along the z-direction, we can do the first 1d-FFT along z entirely locally. Before the 1d-FFTs in the y direction, we transpose the grid such that every processor holds a set of lines along the y direction. An analogous transpose is required before the 1d-FFT in the x direction is carried out. The transposition steps are communication intensive, because every processor sends data to all others. We use non-blocking MPI library routines to pass the data between the nodes.

Figure 2 shows the performance per node of the SGI PowerChallenge and the Cray T3E for a varying number of processors. On the T3E we had to run on 4 or more processors in order to fit the data into memory. Although the T3E at NERSC has currently 128 nodes, only a maximum of 64 are available at the moment. To compute the MFLOPS, the operation count for a 1d-FFT of length p was assumed to be .

 
Figure 2:  Floating point performance in MFLOPS/node on the Cray T3E and the SGI PowerChallenge for varying number of processors. The 3d-FFT has a grid size of . The matrix-matrix multiplies are of size .

Both machines fall short of their peak performances of 390 MFLOPS/node (SGI PowerChallenge) and 600 MFLOPS/node (Cray T3E). The FFTs are very memory-access intensive, and are not expected to perform too well on RISC-based machines where the memory access times are much longer than on vector architectures. Indeed we find them to run at about 53 MFLOPS/node on 4 T3E processors, decreasing gradually due to communication to about 36 MFLOPS/node on 64 processors. On the PowerChallenge, they run at about 48 MFLOPS/node on a single processor. When running on more processors, the required communication first reduces the performance on the PowerChallenge. On 8 or 16 processors though, one can see the effects of the increased aggregate cache, which leads to super-linear speedup when running on 16 nodes (54 MFLOPS/node). Notice that the absolute performance numbers one would get from a simple benchmark test on a 1d-FFT are much higher, since our number includes not only the communication overhead, but also time-consuming memory copies for the transpose operation and the time to apply the operator between the inverse and the forward FFT.

On the matrix-matrix multiplies, the Cray excels with a performance between 237 MFLOPS/node (on 4 processors) and 221 MFLOPS/node (on 64 processors). The performance curve of the PowerChallenge has more structure. Like in the case of the 3d-FFT, we see a super-linear speedup for the matrix-matrix multiplies, but this time much more pronounced. This is most likely also cache related. A simple matrix-matrix multiply benchmark reveals that the performance of the PowerChallenge degrades substantially when the shape of the rectangular matrices deviates strongly from the square. This is the case for the matrices multiplied here, because and if is small. Using more processors thus brings the matrices into a more convenient shape. The fact that the Cray T3E is not showing such sensitivity to the shape of the matrix might indicate a superior cache blocking technique of the ZGEMM routine in the LIBSCI.

In summary, we find that on both machines, our implementation shows good parallel performance and demonstrates that plane-wave calculations can make efficient use of parallel platforms with a fast communication system. Further optimization is needed though to increase the single-node performance by avoiding time-consuming memory accesses.



next up previous
Next: References Up: Conjugate-Gradient Based Electronic Structure Previous: A numerical example



Bernd Pfrommer
Tue Jul 22 17:09:54 PDT 1997