Next: Application Up: NMR Chemical Shift Calculations Previous: Parallel Implementation

Performance measurements

In this section, we present performance numbers for the computational kernels, that is, the 3d-FFT and the matrix-matrix multiplies. For any system larger than about 5 atoms, more than 95% of the time is spent inside these kernels. We gauge their speed on our three production platforms: the SGI PowerChallenge, the Cray T3E, and the IBM SP2. All are based on fast, super-scalar RISC processors: the 16 nodes of the PowerChallenge (at NCSA in Illinois) are equipped with 195 MHz MIPS R10000 processors, whereas the 128 nodes of the T3E each have a 300 MHz Alpha EV5 chip. On the IBM SP2, we use the "thin" nodes with the 66.7 MHz Power2 chip. In contrast to the PowerChallenge, the T3E and SP2 are scalable parallel computers.

As far as the memory hierarchies are concerned, the SGI PowerChallenge has a unified 2 MB floating point data cache per node, which is also used as a secondary cache for instructions and integer data. The nodes access the global shared memory through a wide, fast, shared snoopy bus. The T3E, on the other hand, is a distributed memory machine in which each node has 256 MB of memory, and the EV5 chip has an 8 KB direct-mapped L1 cache and a 96 KB 3-way set associative L2 cache. The T3E nodes communicate through a scalable high-bandwidth, low-latency interconnect. The IBM SP2 is also a distributed memory machine, but has a fat-tree high-performance network connecting its "thin" nodes. These have a Power2 chip with a 64 KB L1 cache and no secondary cache. In contrast to the "thick" nodes, the bus to local memory is only 64 bits wide.

As a test problem, we took a sample of amorphous hydrogenated carbon with 64 carbon and 12 hydrogen atoms. We use a plane-wave cutoff of 90 Rydbergs, which leads to a matrix size of N=57000; the number of eigenvectors is m=21.
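To give a rough feeling for the data volume behind these numbers, the following sketch estimates the storage taken by the m eigenvectors alone. It assumes double-precision complex coefficients (16 bytes each), which is an assumption suggested by the use of ZGEMM but not stated explicitly; the function names are hypothetical, and FFT grids and work arrays are not counted.

```python
# Rough storage estimate for the block of eigenvectors in the test problem.
# Assumption: each coefficient is a double-precision complex number (16 bytes).
BYTES_PER_COMPLEX_DOUBLE = 16


def eigenvector_block_bytes(n_coeffs, n_vectors,
                            bytes_per_element=BYTES_PER_COMPLEX_DOUBLE):
    """Bytes needed to hold n_vectors vectors of n_coeffs coefficients each."""
    return n_coeffs * n_vectors * bytes_per_element


def per_node_mib(total_bytes, nodes):
    """Share of the data held by each node, in MiB, for an even distribution."""
    return total_bytes / nodes / 2**20


if __name__ == "__main__":
    total = eigenvector_block_bytes(57000, 21)  # N and m from the text
    for nodes in (1, 4, 8, 16, 32):
        print(f"{nodes:2d} nodes: {per_node_mib(total, nodes):6.2f} MiB per node")
```

The even distribution over nodes is itself a simplification; the actual data layout of the code may differ.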

The inverse and forward 3d-FFT involved in applying the operator to a wave function requires a grid of . The 1d-FFTs are performed by calls to the LIBSCI (Cray), COMPLIB (SGI), and ESSL (IBM) library routines. On all three machines, the native MPI implementations are used.

Figure 4 shows the performance per node for a varying number of processors. On the T3E, we had to run on 4 or more processors in order to fit the data into memory. Although the T3E at NERSC currently has 128 nodes, at most 64 are available at present, and only the 32-node queues are conveniently accessible. To compute the MFLOPS, the operation count for a 1d-FFT of length p was assumed to be .
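The conversion from a measured time to MFLOPS/node can be sketched as follows. The count of 5 p log2(p) floating point operations per complex 1d-FFT used here is a common convention and is an assumption, not a figure taken from the text; the grid size and timing in the usage example are likewise hypothetical.

```python
import math


def fft1d_flops(p, factor=5.0):
    """Assumed operation count for one complex 1d-FFT of length p.

    The default factor of 5 (i.e. 5 * p * log2(p)) is an assumption.
    """
    return factor * p * math.log2(p)


def fft3d_flops(n1, n2, n3):
    """Operation count of a 3d-FFT built from batches of 1d-FFTs per axis."""
    return (n2 * n3 * fft1d_flops(n1)     # transforms along the first axis
            + n1 * n3 * fft1d_flops(n2)   # transforms along the second axis
            + n1 * n2 * fft1d_flops(n3))  # transforms along the third axis


def mflops_per_node(total_flops, seconds, nodes):
    """Floating point rate per node in MFLOPS for a measured wall time."""
    return total_flops / seconds / nodes / 1.0e6


if __name__ == "__main__":
    # Hypothetical example: one 64^3 transform taking 0.05 s on 4 nodes.
    flops = fft3d_flops(64, 64, 64)
    print(f"{mflops_per_node(flops, 0.05, 4):.1f} MFLOPS/node")
```

Because the wall time includes communication and memory copies while the operation count does not, a rate computed this way is lower than what an isolated 1d-FFT benchmark would report.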

 
Figure 4:  Floating point performance in MFLOPS/node on the Cray T3E, the SGI PowerChallenge, and the IBM SP2 for varying number of processors. The 3d-FFT has a grid size of . The matrix-matrix multiplies are of size .

All machines fall short of their peak performances of 390 MFLOPS/node (SGI PowerChallenge), 600 MFLOPS/node (Cray T3E), and 266 MFLOPS/node (IBM SP2). The FFTs are very memory-access intensive and are not expected to perform well on RISC-based machines, where memory access times are much longer than on vector architectures. Indeed, we find them to run at about 76 MFLOPS/node on 4 T3E processors, decreasing gradually, due to communication, to about 62 MFLOPS/node on 32 processors. On the PowerChallenge, they run at about 48 MFLOPS/node on a single processor. When running on more processors, the required communication at first reduces the performance on the PowerChallenge. On 8 or 16 processors, though, one can see the effect of the increased aggregate cache, which leads to super-linear speedup when running on 16 nodes (54 MFLOPS/node). The performance curve of the SP2 resembles that of the T3E: it starts out at about 45 MFLOPS/node on two nodes and decreases steadily to about 26 MFLOPS/node on 32 processors. Note that the absolute performance numbers one would get from a simple benchmark of a 1d-FFT are much higher, since our numbers include not only the communication overhead, but also the time-consuming memory copies for the transpose operation and the time to apply the operator between the inverse and the forward FFT. The 3d-FFT intrinsically implies a large amount of communication and therefore requires a parallel supercomputer as opposed to a cluster of workstations.
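Since the problem size is fixed, the ratio of per-node rates at two processor counts is the parallel efficiency relative to the smaller run, and a ratio above one corresponds to super-linear speedup. A minimal sketch using only the 3d-FFT rates quoted above (the function names are illustrative):

```python
# Per-node 3d-FFT rates quoted in the text: (processors, MFLOPS/node).
fft_rates = {
    "Cray T3E": [(4, 76.0), (32, 62.0)],
    "SGI PowerChallenge": [(1, 48.0), (16, 54.0)],
    "IBM SP2": [(2, 45.0), (32, 26.0)],
}


def relative_efficiency(base, scaled):
    """Per-node rate of the larger run divided by that of the smaller run."""
    (_, base_rate), (_, scaled_rate) = base, scaled
    return scaled_rate / base_rate


def speedup(base, scaled):
    """Ratio of aggregate rates (processors * per-node rate) of the two runs."""
    (p0, r0), (p1, r1) = base, scaled
    return (p1 * r1) / (p0 * r0)


if __name__ == "__main__":
    for machine, (base, scaled) in fft_rates.items():
        ideal = scaled[0] / base[0]
        print(f"{machine}: speedup {speedup(base, scaled):.2f} "
              f"(ideal {ideal:.0f}), "
              f"efficiency {relative_efficiency(base, scaled):.2f}")
```

For the PowerChallenge the efficiency exceeds one, consistent with the super-linear speedup attributed to the aggregate cache, while the T3E and SP2 values fall below one as communication costs grow.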

On the matrix-matrix multiplies, the Cray excels with a performance between 242 MFLOPS/node (on 4 processors) and 236 MFLOPS/node (on 32 processors). The IBM slows down in a similar fashion, from 124 MFLOPS/node (2 processors) to 113 MFLOPS/node (32 processors). The performance curve of the PowerChallenge has more structure. As in the case of the 3d-FFT, we see a super-linear speedup for the matrix-matrix multiplies, but this time it is much more pronounced. This is most likely also cache related. A simple matrix-matrix multiply benchmark reveals that the performance of the PowerChallenge degrades substantially when the shape of the rectangular matrices deviates strongly from square. This is the case for the matrices multiplied here, because and if is small. Using more processors thus brings the matrices into a more convenient shape. The fact that the Cray T3E and the IBM SP2 do not show such sensitivity to the shape of the matrix might indicate a superior cache blocking technique in the ZGEMM routines of LIBSCI and ESSL.
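To put the two kernels side by side, the measured rates can be expressed as a fraction of the peak rate of each machine. The sketch below uses only numbers quoted in the text; the dictionary layout and function name are illustrative.

```python
# Peak rates (MFLOPS/node) and measured rates quoted in the text.
peak = {"SGI PowerChallenge": 390.0, "Cray T3E": 600.0, "IBM SP2": 266.0}

# (machine, kernel, processors) -> measured MFLOPS/node
measured = {
    ("Cray T3E", "3d-FFT", 4): 76.0,
    ("Cray T3E", "3d-FFT", 32): 62.0,
    ("Cray T3E", "matrix multiply", 4): 242.0,
    ("Cray T3E", "matrix multiply", 32): 236.0,
    ("SGI PowerChallenge", "3d-FFT", 1): 48.0,
    ("SGI PowerChallenge", "3d-FFT", 16): 54.0,
    ("IBM SP2", "3d-FFT", 2): 45.0,
    ("IBM SP2", "3d-FFT", 32): 26.0,
    ("IBM SP2", "matrix multiply", 2): 124.0,
    ("IBM SP2", "matrix multiply", 32): 113.0,
}


def fraction_of_peak(machine, rate):
    """Measured per-node rate divided by the machine's peak per-node rate."""
    return rate / peak[machine]


if __name__ == "__main__":
    for (machine, kernel, procs), rate in measured.items():
        pct = 100.0 * fraction_of_peak(machine, rate)
        print(f"{machine:20s} {kernel:16s} {procs:2d} procs: {pct:5.1f}% of peak")
```

On the T3E and the SP2 the matrix-matrix multiplies reach roughly 40% or more of peak, whereas the 3d-FFT stays below 20% on all three machines, in line with its memory-access-intensive character.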






Bernd Pfrommer
Mon May 26 12:08:17 PDT 1997