next up previous
Next: Performance measurements Up: NMR Chemical Shift Calculations Previous: Computational Kernels

Parallel Implementation

Since quantum-chemistry calculations are deployed widely, there have been a number of efforts to parallelize such codes [3], [4], [5]. We take a message passing SIMD approach using Fortran 90 and the MPI library to achieve portability and performance. Whenever possible, we call to vendor-provided, optimized library routines.

The most critical decision is about the data layout. A partitioning by vector indices is the best choice [3]. This means each processor holds a certain range of indices of each vector, e.g. processor 0 holds the first 100 vector components of all m vectors, processor 1 holds components 101 through 200 etc. This will make the calculation of dot products between vectors trivial -- there is only a global summation required after each processor has computed its local part of the dot product. This generalizes easily to the case of multiple dot products, i.e. matrix-matrix multiplies. We perform the global summation with a call to the MPI reduce function.

How do we assign vector indices to processors? To achieve load balance on the dot products, we should have an equal number of vector indices on each processor. Which ones exactly is unimportant to the dot products, but not to the 3d-FFT.

 
Figure 1:  The first stage of the inverse 3d-FFT requires no communication, since the data layout is such that the inverse 1d-FFT along the z-direction is completely local. The grid points inside the sphere correspond to vector indices, those outside are zero at first.

Figure 1 shows the FFT grid, which is occupied with non-zero components only within a sphere at the center of the grid. The grid points inside the sphere are the vector indices in Fourier space. A inverse 3d-FFT is performed by doing an inverse 1d-FFT along the z direction first, then the y direction, and lastly the x direction. Since at first only data points within the small sphere (its diameter is half the size of the grid) are nonzero, we can perform the first stage inverse 1d-FFT just along the vertical colored lines. To make this operation happen completely local, a processor always holds a complete ``line'' along the z-direction. In other words, each processor holds vector indices corresponding to complete line segments inside the sphere. Figure 1 shows some of those as dashed lines inside the sphere. To achieve good load balancing, we follow a suggestion by Andrew Canning and hand out the ``lines'' to the processors such that the processors get a balanced number of short and long lines (the length of a line depends on the place where the line pierces the sphere). The inverse 1d-FFTs are computed by calls to optimized library routines.

 
Figure 2:  After the first stage of the inverse 3d-FFT, the data has spread out to form a cylinder. A transposition of the grid leads to the data layout shown here, and the inverse 1d-FFT along the y-direction can be done.

After the inverse FFT along z, the data has spread out, and now occupies a cylinder inside the grid (Figure 2). At this point, each processor communicates a single message to all the other processorsgif to achieve a transposition of the grid, and to arrive at the data layout shown in Figure 2. Then the inverse 1d-FFT along the y-direction is a local operation.

 
Figure 3:  Third stage of the inverse 3d-FFT. After the second stage, all data points within a parallel epiped are nonzero, and a second transpose brings the data into the form shown here. Now, the final inverse 1d-FFT along the x-direction can be done locally.

After the second stage, all data points within a parallel epiped are nonzero, and a second transpose brings the data into the form shown in Figure 3. Now, the final inverse 1d-FFT along the x-direction can be done locally. Since the Fourier grid sizes are preferably powers of two, and so are often the number of processors, the second transpose does not always require communication. In the special case where the grid size in z-direction is an integral multiple of the number of processors, the transpose reduces to a local memory copy.

After the last stage of the inverse 3d-FFT is performed, and the vector is present in realspace, we directly multiply point-by-point with the local potential operator , and immediately perform the forward 1d-FFT in x-direction, which is the first stage of the forward 3d-FFT that brings the vector back to Fourier space. By combining these operations, we can keep the data in cache as long as possible. The remaining two stages of the forward 3d-FFT are analogous to those of the inverse 3d-FFT, and will not be described here.

We conclude this section by summarizing the merits of hand-coding the 3d-FFT based on 1d-FFTs as opposed to using vendor-provided 3d-FFT routines:



next up previous
Next: Performance measurements Up: NMR Chemical Shift Calculations Previous: Computational Kernels



Bernd Pfrommer
Mon May 26 12:08:17 PDT 1997