The enormous increase in available computing power is certainly one of the key factors for the recent advances in computational condensed matter theory. Using density functional theory (DFT) in the local density approximation (LDA) or combined with generalized gradient approximations (GGAs), it is now possible to compute with sufficient accuracy the ground state properties of quantum-mechanical systems up to hundreds of atoms large. A wealth of information can be extracted from these CPU-intensive calculations, e.g. binding energies, stable configurations, magnetic and dielectric properties, and insight into electronic properties such as the charge density, band structure, and the quantum-mechanical wave functions.
To reduce the computational effort, we use pseudopotentials to
represent the atomic nucleus and the core electrons, leaving only
the set of m valence wave functions
to be dealt
with. Those are
determined by the requirement that they minimize the DFT/LDA energy
functional
subject to the constraint that the
be mutually
orthogonal and normalized.
With a suitable construction of the pseudopotentials, the wave
functions in (1) can be represented in a plane-wave
basis of size N, thus turning (1) into a minimization
problem of dimension Nm, with N often as large as
, but m being substantially smaller, typically 1/1000 to 1/100
of N.
Fortunately, the derivative
(not including the constraints!) is available to aid the search for
the minimum.
In a plane-wave basis, the Hamiltonian operator
turns into
the (
) matrix
H, which can easily become too large to fit into memory. However,
most iterative schemes require only a rule describing how to apply
to a
given wave function, thus removing the need for explicit storage of
H. The separable Kleinman-Bylander pseudopotentials allow for a
particularly efficient representation of
, namely a
decomposition of the form:
In Eq. (3), the first term
is the kinetic energy
operator, which is proportional to the Laplacian and therefore
diagonal in a plane wave basis
. However,
it grows like
with the wave number
,
and gives the matrix H its characteristic diagonally-dominant
shape. It also motivates our convention for arranging the plane wave
basis according to increasing G. That way, the lower right corner
of H will be dominated by the diagonal elements arising from
. The second so-called nonlocal term in (3)
consists of a sum of
projectors
onto
one-dimensional subspaces, and requires only
operations to be applied to a set of trial wave functions. Finally,
the local potential term
in (3) is diagonal
in a realspace basis. Therefore, it is applied to a trial wave function
by first inverse Fourier transforming
(operation count
), then multiplying with a
diagonal operator in realspace (operation count
), and
finally Fourier transforming back to the plane wave basis. Since
, and m and N are proportional to the number of
atoms
, the second term applied to a set of wave functions
requires
operations
.
The local term only scales like
, but has a large prefactor in front, such that it
dominates the computational cost for systems with less than 50
atoms.
Given the gradient (2), the minimization problem (1) can be tackled with a standard optimization scheme. The use of the popular conjugate gradients method is impeded by the need to include the orthonormality constraints[2]. We follow the work of Mauri et al [1] and use a functional with implicit orthonormality constraints, which allows us to use a textbook conjugate gradients scheme. In section 3, we exploit this to assess the convergence properties of the algorithm for a representative, but small model problem. A numerical example verifies our analysis and demonstrates the effect of preconditioning.
Even with the efficient representation (3), the task of minimizing E in (1) is formidable. Only the use of parallel computers will allow to explore the very large, interesting systems such as surfaces, interfaces, alloys, amorphous solids, liquids, and large molecules. There is agreement that the best way to parallelize the minimization problem is by partitioning all vector quantities (i.e. wave functions) according to their plane wave basis indices[4]. Then, dot products between two vectors are completely local, and only require a global summation at the end. Concerning the parallel performance, the challenge lies in the Fourier transform. We will show that even with an MPI explicit message passing approach, a high-bandwidth low-latency interconnect like the one of the Cray T3E is necessary to give good application performance with a larger number of processors. We also compare the performance of the T3E to the one of the SGI PowerChallenge as a prototypical symmetric multiprocessor.