With respect to the networking hardware, we assume that eventually ATM
networks will link the processing units with a very low error rate
such that lost packets are a rare event. Assuming a bit error
rate of $10^{-10}$ and a packet size of about 1 kB, there will be at most
one packet loss every $10^{6}$ packets. For example, an additional
overhead of, say, 1 µs/packet due to
a more sophisticated protocol will add up to a full second of overhead
per dropped packet. For this reason, we think that on low error rate
networks, minimal protocol overhead is an important design goal. While
we make the common case go fast, we must also ensure that the rare
case executes correctly. Assuming a correct MPI application, our
protocol should handle all failure modes correctly.
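
The overhead argument can be checked with a short calculation. The following is a minimal sketch, not part of the protocol itself: the function names are hypothetical, and the bit error rate (1e-10), packet size (1024 bytes) and per-packet overhead (1 µs) are illustrative assumptions chosen to match the orders of magnitude discussed above. It also assumes independent bit errors and that any corrupted bit causes the whole packet to be dropped.

```python
import math


def packets_per_loss(bit_error_rate: float, packet_bytes: int) -> float:
    """Expected number of packets sent per lost packet.

    Assumes independent bit errors and that a single corrupted bit
    causes the whole packet to be dropped.
    """
    bits = packet_bytes * 8
    # P(loss) = 1 - (1 - BER)^bits, computed with log1p/expm1 so the
    # result stays accurate for very small error rates.
    p_loss = -math.expm1(bits * math.log1p(-bit_error_rate))
    return 1.0 / p_loss


def overhead_per_loss(extra_seconds_per_packet: float,
                      bit_error_rate: float,
                      packet_bytes: int) -> float:
    """Total extra protocol time spent for each packet that is lost."""
    return extra_seconds_per_packet * packets_per_loss(bit_error_rate,
                                                       packet_bytes)


if __name__ == "__main__":
    # Illustrative values only (assumptions, see the text above).
    n = packets_per_loss(1e-10, 1024)
    t = overhead_per_loss(1e-6, 1e-10, 1024)
    print(f"packets per loss: {n:.3g}")
    print(f"overhead per lost packet: {t:.3g} s")
```

Under these assumptions, a loss occurs roughly once every 1.2 million packets, so one extra microsecond per packet costs on the order of a second for each loss the heavier protocol guards against.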