What follows describes the use of the reliable stream to implement the
MPI short messages protocol. When a short message is sent by
the application, it is placed into the reliable message stream. A
sequence number gets assigned to it, and a queue descriptor is pushed
into the U-Net transmission FIFO. The U-Net device immediately
transmits the message. On the receiving side, the U-Net device picks
the message up and places it into the U-Net receive buffer. When the
application on the receiver calls a receiving subroutine, the ADI
layer checks for messages in the receive buffer. If it detects an
out-of-order or duplicate (retransmitted) packet, it immediately sends
out a duplicate acknowledgment. If the received packet is in order, no
explicit acknowledgment is sent, and the receive call returns immediately. The
acknowledgments are all piggybacked onto outgoing messages, unless they are
triggered by retransmissions or out-of-order packets. This lazy
acknowledgment strategy will work well for small loss rates, and for
applications that exchange a comparable number of messages in a rather
synchronous fashion. It will not work well for a producer-consumer
type of application, where the consumer sends few messages that could
carry acknowledgments. In this case, one could configure MPI with an
environment variable to follow an eager acknowledgment protocol.
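The receive-side rule described above can be sketched as follows. This is only an illustrative model, not the actual ADI code: the class name `Receiver`, its fields, and the convention that an acknowledgment carries the sequence number of the last in-order packet (cumulative acknowledgment, -1 if none) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiver with lazy (piggybacked) acknowledgments."""
    expected: int = 0           # next in-order sequence number
    ack_pending: bool = False   # an in-order packet still awaits its ack
    delivered: list = field(default_factory=list)
    acks_sent: list = field(default_factory=list)  # explicit, non-piggybacked acks

    def on_packet(self, seq, payload):
        """Process one packet taken from the receive buffer."""
        if seq == self.expected:
            # In order: deliver it and defer the ack to the next outgoing message.
            self.delivered.append(payload)
            self.expected += 1
            self.ack_pending = True
            return True
        # Out of order or duplicate: discard the packet and immediately
        # re-acknowledge the last packet that was received in order.
        self.acks_sent.append(self.expected - 1)
        return False

    def piggyback_ack(self):
        """Return the ack to attach to an outgoing message, or None."""
        if not self.ack_pending:
            return None
        self.ack_pending = False
        return self.expected - 1
```

In this model an in-order packet costs no extra message; an acknowledgment travels on its own only when a gap or a duplicate is observed.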
For simplicity, only a single retransmission timer is maintained, for
the packet at the left edge of the sender window. Upon retransmission,
the timer is backed off exponentially, clamped by a constant. Incoming
acknowledgments reset the timer.
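A minimal sketch of such a timer is given below. The class name `RetransmitTimer`, the doubling factor, and the millisecond values are assumptions chosen for illustration; the text does not specify them. "Reset" is interpreted here as restoring the initial timeout.

```python
class RetransmitTimer:
    """Hypothetical single timer guarding the oldest unacknowledged packet."""

    def __init__(self, initial_ms=10, ceiling_ms=1000):
        self.initial_ms = initial_ms   # assumed starting timeout
        self.ceiling_ms = ceiling_ms   # assumed constant that clamps the backoff
        self.timeout_ms = initial_ms

    def on_retransmit(self):
        """Double the timeout after a retransmission, up to the ceiling."""
        self.timeout_ms = min(self.timeout_ms * 2, self.ceiling_ms)
        return self.timeout_ms

    def on_ack(self):
        """An incoming acknowledgment restores the initial timeout."""
        self.timeout_ms = self.initial_ms
        return self.timeout_ms
```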
A correct MPI program must call MPI_Finalize() before a group member process exits. As mentioned before, our protocol relies on this. Before a process leaves the group, it makes sure that it has received acknowledgments for all messages it has sent, and sends out acknowledgments for all packets it has received. If it has not received acknowledgments for sent-out packets after retransmitting a fixed number of times, the process exits with a warning message, and it is up to the user to make sure the application finished correctly. Assuming that message (and acknowledgment) losses rarely happen, we feel this two-way handshake release protocol is adequate to handle the three-army problem[2].
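The sender half of this release step might look roughly like the sketch below. The function `drain_before_exit`, its callbacks `poll_acks` and `retransmit`, and the default retry count are hypothetical; they stand in for whatever the ADI layer would do inside MPI_Finalize() and are not part of any MPI interface.

```python
import warnings


def drain_before_exit(unacked, poll_acks, retransmit, max_retries=5):
    """Wait for outstanding acks, retransmitting a bounded number of times.

    unacked    -- sequence numbers sent but not yet acknowledged
    poll_acks  -- hypothetical callback returning newly acknowledged numbers
    retransmit -- hypothetical callback that resends one packet
    Returns True if everything was acknowledged, False otherwise.
    """
    unacked = set(unacked)
    for attempt in range(max_retries + 1):
        unacked -= set(poll_acks())
        if not unacked:
            return True
        if attempt < max_retries:
            for seq in sorted(unacked):
                retransmit(seq)
    # Give up: warn, and leave it to the user to check the application result.
    warnings.warn(f"exiting with {len(unacked)} unacknowledged packet(s)")
    return False
```

The sketch covers only the wait for acknowledgments of sent packets; sending the final acknowledgments for received packets is omitted.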
In summary, we suggest using a reliable message stream with a "go-back-n"[2] window-based protocol. The send window is limited only by the size of the U-Net transmission queue, whereas at the receiver, a window size of one at the ADI level discards out-of-order packets. At the U-Net level, however, the receive window size is given by the size of the U-Net receive queue, such that packets can arrive in a burst without being dropped.
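For illustration, the sender side of a go-back-n window can be modeled as below. This is a generic textbook-style sketch under stated assumptions, not the implementation described here: `GoBackNSender`, `queue_size`, and the `transmit` callback are hypothetical names, and acknowledgments are assumed to be cumulative.

```python
from collections import deque


class GoBackNSender:
    """Hypothetical go-back-n sender bounded by the transmit queue size."""

    def __init__(self, queue_size):
        self.queue_size = queue_size  # assumed capacity of the transmission queue
        self.next_seq = 0
        self.window = deque()         # unacknowledged (seq, payload), oldest first

    def send(self, payload, transmit):
        """Assign a sequence number and transmit, or return None if the window is full."""
        if len(self.window) >= self.queue_size:
            return None
        seq = self.next_seq
        self.next_seq += 1
        self.window.append((seq, payload))
        transmit(seq, payload)
        return seq

    def on_ack(self, ack):
        """Cumulative ack: release every packet with sequence number <= ack."""
        while self.window and self.window[0][0] <= ack:
            self.window.popleft()

    def on_timeout(self, transmit):
        """Go back: resend all unacknowledged packets, oldest first."""
        for seq, payload in self.window:
            transmit(seq, payload)
```

Because the receiver accepts only the next expected packet, resending the whole outstanding window after a timeout is what restores order after a loss.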