To assess the performance of our implementation, we ran a ping-pong
benchmark between two hosts and measured the round-trip time for a
four-byte message (twenty bytes including the MPI
header). We found latencies of about 32 μs. Timing the raw
U-Net ping-pong benchmark revealed that the ADI layer is responsible
for about 11 of the 32 μs. A more detailed timing of our ADI code
was severely hampered by the lack of accurate timers. Nevertheless, it
appears that the overhead added by the reliability protocol is only
about 3-4 μs. Another 7-8 μs are spent searching the MPI posted
and unexpected queues, handling messages, and on overhead incurred
by the multiple-device support.
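
The measurement method described above can be illustrated with a minimal sketch. It is not our benchmark code: it assumes a POSIX system, replaces the two hosts and U-Net with a forked echo process on a local socket pair, and the names xfer and pingpong_mean_rtt_us are hypothetical. What it shows is the usual way around a coarse timer, namely timing many round trips of a four-byte message and dividing by the iteration count.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

enum { MSG_BYTES = 4 };   /* four-byte payload, as in the text */

/* Transfer exactly n bytes; returns 0 on success, -1 on error or EOF. */
static int xfer(int fd, char *buf, size_t n, int writing)
{
    size_t done = 0;
    while (done < n) {
        ssize_t r = writing ? write(fd, buf + done, n - done)
                            : read(fd, buf + done, n - done);
        if (r <= 0)
            return -1;
        done += (size_t)r;
    }
    return 0;
}

/* Run `iters` round trips between this process and a forked echo peer.
 * Returns the mean round-trip time in microseconds, or -1.0 on failure. */
double pingpong_mean_rtt_us(int iters)
{
    int sv[2];
    char msg[MSG_BYTES] = { 'p', 'i', 'n', 'g' };
    struct timespec t0, t1;

    if (iters <= 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1.0;
    }
    if (pid == 0) {               /* child: echo every message back */
        char buf[MSG_BYTES];
        close(sv[0]);
        while (xfer(sv[1], buf, MSG_BYTES, 0) == 0 &&
               xfer(sv[1], buf, MSG_BYTES, 1) == 0)
            ;
        _exit(0);
    }
    close(sv[1]);

    /* Time the whole loop once so that timer resolution is amortized
     * over all iterations instead of limiting each single sample. */
    int ok = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters && ok; i++)
        ok = xfer(sv[0], msg, MSG_BYTES, 1) == 0 &&
             xfer(sv[0], msg, MSG_BYTES, 0) == 0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(sv[0]);                 /* EOF makes the child leave its loop */
    waitpid(pid, NULL, 0);
    if (!ok)
        return -1.0;

    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return total_us / iters;
}
```

Calling pingpong_mean_rtt_us with a large iteration count from a main function yields a per-round-trip figure. Absolute numbers from a local socket pair are not comparable to the U-Net measurements reported here; only the timing method carries over.
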
Our first naive implementations showed a latency of 60 μs, which we
reduced to 32 μs with considerable effort. There may still be substantial
room for improvement by optimizing the queue searches and hand-coding
the most time-critical routines in assembly language.
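
The figures quoted in this section can be tallied in a few lines of C. This is only a bookkeeping sketch: the struct and function names are hypothetical, and the raw U-Net share is inferred here by subtracting the ADI share from the total rather than being a separately reported measurement.

```c
/* All values in microseconds, taken from the text above. */
struct latency_budget {
    double total;                 /* MPI ping-pong figure: about 32 */
    double adi;                   /* share attributed to the ADI layer: about 11 */
    double reliab_lo, reliab_hi;  /* reliability protocol: 3-4 */
    double other_lo, other_hi;    /* queue search, handling, multi-device: 7-8 */
};

const struct latency_budget reported = { 32.0, 11.0, 3.0, 4.0, 7.0, 8.0 };

/* Remainder left for raw U-Net once the ADI share is subtracted
 * (an inference from the text, not a quoted measurement). */
double raw_unet_us(const struct latency_budget *b)
{
    return b->total - b->adi;
}

/* Nonzero if the ADI total lies within the sum of its component ranges. */
int adi_breakdown_consistent(const struct latency_budget *b)
{
    double lo = b->reliab_lo + b->other_lo;
    double hi = b->reliab_hi + b->other_hi;
    return b->adi >= lo && b->adi <= hi;
}

/* Fractional reduction from an earlier latency to the current one. */
double reduction_fraction(double before_us, double after_us)
{
    return (before_us - after_us) / before_us;
}
```

With the reported values, the remainder attributed to raw U-Net is 21 μs, the component ranges sum to 10-12 μs and so bracket the 11 μs ADI figure, and the step from 60 μs to 32 μs is a reduction of 28 μs, a little under half.
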