Intel® MPI Benchmarks 3.2.4
Intel® MPI Benchmarks implements blocking and non-blocking modes of the IMB-IO benchmarks as different benchmark flavors. The Read and Write components of the blocking benchmark name are replaced for non-blocking flavors by IRead and IWrite, respectively.
The definitions of blocking and non-blocking flavors are identical, except for their behavior in regard to:
Aggregation. The non-blocking versions only run in the non-aggregate mode.
Synchronism. Only the meaning of an elementary transfer differs from the equivalent blocking benchmark.
Basically, an elementary transfer looks as follows:
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    Initiate transfer
    Exploit CPU
    Wait for the end of transfer
}
time = (MPI_Wtime()-time)/n_sample
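The loop above can be sketched as a small, MPI-free timing harness. This is only an illustration under stated assumptions: the type transfer_steps, the functions time_elementary_transfer, wall_seconds and count_step are hypothetical names that do not come from Intel® MPI Benchmarks, and the wall clock is read with the C11 timespec_get call instead of MPI_Wtime so that the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical bundle of the three steps of one elementary transfer. */
typedef struct {
    void (*initiate)(void *ctx);     /* start the non-blocking transfer   */
    void (*exploit_cpu)(void *ctx);  /* computation overlapped with I/O   */
    void (*wait_end)(void *ctx);     /* block until the transfer is done  */
    void *ctx;                       /* user data passed to every step    */
} transfer_steps;

/* Wall-clock time in seconds (stand-in for MPI_Wtime in this sketch). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Average time of one initiate / exploit / wait cycle over n_sample cycles. */
double time_elementary_transfer(const transfer_steps *s, int n_sample)
{
    assert(n_sample > 0);
    double t = wall_seconds();
    for (int i = 0; i < n_sample; i++) {
        s->initiate(s->ctx);
        s->exploit_cpu(s->ctx);
        s->wait_end(s->ctx);
    }
    return (wall_seconds() - t) / n_sample;
}

/* Stand-in step that only counts how often it was called. */
static void count_step(void *ctx)
{
    ++*(int *)ctx;
}
```

In an MPI program the three callbacks would typically wrap a non-blocking MPI-IO call such as MPI_File_iwrite, the computation to overlap, and MPI_Wait, with MPI_Wtime as the timer. The count_step stub is there only so the harness can be exercised without any I/O.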
The Exploit CPU step in the example above can be any computation. Intel® MPI Benchmarks implements it as described below.
Intel® MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The execution time of the function can be scaled through its arguments:
CPU_Exploit(float desired_time, int initialize);
The input value of desired_time determines the time for the function to execute the kernel loop, with a slight variance. At the very beginning, the function is called with initialize=1 and an input value for desired_time. This determines an Mflop/s rate and a timing t_CPU, as close as possible to desired_time, obtained by running without any obstruction. During the actual benchmarking, CPU_Exploit is called with initialize=0, concurrently with the particular I/O action, and always performs the same type and number of operations as in the initialization step.
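The calibrate-then-replay behavior described above can be sketched as follows. This is a minimal illustration, not the Intel® MPI Benchmarks source: the name cpu_exploit_sketch, the globals reps, t_cpu and mflops, the doubling strategy used for calibration, the use of clock() as the timer, and the assumption that desired_time is given in seconds are all choices made for this example.

```c
#include <assert.h>
#include <time.h>

#define N 100                    /* matrix dimension stated in the text */

static float A[N][N], x[N], y[N];
static volatile float sink;      /* keeps the kernel result observable */
static long   reps   = 0;        /* kernel repetitions fixed at initialization */
static double t_cpu  = 0.0;      /* measured time for reps repetitions, seconds */
static double mflops = 0.0;      /* rate derived from t_cpu */

/* One kernel pass: y = A * x, about 2*N*N floating-point operations. */
static void kernel(void)
{
    for (int i = 0; i < N; i++) {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
    sink = y[N - 1];
}

/* Run the kernel n times and return the elapsed CPU time in seconds. */
static double run_kernel(long n)
{
    clock_t t0 = clock();
    for (long r = 0; r < n; r++)
        kernel();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

void cpu_exploit_sketch(float desired_time, int initialize)
{
    if (!initialize) {
        assert(reps > 0);        /* initialization must have run first */
        run_kernel(reps);        /* same type and number of operations */
        return;
    }
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0f;
    }
    /* Double the repetition count until a run lasts at least desired_time. */
    long n = 1;
    double t = run_kernel(n);
    while (t < desired_time && n < (1L << 30)) {
        n *= 2;
        t = run_kernel(n);
    }
    /* Scale back towards the target, then measure the final setting. */
    reps = (t > 0.0) ? (long)((double)n * desired_time / t) : n;
    if (reps < 1)
        reps = 1;
    t_cpu  = run_kernel(reps);
    mflops = (t_cpu > 0.0) ? 2.0 * N * N * (double)reps / t_cpu * 1e-6 : 0.0;
}
```

A caller would invoke cpu_exploit_sketch(desired_time, 1) once before benchmarking and cpu_exploit_sketch(desired_time, 0) between initiating and completing each transfer. An optimizing compiler may simplify such a repeated kernel, so a production version would need stronger safeguards than the volatile sink used here.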
Three timings are crucial for interpreting the behavior of non-blocking I/O overlapped with CPU exploitation:
t_pure is the time for the corresponding pure blocking I/O action, not overlapped with CPU activity
t_CPU is the time the CPU_Exploit periods (running concurrently with non-blocking I/O) would take when running dedicated, that is, without concurrent I/O
t_ovrl is the time for the analogous non-blocking I/O action, concurrent with CPU activity (which takes t_CPU when running dedicated)
A perfect overlap means: t_ovrl = max(t_pure,t_CPU)
No overlap means: t_ovrl = t_pure+t_CPU.
The actual amount of overlap is:
overlap = (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)    (*)
The Intel® MPI Benchmarks result tables report the timings t_ovrl, t_pure, t_CPU and the estimated overlap obtained by the (*) formula above. At the beginning of a run, the Mflop/s rate corresponding to t_CPU is displayed.
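The (*) formula can be checked against the two limiting cases with a few lines of C. The function name overlap_estimate is hypothetical and used only for this sketch, and the requirement that both t_pure and t_CPU are positive is an assumption added to avoid a division by zero.

```c
#include <assert.h>

/* Estimated overlap according to the (*) formula; all times in the same unit. */
double overlap_estimate(double t_pure, double t_cpu, double t_ovrl)
{
    double t_min = (t_pure < t_cpu) ? t_pure : t_cpu;
    assert(t_min > 0.0);         /* assumption: both timings are positive */
    return (t_pure + t_cpu - t_ovrl) / t_min;
}
```

With t_pure = 2 and t_CPU = 1, a perfect overlap (t_ovrl = max(2, 1) = 2) gives an overlap of 1, and no overlap (t_ovrl = 2 + 1 = 3) gives an overlap of 0. Measured values normally fall between these two limits.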