
Why is the result of KI slower than the result of SA? #3

Open
Hannah-xxl opened this issue Apr 3, 2018 · 2 comments

@Hannah-xxl

Hannah-xxl commented Apr 3, 2018

As described in figure 8 of the article "Offloading communication control logic in GPU accelerated applications", the KI model is faster than the SA model. But when I run the libmp benchmark mp_pingpong_all on my Ubuntu machine with a P4 GPU and an mlx5 NIC, I get a result showing KI at almost double the latency of SA. So I wonder whether the results in the article were measured with a benchmark other than the one in libmp? If so, what test samples does the article use?

@e-ago
Collaborator

e-ago commented Apr 4, 2018

In that paper we used a very simple pingpong, which we reworked into the more complete and realistic mp_pingpong_all for the GTC conference.
I'm not sure about your command line, but running mp_pingpong_all with a fixed_time() function of about 5 µs (env var KERNEL_TIME=5) on two Broadwell nodes with Tesla P100 GPUs (Pascal architecture), Mellanox OFED 4.2, and NVIDIA display driver 396, the performance is:
[image: pp_brdw — Broadwell/P100 pingpong latency chart]

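For reference, the run described above can be sketched roughly as follows. This is a hypothetical invocation: the hostnames, the `-np`/`-host` flags, and the binary path are assumptions, not the exact command used for the numbers above; only the `KERNEL_TIME` environment variable is taken from the discussion.

```shell
# Hypothetical sketch of launching the libmp pingpong benchmark.
# KERNEL_TIME sets the duration (in microseconds) of the simulated
# GPU work performed by fixed_time(); 0 means send/recv only.
# Hostnames and mpirun flags below are illustrative examples.
KERNEL_TIME=5 mpirun -np 2 -host node0,node1 ./mp_pingpong_all
```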
On two IvyBridge nodes with Tesla K40m GPUs (Kepler architecture), Mellanox OFED 4.2, and NVIDIA display driver 396, the performance is:
[image: pp_ivy — IvyBridge/K40m pingpong latency chart]

As you may notice, KI latency is always <= SA latency.

Without the work simulated by the fixed_time() function (i.e. KERNEL_TIME=0, send/recv only), the SA/KI pingpong is always slower than the MPI pingpong, because you are introducing the additional overhead of the GPU without hiding any kernel launch latency.
As we discussed during our GTC talk, GPUDirect Async is not about faster communications; it is about moving control of communications from the CPU to the GPU. For this reason, Async can yield a performance gain if, in your MPI application, the GPU would otherwise sit idle waiting for CPU work.

In any case, a single pingpong with only 2 processes is not the best example for taking advantage of the Kernel-Initiated features.

@Hannah-xxl
Author

I didn't specify any parameters or environment variables for mp_pingpong_all; it uses KERNEL_TIME=20 by default, with 1 P4 GPU per process on my server.
I set KERNEL_TIME=5 as you suggested, but the result for KI (not KI+Kern) is slower than the result for Async (not Async+Kern). The KI+Kern result is almost the same as Async+Kern when KERNEL_TIME=5. But with KERNEL_TIME=0, the Async+Kern mode is much faster.
