
Why is the result of KI slower than the result of SA? #3

Open
Hannah-xxl opened this issue Apr 3, 2018 · 2 comments

@Hannah-xxl

Hannah-xxl commented Apr 3, 2018

As described in figure 8 of the article "Offloading communication control logic in GPU accelerated applications", the KI model is faster than the SA model. But when I run the libmp benchmark mp_pingpong_all on my Ubuntu machine with a P4 GPU and an mlx5 NIC, I get a result showing KI at almost double the latency of SA. So I wonder whether the results in the article were measured with a benchmark other than the one in libmp? If so, what test samples does the article use?

@e-ago
Collaborator

e-ago commented Apr 4, 2018

In that paper we used a very simple pingpong, which we reworked into the more complete and realistic mp_pingpong_all for the GTC conference.
I'm not sure about your command line, but running mp_pingpong_all with a fixed_time() function of about 5 µs (env var KERNEL_TIME=5) on two Broadwell nodes with Tesla P100 GPUs (Pascal architecture), Mellanox OFED 4.2, and NVIDIA display driver 396, the performance is:
[image: pp_brdw — Broadwell/P100 pingpong latency chart]

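For reference, the run described above can be sketched roughly as follows. This is a hypothetical invocation: the hostnames, the `-np`/`-host` flags, and the binary path are assumptions, not the exact command used for the numbers above; only the `KERNEL_TIME` environment variable is taken from the discussion.

```shell
# Hypothetical sketch of launching the libmp pingpong benchmark.
# KERNEL_TIME sets the duration (in microseconds) of the simulated
# GPU work performed by fixed_time(); 0 means send/recv only.
# Hostnames and mpirun flags below are illustrative examples.
KERNEL_TIME=5 mpirun -np 2 -host node0,node1 ./mp_pingpong_all
```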
On two IvyBridge nodes with Tesla K40m GPUs (Kepler architecture), Mellanox OFED 4.2, and NVIDIA display driver 396, the performance is:
[image: pp_ivy — IvyBridge/K40m pingpong latency chart]

As you may notice, KI latency is always <= SA latency.

Without the work simulated by the fixed_time() function (i.e. KERNEL_TIME=0, send/recv only), the SA/KI pingpong is always slower than the MPI pingpong, because you are introducing the additional overhead of the GPU without hiding any kernel launch latency.
As we discussed during our GTC talk, GPUDirect Async is not about faster communications; it is about moving control of communications from the CPU to the GPU. For this reason, Async can yield a performance gain if, in your MPI application, the GPU would otherwise sit idle waiting for CPU work.

In any case, a single pingpong with only 2 processes is not the best example for taking advantage of the Kernel-Initiated features.

@Hannah-xxl
Author

I didn't specify any parameters or environment variables for mp_pingpong_all; it uses KERNEL_TIME=20 by default, with 1 P4 GPU per process on my server.
I set KERNEL_TIME=5 as you suggested, but the result for KI (not KI+Kern) is slower than the result for Async (not Async+Kern). The KI+Kern result is almost the same as Async+Kern when KERNEL_TIME=5. But with KERNEL_TIME=0, the Async+Kern mode is much faster.
