why exclude CUDA write lat test? #276

yangrudan · 2024-08-07T06:11:06Z

Can you explain the reason for excluding the CUDA write latency test? The RDMA write latency test, by contrast, works fine.

Line 1748 in 0e0e252

    
           fprintf(stderr,"Perftest supports CUDA latency tests with read/send verbs only\n");

drossetti · 2024-08-07T17:10:54Z

For send_lat, the CPU can poll on the CQ irrespective of where the RX buffer is placed.
For write_lat, however, plain RDMA_WRITE generates no CQEs on the RX side, and the CPU cannot (officially) poll on the received data because it sits in CUDA device memory.
Considering #230, we could look at enabling CUDA when RDMA_WRITE_WITH_IMM is selected.
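The reasoning above can be summarized as a small decision table. The following is only a sketch, not perftest code: every enum and function name is hypothetical, and it assumes standard verbs semantics (SEND and RDMA_WRITE_WITH_IMM consume a receive work request and produce a CQE on the receiving side, plain RDMA_WRITE does not, and READ completes on the requester only).

```c
/* Hypothetical names for illustration only; none of these exist in perftest. */
enum verb { VERB_SEND, VERB_READ, VERB_WRITE, VERB_WRITE_WITH_IMM };
enum rx_placement { RX_HOST_MEMORY, RX_CUDA_DEVICE_MEMORY };

enum rx_detection {
    DETECT_NOT_NEEDED,   /* READ: only the requester sees a completion */
    DETECT_POLL_CQ,      /* a CQE arrives on the RX side */
    DETECT_POLL_BUFFER,  /* CPU spins on the received data itself */
    DETECT_UNSUPPORTED   /* no CQE, and the CPU cannot spin on the data */
};

/* How the receiving CPU can learn that a message has landed. */
static enum rx_detection rx_detection_for(enum verb v, enum rx_placement p)
{
    switch (v) {
    case VERB_SEND:
    case VERB_WRITE_WITH_IMM:
        /* The CQ lives in host memory, so buffer placement does not matter. */
        return DETECT_POLL_CQ;
    case VERB_READ:
        return DETECT_NOT_NEEDED;
    case VERB_WRITE:
        /* No RX-side CQE: the only signal is the data in the RX buffer. */
        return p == RX_HOST_MEMORY ? DETECT_POLL_BUFFER : DETECT_UNSUPPORTED;
    }
    return DETECT_UNSUPPORTED;
}
```

Under these assumptions, `rx_detection_for(VERB_WRITE, RX_CUDA_DEVICE_MEMORY)` is the only combination with no detection path, which matches the case the error message rejects.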

yangrudan · 2024-08-08T06:11:46Z

Thanks for your answer, @drossetti.

By the way,

is it expected that the CUDA write latency is about double the RDMA write latency for small payloads?
If this is normal, why is there such a difference between CUDA and RDMA?

yangrudan · 2024-12-20T09:14:04Z

Hi, when I enable CUDA with RDMA_WRITE_WITH_IMM, I still get this warning.
Can you explain why?

(base) root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/cq/perftest# ./ib_write_lat -d  mlx5_cx6_0 -a -F --report_gbit --write_with_imm --use_cuda=0
---------------------------------------------------------------------------------------
Perftest supports CUDA latency tests with read/send verbs only

By the way, the same test runs fine with host (CPU) memory:

(base) root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/cq/perftest# ./ib_write_lat -d  mlx5_cx6_0 -a -F --report_gbit --write_with_imm

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write_imm Latency Test
 Dual-port       : OFF          Device         : mlx5_cx6_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: OFF          Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 RX depth        : 512
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 220[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x27e6 PSN 0x195edb
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:16:03:16
 remote address: LID 0000 QPN 0x3c99 PSN 0x45ccc2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:16:04:04
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          5.49           12.27        5.69                5.65             0.00            5.81                    12.27  
 4       1000          5.49           8.09         5.55                5.55             0.03            5.60                    8.09   
 8       1000          5.51           7.09         5.55                5.55             0.03            5.61                    7.09   
 16      1000          5.50           8.91         5.55                5.56             0.00            5.61                    8.91   
 32      1000          5.53           16.20        5.59                5.59             0.00            5.64                    16.20  
 64      1000          5.55           6.95         5.60                5.60             0.00            5.66                    6.95   
 128     1000          5.59           6.57         5.65                5.65             0.00            5.70                    6.57   
 256     1000          6.26           7.41         6.33                6.33             0.00            6.51                    7.41   
 512     1000          6.35           8.46         6.41                6.43             0.05            6.64                    8.46   
 1024    1000          6.45           8.25         6.51                6.52             0.03            6.72                    8.25   
 2048    1000          6.63           8.59         6.70                6.72             0.00            6.89                    8.59   
 4096    1000          6.99           9.31         7.15                7.14             0.03            7.29                    9.31   
 8192    1000          7.18           10.01        7.35                7.35             0.03            7.54                    10.01  
 16384   1000          7.58           8.82         7.80                7.80             0.00            7.84                    8.82   
 32768   1000          8.20           10.46        8.45                8.44             0.00            8.67                    10.46  
 65536   1000          9.61           11.63        9.75                9.75             0.00            9.83                    11.63  
 131072  1000          12.27          15.31        12.48               12.47            0.00            12.64                   15.31

mrgolin · 2024-12-22T11:00:33Z

We can use ctx->memory->copy_buffer_to_host for data polling in run_iter_lat_write(). It should already point to the right implementations for reading from HBM.
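A rough sketch of what such data polling could look like follows. It is a host-only mock, not the perftest implementation: `struct mock_memory`, the assumed `copy_buffer_to_host` signature, and `poll_flag_via_copy` are all hypothetical, and `memcpy` stands in for whatever device-to-host copy the real memory backend provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the memory abstraction; the signature is assumed. */
struct mock_memory {
    void *(*copy_buffer_to_host)(void *dest, const void *src, size_t size);
};

/* Mock backend: a real CUDA backend would copy from device memory instead. */
static void *mock_copy_buffer_to_host(void *dest, const void *src, size_t size)
{
    return memcpy(dest, src, size);
}

/*
 * Poll one flag byte of the RX buffer through a host staging variable
 * instead of dereferencing the buffer directly.
 * Returns the number of copies performed, or -1 if max_polls was exhausted.
 */
static long poll_flag_via_copy(const struct mock_memory *mem,
                               const uint8_t *flag_in_buffer,
                               uint8_t expected, long max_polls)
{
    uint8_t staged = 0;

    for (long polls = 1; polls <= max_polls; polls++) {
        mem->copy_buffer_to_host(&staged, flag_in_buffer, sizeof(staged));
        if (staged == expected)
            return polls;
    }
    return -1;
}
```

With a real device backend, each poll would pay for a device-to-host copy, so the measured latency would likely include that overhead rather than the wire latency alone.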

yangrudan · 2024-12-22T12:51:14Z

> We can use ctx->memory->copy_buffer_to_host for data polling in run_iter_lat_write(). It should already point to the right implementations for reading from HBM.

Thanks for your reply. I am not sure ctx->memory->copy_buffer_to_host is suitable for a write latency test. (I don't quite understand it. Can you explain more?)

Actually, my question is: why can't CUDA run run_iter_lat_write_imm() to test write latency?

yangrudan closed this as completed Aug 29, 2024

yangrudan reopened this Dec 20, 2024

yangrudan closed this as completed Dec 20, 2024

yangrudan reopened this Dec 20, 2024
