Iperf3: connection is getting closed from the client side unexpectedly #1756

Open
Ankita13-code opened this issue Sep 3, 2024 · 9 comments

@Ankita13-code

Ankita13-code commented Sep 3, 2024

NOTE: The iperf3 issue tracker is for registering bugs, enhancement
requests, or submissions of code. It is not a means for asking
questions about building or using iperf3. Those are best directed
towards the Discussions section for this project at
https://github.com/esnet/iperf/discussions
or to the iperf3 mailing list.
A list of frequently-asked questions
regarding iperf3 can be found at http://software.es.net/iperf/faq.html.

Context

iperf3 commands are failing with the following error: the client has unexpectedly closed the connection.

  • Version of iperf3: 3.17.1

  • Hardware:

  • Operating system (and distribution, if any): Azure Linux/CBL-Mariner and SUSE Linux

Please note: iperf3 is supported on Linux, FreeBSD, and macOS.
Support may be provided on a best-effort basis to other UNIX-like
platforms. We cannot provide support for building and/or running
iperf3 on Windows, iOS, or Android.

  • Other relevant information (for example, non-default compilers,
    libraries, cross-compiling, etc.):

Please fill out one of the "Bug Report" or "Enhancement Request"
sections, as appropriate. Note that submissions of bug fixes, new
features, etc. should be done as a pull request at
https://github.com/esnet/iperf/pulls

Bug Report

  • Expected Behavior
    The LISA test case perf_udp_iperf_sriov uses iperf3 to test the network throughput between two Azure Linux VMs. This test should pass for Azure Linux VMs.
  • Actual Behavior
    The test is intermittently failing with the following error - error: the client has unexpectedly closed the connection.
  • Steps to Reproduce
    -- Spin up 8 iperf3 servers in parallel (we are using ports 750-757 here): sudo iperf3 -s -1 -J -i10 -f g -p <port>
    -- Now run 8 clients in parallel with the following parameters: sudo iperf3 -t 10 -c <server_ip> -u -J -i1 -f g -p <port> -P 64 -l 8192 -4

When these commands run for the last port (757 in this case) the error occurs.
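For anyone trying to reproduce this outside LISA, the steps above can be sketched roughly as follows. This is only a minimal sketch: the helper names (`server_cmd`, `client_cmd`, `run_parallel`) are made up for illustration and are not part of iperf3 or LISA, and the servers are assumed to be started on the server VM before the clients are started on the client VM.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

PORTS = range(750, 758)  # ports 750-757, as in the report


def server_cmd(port):
    # One-shot (-1) server with JSON output, mirroring the reported command.
    return ["sudo", "iperf3", "-s", "-1", "-J", "-i10", "-f", "g", "-p", str(port)]


def client_cmd(server_ip, port):
    # UDP client with 64 parallel streams and 8192-byte datagrams.
    return ["sudo", "iperf3", "-t", "10", "-c", server_ip, "-u", "-J", "-i1",
            "-f", "g", "-p", str(port), "-P", "64", "-l", "8192", "-4"]


def run_parallel(commands):
    # Run all commands at the same time; each worker thread drains the
    # output of its own process and records the exit code.
    def run(cmd):
        done = subprocess.run(cmd, capture_output=True, text=True)
        return {"cmd": cmd, "exit_code": done.returncode,
                "stdout": done.stdout, "stderr": done.stderr}

    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run, commands))
```

On the server VM one would call `run_parallel([server_cmd(p) for p in PORTS])`, and on the client VM `run_parallel([client_cmd("<server_ip>", p) for p in PORTS])`, then look for non-zero exit codes.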

  • Possible Solution

Enhancement Request

  • Current behavior

  • Desired behavior

  • Implementation notes

@davidBar-On
Contributor

The test is intermittently failing with the following error - error: the client has unexpectedly closed the connection

This error means that the client terminated before the normal end of the test, or that the network connection to the client was lost for some reason, etc. Therefore, please provide the client's output messages so that the cause of the problem can be better understood.

@Ankita13-code
Author

Ankita13-code commented Sep 4, 2024

Hi @davidBar-On,
here are the logs for both server as well as client - logs.zip

@davidBar-On
Contributor

It seems that the problem is that the server "received" the client's close of the control channel before the "done" message (IPERF_DONE) sent by the client arrived at the server. Since this is not really an error, I submitted PR #1765, which does not issue an error in this case (it only prints a warning message).
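To illustrate the idea described above, here is a toy model of the decision, not iperf3 code. The function name, the `Outcome` enum, and the two boolean inputs are assumptions invented for this sketch; the actual change is in the PR itself.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"


def classify_control_close(test_ended: bool, done_received: bool) -> Outcome:
    """Toy model: how a server might treat a closed control channel."""
    if done_received:
        # Normal shutdown: the "done" message arrived before the close.
        return Outcome.OK
    if test_ended:
        # The close overtook the "done" message after the test had ended,
        # so this is reported as a warning rather than an error.
        return Outcome.WARNING
    # The client went away in the middle of the test.
    return Outcome.ERROR
```

In this model only a close that happens before the test has ended would still be reported as "the client has unexpectedly closed the connection".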

@Ankita13-code, can you try building using the PR's code to make sure it solves the problem?

@Ankita13-code
Author

Ankita13-code commented Sep 26, 2024

Hi @davidBar-On, I tried building using the PR's code, but it did not solve the problem. Rather, the race condition was hit more frequently in this case. Here are the logs - logs - iperf.zip

@davidBar-On
Contributor

davidBar-On commented Sep 27, 2024

Hi @Ankita13-code, I now see that PR #1765 is probably not relevant, as it handles failure at the end of the test, while it seems that the problem is while connecting the streams at the beginning of the test.

Comparing the cookies of the two failed servers to the cookies sent by the clients, I cannot find the matching clients. You can see that the last client's cookie is uce67qp2fbr7tey366hbbs3mmvum7ncqjb6k, and both the client and server for this test completed successfully. However, the client log does not include the log for the two failed server tests - cookies fj2p5c6ouayhve66zknl24wcw6mwupik7yhg and cfqzkwvp62awbs4cvosxs4exvmckcqumhab4.

Can you try to identify which clients performed the failed tests, and send the server and client logs for the failed test(s)?
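The cookie comparison can be automated along these lines. This is a rough sketch that assumes the `-J` output of both sides carries the cookie under `start.cookie`, and that the logs have already been read into dictionaries keyed by a file name; the function names are made up for illustration.

```python
import json


def cookie_of(json_text):
    # Returns the cookie from one iperf3 JSON log, or None if the text is
    # not valid JSON or has no start.cookie field (assumed layout).
    try:
        doc = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    start = doc.get("start")
    return start.get("cookie") if isinstance(start, dict) else None


def unmatched_server_cookies(server_logs, client_logs):
    # Maps each server log name to its cookie when no client log has the
    # same cookie. Server logs without a readable cookie are also listed,
    # with a value of None.
    client_cookies = {cookie_of(text) for text in client_logs.values()} - {None}
    return {name: cookie_of(text)
            for name, text in server_logs.items()
            if cookie_of(text) not in client_cookies}
```

Any server log returned by `unmatched_server_cookies` points to a test whose client output is missing from the collected client logs.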

@Ankita13-code
Author

Hi @davidBar-On. I reran the test and found that there are 2 extra cookies in the server logs which are absent in the client logs as you mentioned. These are the clients where the test failed -

iperf3 -t 10 -c 10.0.0.4 -u -J -i1 -f g -p 754 -P 64 -l 8192 -4
iperf3 -t 10 -c 10.0.0.4 -u -J -i1 -f g -p 760 -P 64 -l 8192 -4

The client logs for both of them are as follows -
client-760:
warning: Report format (-f) flag ignored with JSON output (-J)
warning: UDP block size 8192 exceeds TCP MSS 1398, may result in fragmentation / drops
execution time: 36.453 sec, exit code: -1

client-754:
warning: Report format (-f) flag ignored with JSON output (-J)
warning: UDP block size 8192 exceeds TCP MSS 1398, may result in fragmentation / drops
execution time: 57.853 sec, exit code: -1

The server logs for these were as follows -
server-760.txt
server-754.txt

It seems like the test is getting closed from the client side quite early compared to the server side. For example, for port 760 the client side closes at about 36 sec, while the server side closes at 76 sec.

NOTE: There are no other logs present for the clients corresponding to these servers.

@davidBar-On
Contributor

Hi, the first logs you sent show that the server and client are run by some "lisa" tool/script. However, it seems that these two "missing" clients were not run through "lisa": -l is 8192 instead of the 1024 in the original script, and there are no logs. Does this mean that these clients were run outside of the "lisa" context, or even in parallel to "lisa" tests?

Also, if there are no client logs, where did you get the information about them that you included in the comment? Where do you see that "... the client side closes at almost 36 sec ..."?

Regarding the client running for 36 seconds: the test is only for 10 seconds (-t 10), so how can it be that the client ran for 36 seconds?

In general, it seems that the problem is in the client. My guess is that it is caused by overloading the system with many parallel streams (and parallel tests?), and that it is not an iperf3 issue. In any case, the logs for the failed clients are required for further analysis. It would be helpful if these clients could be run with the options -V --debug=3.
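As a rough sketch of how such a debug run could be captured for one failing port: the code below takes the client command from the earlier comment, appends `-V --debug=3`, and writes everything the client prints to a log file together with the exit code and wall-clock time. The helper names and the log file naming are assumptions for illustration only.

```python
import subprocess
import time


def debug_client_cmd(server_ip, port):
    # Same options as the failing client, plus verbose and debug output.
    return ["iperf3", "-t", "10", "-c", server_ip, "-u", "-J", "-i1",
            "-f", "g", "-p", str(port), "-P", "64", "-l", "8192", "-4",
            "-V", "--debug=3"]


def run_and_log(cmd, log_path):
    # Sends stdout and stderr to one file so warnings and debug messages
    # stay in order, and returns (exit_code, elapsed_seconds).
    start = time.monotonic()
    with open(log_path, "w") as log:
        done = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return done.returncode, time.monotonic() - start
```

For example, `run_and_log(debug_client_cmd("10.0.0.4", 754), "client-754-debug.log")` would produce one log file per failing port that can be attached here.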

@Ankita13-code
Author

Hi @davidBar-On,
It is true that the servers and clients are being run using the LISA tool. The possibility that the 2 missing clients were not run by LISA seems low, since this is a complete test run by LISA in which it tests the system by opening multiple ports and increasing the -l parameter progressively.

As I already mentioned in my previous comment, the only client output we have in the logs is the following -
client-760:
warning: Report format (-f) flag ignored with JSON output (-J)
warning: UDP block size 8192 exceeds TCP MSS 1398, may result in fragmentation / drops
execution time: 36.453 sec, exit code: -1

client-754:
warning: Report format (-f) flag ignored with JSON output (-J)
warning: UDP block size 8192 exceeds TCP MSS 1398, may result in fragmentation / drops
execution time: 57.853 sec, exit code: -1

I quoted the execution time of these 2 commands as the time the client runs for. Since for the client 760 the execution time is 36 sec as shown above, I quoted that number. However, if you check the logs for server 760, the execution time seems to be around 76 sec.

I'll try to run these with the debug parameter as suggested. Also, a thing to note is, these tests started failing mainly 3 months ago and were passing before that. Even now the failure is sort of intermittent.

@davidBar-On
Contributor

davidBar-On commented Nov 11, 2024

Hi @Ankita13-code,

I'll try to run these with the debug parameter as suggested.

That should be helpful. Note that the debug messages add some overhead, which reduces performance/throughput and may cause issues for tests with many parallel streams.

Also, a thing to note is, these tests started failing mainly 3 months ago and were passing before that. Even now the failure is sort of intermittent.

Do you know what changes were made about 3 months ago that could cause this problem? E.g. iperf3 or LISA version change, network architecture change, etc.

Checking the LISA related code, I see that iperf3.py and other files were changed on July 15 by Commit 595b040 which "add use shell, to make sure tool path is available". The actual use of the shell parameter seems to be here in process.py.

I suggest checking whether the added sh -c caused the problem, as the added shell per command adds some overhead. Therefore, if possible, run the test with a LISA version prior to July 15, to see if the issue also exists without the added sh -c.
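As a very rough way to get a feeling for the cost of the extra shell, one could time launching a trivial command directly and through `sh -c`. This is only a sketch under assumptions: it uses the Python interpreter as a stand-in command, requires `sh` on the PATH, and does not go through LISA's process.py, so it measures startup cost only and not what LISA actually does.

```python
import shlex
import subprocess
import sys
import time


def mean_launch_time(cmd, runs=20):
    # Average wall-clock time to start the command and wait for it to exit.
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs


def shell_overhead(runs=20):
    # Stand-in command: a Python interpreter that exits immediately.
    direct = [sys.executable, "-c", "pass"]
    # The same command wrapped in "sh -c", with the arguments shell-quoted.
    wrapped = ["sh", "-c", shlex.join(direct)]
    return {"direct": mean_launch_time(direct, runs),
            "via_sh": mean_launch_time(wrapped, runs)}
```

The difference between the two values is expected to be small and noisy, so this can only hint at whether the wrapper matters when many clients are started at once; running the test with the older LISA version remains the more reliable check.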
