-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the reliability of rosbag2 tests #1796
Conversation
Pulls: #1796 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm, but one comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, @MichaelOrlov can you have a look too? (i do not have maintainer access on rosbag2)
I will take a look. Anyway, hold off from the merging this for a while. We are trying to make a relase on rolling #1791 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clalancette Thanks for trying to make rosbag2 tests less flaky.
I have a concern about adding Wait for topics to be discovered commit and would like to request to revert it along with the 02d5bba
The reason for removal request is because it is really redundant IMO and not robust.
- Redundant because in all places we already have either
ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name);
orASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready());
. Which is making sure that recorder have already been subscribed to the topics on which we are going to publish. - Not robust because in those newly added wait condition - there is no check for the topic names on which recorder subscribed. There is only check for the number of subscriptions. And had assumption that
// /rosout
and// /events/write_split
topics will always be present and counted, which is not necessarily may be true for all CI configurations and underlying middleware layers.
If aforementioned ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name);
and ASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready());
checks are not enough we have a better way to do a wait for the recorder discovery. In some unit tests we use MockRecorder
with extended utility methods to facilitate waiting functionality.
For instance:
ASSERT_TRUE(recorder->wait_for_topic_to_be_discovered(string_topic));
ASSERT_FALSE(recorder->topic_available_for_recording(string_topic));
Please see implementation details in the
rosbag2/rosbag2_transport/test/rosbag2_transport/mock_recorder.hpp
Lines 41 to 73 in 88c51a1
template<typename DurationRepT = int64_t, typename DurationT = std::milli> | |
bool wait_for_topic_to_be_discovered( | |
const std::string & topic_name_to_wait_for, | |
std::chrono::duration<DurationRepT, DurationT> timeout = std::chrono::seconds(10)) | |
{ | |
bool discovered = false; | |
using clock = std::chrono::steady_clock; | |
auto start = clock::now(); | |
do { | |
auto topic_names_and_types = this->get_topic_names_and_types(); | |
for (const auto &[topic_name, topic_types] : topic_names_and_types) { | |
if (topic_name_to_wait_for == topic_name) { | |
discovered = true; | |
break; | |
} | |
} | |
std::this_thread::sleep_for(std::chrono::milliseconds(20)); | |
} while (!discovered && (clock::now() - start) < timeout); | |
return discovered; | |
} | |
bool topic_available_for_recording(const std::string & topic_name) | |
{ | |
bool available_for_recording = false; | |
auto topics_to_subscribe = this->get_requested_or_available_topics(); | |
for (const auto & topic_and_type : topics_to_subscribe) { | |
if (topic_and_type.first == topic_name) { | |
available_for_recording = true; | |
break; | |
} | |
} | |
return available_for_recording; | |
} |
rosbag2_transport/test/rosbag2_transport/record_integration_fixture.hpp
Outdated
Show resolved
Hide resolved
rosbag2_tests/test/rosbag2_tests/test_rosbag2_cpp_get_service_info.cpp
Outdated
Show resolved
Hide resolved
This has almost exactly the same functionality as wait_for_condition, except for two things: 1. It is templated on the Timeout type. 2. It calls rclcpp::shutdown after the loop completes. However, neither of those is necessary; all callers to it use a std::chrono::duration, and all of the test fixtures already call rclcpp::shutdown. Thus, just remove it and make all callers use wait_for_condition instead. Signed-off-by: Chris Lalancette <[email protected]>
That is, we really don't actually want to do a full rclcpp shutdown here; we only want to stop spinning. Accomplish that with an executor, and timing out every 100 milliseconds to check if we are done yet. Signed-off-by: Chris Lalancette <[email protected]>
Make sure it only spins as long as we haven't shutdown, and that it wakes up every so often to check that fact. Signed-off-by: Chris Lalancette <[email protected]>
The main reason for that is that these tests generally want to test certain expectations around how many messages were received. However, if discovery takes longer than we expect, then it could be the case that we "missed" messages at the beginning because discovery hadn't yet completed. Fix this by just waiting around for the recorder to get all the subscriptions it expects before moving on with the test. Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
The problem is that this is not always true. There are at least two situations this does not cover: for the event topic associated with services (
Additionally, That's why adding in the additional waits for recorder subscriptions makes the tests more reliable, since it handles both of these situations.
But fair enough, I'll take a look at switching to this. |
Signed-off-by: Chris Lalancette <[email protected]>
02d5bba
to
11bb864
Compare
All right, see 11bb864, which uses the MockRecorder methods instead. |
Signed-off-by: Chris Lalancette <[email protected]>
Pulls: #1796 |
Signed-off-by: Chris Lalancette <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, waiting for "/events/write_split" is redundant since it is an internal service in the rosbag2_recorder and it is not participating in the test scenarios where it was added.
Also found a few nitpicks.
Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now.
Thanks!
Pulls: #1796 |
I'm curious what a "surprise" failure with rmw_connextdds on CI is. |
Yeah, I don't know what is going on with that. The only change to that test is to wait for the clock topic to show up. I'll need to look deeper into it, though, since that seems to be a new failure with this PR. |
That's because there is currently at least one bug associated with spin_some in rclcpp. However, it turns out that we don't even need to use it, as we can just as easily use spin() along with exec.cancel(). Signed-off-by: Chris Lalancette <[email protected]>
Pulls: #1796 |
The only failures here are on Windows, and that has nothing to do with this PR (those will be fixed by ros2/common_interfaces#256). So I think this is ready for final review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clalancette Curious what was behind decision to move explict call for the stop_spinning()
to the TearDown()
method?
I would prefer to have explicit call to the stop_spinning()
in the tests where it is needed in pair with start_spin(node)
or start_async_spin(node)
with
auto cleanup_process_handle = rcpputils::make_scope_exit([&]() {stop_spinning();});
to have a clean and concise logic in tests.
In some tests we have some checks after explictly calling stop_spinning()
which could rely on the assumption that some handles (node and files) will be free and nothing else will arrive during those extra checks or cleanups.
IMO adding stop_spinning()
to the TearDown()
method is a last resort and a sort of hacky way to handle it and could potentially cause another flakiness in tests in the future.
So the first thing to realize is that because we are putting the spin call in a thread, we have to call stop_spinning() (which calls thread.join()), otherwise we get an exception from the With that in mind, I could have used I have a prototype locally where I did exactly what you said and added in new scopes, with the |
Signed-off-by: Chris Lalancette <[email protected]>
As something of a compromise, 404c78a adds in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clalancette Thanks for adding a safeguard catcher auto cleanup_process_handle = rcpputils::make_scope_exit([&]() {stop_spinning();});
However, my main concern from the previous comment was that the explicit call to the stop_spinning()
was deleted in some tests. Could you please add them back?
I added comments in places where it was deleted.
rosbag2_transport/test/rosbag2_transport/test_record_all_use_sim_time.cpp
Show resolved
Hide resolved
Signed-off-by: Chris Lalancette <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now.
Pulls: #1796 |
@ros-pull-request-builder retest this please |
@clalancette I was able to reproduce freeze locally on Ubuntu by increasing discovery wait interval up to the 500mS
and repeatedly run only those I see in the debugger that freeze occurs in our modified It seems the problem is in the underlying SingleThreadedExecutor spin() and cancel() routines. It looks like cancel doen't really work as expected and we are not exiting from the spin(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clalancette Uff, it was a nasty bug, but I've nailed it down. I've spent a day hunting for it.
I was even start thinking that we have a memory corruption or a race condition in the executor. But not. Everything is more simple.
It is just turns out that we spawn a new spin_thread_
for exec_->spin();
.
However, in some rare cases a context switch happened and we ran throw all test scenario and even call exec_->cancel();
from the stop_spinning()
, but context haven't yet got back to the spin_thread_
and exec_->spin()
.
The root cause of the freeze was that exec_->spin()
was executing after we already called exec_->cancel();
.
I added a comment with proposal to wait untill executor will start spinning before we will get back from the start_async_spin(node)
.
I was able to robustly reproduce this issue and verify fix on my local setup.
rosbag2_transport/test/rosbag2_transport/record_integration_fixture.hpp
Outdated
Show resolved
Hide resolved
…xture.hpp Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
Pulls: #1796 |
Yep, this seems much happier now. Thanks again @MichaelOrlov , just need approval again then we can merge this in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
https://github.com/Mergifyio backport jazzy |
✅ Backports have been created
|
* Remove wait_until_shutdown. This has almost exactly the same functionality as wait_for_condition, except for two things: 1. It is templated on the Timeout type. 2. It calls rclcpp::shutdown after the loop completes. However, neither of those is necessary; all callers to it use a std::chrono::duration, and all of the test fixtures already call rclcpp::shutdown. Thus, just remove it and make all callers use wait_for_condition instead. Signed-off-by: Chris Lalancette <[email protected]> * Shutdown the async spinner node without rclcpp::shutdown. That is, we really don't actually want to do a full rclcpp shutdown here; we only want to stop spinning. Accomplish that with an executor, and timing out every 100 milliseconds to check if we are done yet. Signed-off-by: Chris Lalancette <[email protected]> * Small fixes to start_async_spin in rosbag2_tests. Make sure it only spins as long as we haven't shutdown, and that it wakes up every so often to check that fact. Signed-off-by: Chris Lalancette <[email protected]> * Wait for topics to be discovered during recorder->record(). The main reason for that is that these tests generally want to test certain expectations around how many messages were received. However, if discovery takes longer than we expect, then it could be the case that we "missed" messages at the beginning because discovery hadn't yet completed. Fix this by just waiting around for the recorder to get all the subscriptions it expects before moving on with the test. Signed-off-by: Chris Lalancette <[email protected]> * Feedback from review. Signed-off-by: Chris Lalancette <[email protected]> * Switch to using MockRecorder. Signed-off-by: Chris Lalancette <[email protected]> * Fixes from review. Signed-off-by: Chris Lalancette <[email protected]> * Feedback from review. Signed-off-by: Chris Lalancette <[email protected]> * Apply suggestions from code review Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]> * Switch to using spin, rather than spin_some. That's because there is currently at least one bug associated with spin_some in rclcpp. However, it turns out that we don't even need to use it, as we can just as easily use spin() along with exec.cancel(). Signed-off-by: Chris Lalancette <[email protected]> * Make sure to stop_spinning when we tear down the test. Signed-off-by: Chris Lalancette <[email protected]> * Use scopes to shutdown spinning. Signed-off-by: Chris Lalancette <[email protected]> * Nested contexts just to explicitly cleanup the async spinners. Signed-off-by: Chris Lalancette <[email protected]> * Update rosbag2_transport/test/rosbag2_transport/record_integration_fixture.hpp Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]> * Apply the same fix to rosbag2_tests. Signed-off-by: Chris Lalancette <[email protected]> --------- Signed-off-by: Chris Lalancette <[email protected]> Co-authored-by: Michael Orlov <[email protected]> (cherry picked from commit 2d4d02f) # Conflicts: # rosbag2_transport/test/rosbag2_transport/test_record.cpp # rosbag2_transport/test/rosbag2_transport/test_record_all.cpp # rosbag2_transport/test/rosbag2_transport/test_record_all_ignore_leaf_topics.cpp # rosbag2_transport/test/rosbag2_transport/test_record_all_no_discovery.cpp # rosbag2_transport/test/rosbag2_transport/test_record_all_use_sim_time.cpp # rosbag2_transport/test/rosbag2_transport/test_record_regex.cpp
* Remove wait_until_shutdown. This has almost exactly the same functionality as wait_for_condition, except for two things: 1. It is templated on the Timeout type. 2. It calls rclcpp::shutdown after the loop completes. However, neither of those is necessary; all callers to it use a std::chrono::duration, and all of the test fixtures already call rclcpp::shutdown. Thus, just remove it and make all callers use wait_for_condition instead. Signed-off-by: Chris Lalancette <[email protected]> * Shutdown the async spinner node without rclcpp::shutdown. That is, we really don't actually want to do a full rclcpp shutdown here; we only want to stop spinning. Accomplish that with an executor, and timing out every 100 milliseconds to check if we are done yet. Signed-off-by: Chris Lalancette <[email protected]> * Small fixes to start_async_spin in rosbag2_tests. Make sure it only spins as long as we haven't shutdown, and that it wakes up every so often to check that fact. Signed-off-by: Chris Lalancette <[email protected]> * Wait for topics to be discovered during recorder->record(). The main reason for that is that these tests generally want to test certain expectations around how many messages were received. However, if discovery takes longer than we expect, then it could be the case that we "missed" messages at the beginning because discovery hadn't yet completed. Fix this by just waiting around for the recorder to get all the subscriptions it expects before moving on with the test. Signed-off-by: Chris Lalancette <[email protected]> * Feedback from review. Signed-off-by: Chris Lalancette <[email protected]> * Switch to using MockRecorder. Signed-off-by: Chris Lalancette <[email protected]> * Fixes from review. Signed-off-by: Chris Lalancette <[email protected]> * Feedback from review. Signed-off-by: Chris Lalancette <[email protected]> * Apply suggestions from code review Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]> * Switch to using spin, rather than spin_some. That's because there is currently at least one bug associated with spin_some in rclcpp. However, it turns out that we don't even need to use it, as we can just as easily use spin() along with exec.cancel(). Signed-off-by: Chris Lalancette <[email protected]> * Make sure to stop_spinning when we tear down the test. Signed-off-by: Chris Lalancette <[email protected]> * Use scopes to shutdown spinning. Signed-off-by: Chris Lalancette <[email protected]> * Nested contexts just to explicitly cleanup the async spinners. Signed-off-by: Chris Lalancette <[email protected]> * Update rosbag2_transport/test/rosbag2_transport/record_integration_fixture.hpp Co-authored-by: Michael Orlov <[email protected]> Signed-off-by: Chris Lalancette <[email protected]> * Apply the same fix to rosbag2_tests. Signed-off-by: Chris Lalancette <[email protected]> --------- Signed-off-by: Chris Lalancette <[email protected]> Co-authored-by: Michael Orlov <[email protected]>
* Remove wait_until_shutdown. This has almost exactly the same functionality as wait_for_condition, except for two things: 1. It is templated on the Timeout type. 2. It calls rclcpp::shutdown after the loop completes. However, neither of those is necessary; all callers to it use a std::chrono::duration, and all of the test fixtures already call rclcpp::shutdown. Thus, just remove it and make all callers use wait_for_condition instead. * Shutdown the async spinner node without rclcpp::shutdown. That is, we really don't actually want to do a full rclcpp shutdown here; we only want to stop spinning. Accomplish that with an executor, and timing out every 100 milliseconds to check if we are done yet. * Small fixes to start_async_spin in rosbag2_tests. Make sure it only spins as long as we haven't shutdown, and that it wakes up every so often to check that fact. * Wait for topics to be discovered during recorder->record(). The main reason for that is that these tests generally want to test certain expectations around how many messages were received. However, if discovery takes longer than we expect, then it could be the case that we "missed" messages at the beginning because discovery hadn't yet completed. Fix this by just waiting around for the recorder to get all the subscriptions it expects before moving on with the test. * Feedback from review. * Switch to using MockRecorder. * Fixes from review. * Feedback from review. * Apply suggestions from code review * Switch to using spin, rather than spin_some. That's because there is currently at least one bug associated with spin_some in rclcpp. However, it turns out that we don't even need to use it, as we can just as easily use spin() along with exec.cancel(). * Make sure to stop_spinning when we tear down the test. * Use scopes to shutdown spinning. * Nested contexts just to explicitly cleanup the async spinners. * Update rosbag2_transport/test/rosbag2_transport/record_integration_fixture.hpp * Apply the same fix to rosbag2_tests. --------- Signed-off-by: Chris Lalancette <[email protected]> Co-authored-by: Chris Lalancette <[email protected]> Co-authored-by: Michael Orlov <[email protected]>
I initially started this series trying to track down a rare failing flakey test with
test_record__rmw_cyclonedds_cpp
. That particular flake seems to be able to happen because sometimes discovery takes longer than we expect, and it is possible that the tests "miss" the first publication. If that's the case, then the rest of the test may fail because it is expecting a certai number of messages. Along the way, we cleanup the tests a bit:wait_until_shutdown
method, which was almost exactly the same aswait_for_condition
.rclcpp::shutdown
.recorder->record()
. This ensures we can't get into the above situation.With this series in place, the particular flake of
test_record
onrmw_cyclonedds_cpp
is fixed (or, at least, I can no longer reproduce it). There are still other flakes that can happen under load, but I think fixes for those will have to come separately.