Multi-threaded Executor starvation fix #2702

Open · wants to merge 1 commit into base: rolling
Conversation

@HarunTeper commented Dec 9, 2024

Pull request addressing the issues in #2360 and #2645.

So far, I have added a test that detects starvation in the multi-threaded executor.
The test uses a mutually exclusive callback group containing two timers.
The executor should alternate between the two timer callbacks, never executing one of them twice before the other.
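
For illustration, here is a minimal sketch of what such a starvation check could look like. This is not the exact test in this PR; the counter names, the 500ms spin window, and the assertion threshold are illustrative assumptions:

TEST_F(TestMultiThreadedExecutor, starvation_sketch) {
  rclcpp::executors::MultiThreadedExecutor executor;
  auto node = std::make_shared<rclcpp::Node>("starvation_test_node");
  auto group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive);

  std::atomic<int> timer_one_count{0};
  std::atomic<int> timer_two_count{0};

  // Two always-ready timers in the same mutually exclusive group.
  auto timer_one = node->create_wall_timer(
    0ms, [&timer_one_count]() {timer_one_count++;}, group);
  auto timer_two = node->create_wall_timer(
    0ms, [&timer_two_count]() {timer_two_count++;}, group);

  executor.add_node(node);
  std::thread spinner([&executor]() {executor.spin();});
  std::this_thread::sleep_for(500ms);
  executor.cancel();
  spinner.join();

  // If the executor alternates, the two counters stay close together;
  // under starvation, one of them remains at zero.
  EXPECT_LE(std::abs(timer_one_count.load() - timer_two_count.load()), 1);
}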

To fix the starvation, I have identified the following steps (which I have not yet been able to implement completely); a rough sketch of steps 1-3 follows the list:

  1. Introduce a new mutex (I refer to this as 'notify_mutex_') in executor.hpp that is used to guard the callback group flags.
  2. Introduce a function that locks the 'notify_mutex_', triggers the 'interrupt_guard_condition_', and then unblocks the callback group flag (releasing the 'notify_mutex_' afterwards). Currently, 'MultiThreadedExecutor::run' contains the guard condition trigger, while 'Executor::execute_any_executable' contains the call that unblocks the callback group; these need to be combined and guarded by the 'notify_mutex_'.
  3. Split 'Executor::wait_for_work' into two functions. One, which may be called 'Executor::prepare_wait_set', does everything up to (but excluding) the 'wait' call that is currently inside 'wait_for_work'; the rest can stay in the current function. This change is necessary so that the 'notify_mutex_' can be held while a callback is being extracted from the wait set by 'get_next_executable', ensuring that no other thread can change a callback group flag at the same time.
  4. The most complex change: update the function that collects the entities and adds them to the 'wait_set_'. First, the 'wait_result_' should not be reset. Then a new function is needed that carries all callback instances from the previous 'wait_result_' over into the current 'wait_result_', provided they are blocked and not invalid. This new function must be executed after 'wait_set_.wait'; otherwise the previously blocked jobs would immediately unblock the wait, as they are already ready when it is called, leading to busy waiting. Furthermore, the position at which the previous job instances are inserted still needs to be decided, since it determines the priority of those jobs after insertion. This change may also require introducing a variable such as 'previous_wait_result_'.
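
As referenced above, here is a rough sketch of steps 1-3. The member names 'trigger_and_unblock' and 'prepare_wait_set' are hypothetical, and the bodies are simplified pseudocode rather than the actual rclcpp internals:

class Executor
{
protected:
  // Step 1: new mutex guarding the callback group flags.
  std::mutex notify_mutex_;

  // Step 2: trigger the guard condition and unblock the callback group
  // as one atomic step, so no other thread can observe one without the
  // other (currently split between MultiThreadedExecutor::run and
  // Executor::execute_any_executable).
  void
  trigger_and_unblock(const rclcpp::CallbackGroup::SharedPtr & group)
  {
    std::lock_guard<std::mutex> lock(notify_mutex_);
    interrupt_guard_condition_->trigger();
    group->can_be_taken_from().store(true);
  }

  // Step 3: wait_for_work split in two, so that notify_mutex_ can be
  // held from the end of the wait until 'get_next_executable' has
  // extracted the next callback.
  void prepare_wait_set();  // everything up to (but excluding) the wait
  void wait_for_work(std::chrono::nanoseconds timeout);  // the wait itself
};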

These steps are based on the work I published here:

https://ieeexplore.ieee.org/document/9622336

https://daes.cs.tu-dortmund.de/storages/daes-cs/r/publications/teper2024emsoft_preprint.pdf

I have already tried to implement some of these steps, and I will commit some of the changes to this fork this week. However, for step 4, I may require some help. I also noticed that my changes break some of the tests that are currently part of rclcpp, since I move the functions that set callback group flags and trigger guard conditions.

@HarunTeper changed the title from "Added test for starvation in the multi-threaded executor" to "Multi-threaded Executor starvation fix" on Dec 9, 2024
/*
Test that no tasks are starved
*/
TEST_F(TestMultiThreadedExecutor, starvation) {
Collaborator
Why should this be an assertion for the test? If we create a custom callback group with rclcpp::CallbackGroupType::Reentrant and add it to create_wall_timer, this test should pass, because there are two threads to which the executor can assign the executables concurrently.
In other words, isn't this the expected behavior that the user specifies with MutuallyExclusive (the default), as we discussed in #2645?
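
For reference, a minimal sketch of that reentrant variant (assuming the same timer callbacks as in the test):

auto reentrant_group = node->create_callback_group(
  rclcpp::CallbackGroupType::Reentrant);
timer_one = node->create_wall_timer(0ms, timer_one_callback, reentrant_group);
timer_two = node->create_wall_timer(0ms, timer_two_callback, reentrant_group);
// With two executor threads, both timers can be executed concurrently,
// so neither one starves the other.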

SingleThreadedExecutor has exactly the same situation. If you are trying to fix this timer starvation in the Executor, I do not think the problem is specific to MultiThreadedExecutor when a timer callback overruns.

"The executor should alternate between the two timer callbacks, never executing one of them twice before the other."

IMO, I am not sure this assumption is correct at the system level. It sounds like a user requirement: a user might want a high-priority timer that is executed as fast as possible, and in that case this fix becomes a problem for that requirement. (If so, is what's missing here a priority-order user interface?)

CC: @jmachowinski @mjcarroll @alsora

Collaborator
In the unit-test you are creating two timers with a period of 0ms.
In this scenario there's no guarantee that both timers will be called, because they are both always ready all the time. Note that the problem is not that they have the same period, but rather that the period is 0ms.
Since they are both always ready, only the "first one" will be invoked.

Can you please check what happens if the period is set to 10ms for both?
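
That is, something along these lines (sketch, reusing the callbacks from the test):

timer_one = node->create_wall_timer(10ms, timer_one_callback);
timer_two = node->create_wall_timer(10ms, timer_two_callback);
// With a non-zero period, neither timer is permanently ready, so the
// executor can pick up the other one between invocations.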

Comment on lines +126 to +128
auto start_time = std::chrono::steady_clock::now();
while (std::chrono::steady_clock::now() - start_time < 100ms) {
}
Contributor
Suggested change
auto start_time = std::chrono::steady_clock::now();
while (std::chrono::steady_clock::now() - start_time < 100ms) {
}
std::this_thread::sleep_for(100ms);

Comment on lines +146 to +148
auto start_time = std::chrono::steady_clock::now();
while (std::chrono::steady_clock::now() - start_time < 100ms) {
}
Contributor
Suggested change
auto start_time = std::chrono::steady_clock::now();
while (std::chrono::steady_clock::now() - start_time < 100ms) {
}
std::this_thread::sleep_for(100ms);

@jmachowinski
Contributor

The proposed changes, as well as the paper, don't make sense to me, but this might be a wording vs. implementation issue. Let's wait for the actual implementation...
@HarunTeper you might want to attend the next Client Library WG meeting to discuss this in person.

@alsora
Collaborator

alsora commented Dec 12, 2024

Hi @HarunTeper,
I agree that this topic seems complex enough that it would be better to have an in-person conversation.

The next Client Library WG meeting will happen Friday 12/20/2024 at 8AM PST.
Some details are here, and I will post a reminder next week: https://discourse.ros.org/t/next-client-library-wg-meeting-friday-6th-december-2024-8am-pt/40954

timer_two = node->create_wall_timer(0ms, timer_two_callback);

executor.add_node(node);
executor.spin();
Collaborator
@HarunTeper it would be nice to convert this into a "real" unit-test.

Right now this code will run forever until an assertion fails, which is not acceptable.
Moreover, we should remove the debug prints.
You should use GTEST assertions and expectations to provide the necessary information (note that you can attach logs to those; the logs are printed only when the test fails).
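
A sketch of that pattern (illustrative; the counter names and the 500ms window are assumptions): bound the spin instead of running forever, and attach a log to the expectation:

executor.add_node(node);
std::thread spinner([&executor]() {executor.spin();});
std::this_thread::sleep_for(500ms);  // bounded, instead of spinning forever
executor.cancel();
spinner.join();
// The streamed message is printed only if the expectation fails.
EXPECT_GT(timer_two_count.load(), 0)
  << "timer_two never fired; timer_one fired "
  << timer_one_count.load() << " times";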

@sloretz added the "more-information-needed" label (Further information is required) on Dec 20, 2024