schedule stress test #287
@jbaldwin I can confirm that this is an issue. In a test with 1,000,000 tasks, I consistently see a small number of them either never getting queued or never resuming.

Reproduction output:

total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
m_handles_to_resume: 0, size = 6
total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
What I’ve investigated
Questions / Assistance
Any help or suggestions would be greatly appreciated! Thank you for taking a look.
Hello, thank you for the report. I have indeed been able to reproduce this with some slightly modified code:

TEST_CASE("issue-287", "[io_scheduler]")
{
    std::atomic_uint32_t g_count = 0;
    auto scheduler = coro::io_scheduler::make_shared(
        coro::io_scheduler::options{.pool = coro::thread_pool::options{.thread_count = 1}});

    auto task1 = [&]() -> coro::task<void> {
        g_count++;
        co_return;
    };

    for (int i = 0; i < 200000; ++i)
    {
        scheduler->schedule(task1());
    }

    // This is the correct way to wait for all tasks to complete on an io_scheduler.
    scheduler->shutdown();
    std::cerr << g_count.load() << std::endl;
    REQUIRE(g_count.load() == 200000);
}

With this test I do occasionally get 199998 or 199999 for g_count, but I haven't seen lower than that yet. I'll need more time to dig into this to figure out where these tasks are getting dropped from.
I've narrowed it down to coro::task_container. I have some ideas I might try tomorrow around the gc_internal function.
Thanks for taking the time to look into this!
My understanding is that the tasks are stored in a vector inside task_container, so I was wondering whether that vector being resized while some of the tasks are still executing could be part of the problem.
Thanks a lot!
I think task_container's usage of the vector is probably currently safe. My concern is the executor's usage of those tasks, which could be happening on multiple threads that are already executing tasks moved out from under them during a resize.

The gc internal function is, I think, just overly complicated. My thought is to go back to basics and just allocate and delete each time to make sure the behavior is correct. Once that's working, possibly reuse completed tasks from the free list; if nothing is free, allocate one and drop the grow method entirely.

As a side note, thinking through this problem: as far as I know there is no proper hook in the coroutine standard to execute code after the coroutine frame is destroyed; final_suspend() runs before that. So we still need some kind of gc to scan completed tasks for done() and delete them. In theory a shared_ptr with a weak ref would tell you when a task is completely cleaned up, but I'm not sure that kind of overhead is needed; I'd prefer to avoid it if possible.
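To make the "back to basics" idea concrete, here is a minimal sketch of a gc pass that scans stored handles for done() and destroys them, with each submission getting its own stable node so nothing is relocated out from under an executing thread. The class and member names (simple_task_container, m_tasks, m_mutex) are hypothetical, and it stores raw std::coroutine_handle<> rather than coro::task, so treat this as an illustration of the approach, not libcoro's actual task_container:

#include <coroutine>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical container: each submitted coroutine handle gets its own list
// node, so nothing is moved while another thread may be executing it, and a
// gc pass deletes the frames that have finished.
class simple_task_container
{
public:
    void store(std::coroutine_handle<> handle)
    {
        std::scoped_lock lk{m_mutex};
        m_tasks.push_back(handle); // one node per task, never relocated afterwards
    }

    // Scan for completed coroutines and destroy their frames; returns how
    // many were cleaned up.
    std::size_t gc()
    {
        std::scoped_lock lk{m_mutex};
        std::size_t deleted = 0;
        for (auto it = m_tasks.begin(); it != m_tasks.end();)
        {
            if (*it && it->done())
            {
                it->destroy();          // finished frame, safe to free
                it = m_tasks.erase(it);
                ++deleted;
            }
            else
            {
                ++it;
            }
        }
        return deleted;
    }

private:
    std::mutex                         m_mutex;
    std::list<std::coroutine_handle<>> m_tasks;
};

Even in this simplified form, the question raised below remains: whether done() can be observed safely while another thread is finishing the coroutine, which is what ultimately motivated the self-destroying approach later in the thread.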
I see your point. I am not sure what the behavior is when a task is simultaneously being executed and moved. And yes, a list does sound safer in this regard. Thanks for explaining!
The coro::task_container::gc_internal function was deleting coroutines when they were marked as .done(); however, some mechanism would rarely cause the user_task coroutine to not actually execute. I'm still not sure exactly why this is the case, but: 1) simply disabling gc_internal() would stop the problem; 2) running gc_internal() and moving the coro::task to a 'dead' list still caused the issue. With these in mind I spent time re-reading the specification on final_suspend and determined that coro::task_container should use an atomic counter to track the submitted coroutines and have them self-delete. The self-deletion is now done via a coro::detail::task_self_destroying coroutine type that takes advantage of the promise's final_suspend() not suspending: the spec states that if this doesn't suspend then the coroutine will call destroy() on itself. Closes #287
Ok, I figured out the root issue and have a PR up here. Would you all kindly check it out and run it for yourselves to make sure you agree it is fixed? Code reviews are also welcome.

The root issue seems to be that the task_container's gc_internal() deletion of .done() coroutines could, through some mechanism I still can't fully pin down, cause a user_task coroutine to never actually execute. To fix this I made a special coro::detail::task_self_destroying coroutine type whose promise does not suspend at final_suspend(), so each coroutine destroys its own frame and the container only needs an atomic counter to track outstanding tasks. For testing on my end I edited the stress test above.
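To illustrate the mechanism the fix relies on, here is a minimal, self-contained sketch of a self-destroying coroutine. It is not the actual coro::detail::task_self_destroying implementation, and the type and member names (self_destroying_task, counter, make_task) are made up for the example: because final_suspend() returns std::suspend_never, each coroutine frame destroys itself as soon as its body finishes, and the only bookkeeping the owner needs is an atomic counter.

#include <atomic>
#include <coroutine>
#include <cstddef>
#include <cstdio>

// Hypothetical sketch: the promise does not suspend at final_suspend(), so
// the coroutine frame destroys itself once the body finishes; completion is
// tracked with an atomic counter instead of scanning a container for done().
struct self_destroying_task
{
    struct promise_type
    {
        std::atomic<std::size_t>* counter{nullptr};

        self_destroying_task get_return_object()
        {
            return self_destroying_task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        // Returning suspend_never here means the coroutine is not suspended
        // at its final suspend point, so the frame is destroyed automatically.
        std::suspend_never final_suspend() noexcept
        {
            if (counter != nullptr)
            {
                counter->fetch_sub(1, std::memory_order_acq_rel);
            }
            return {};
        }
        void return_void() noexcept {}
        void unhandled_exception() noexcept {}
    };

    std::coroutine_handle<promise_type> handle;
};

static self_destroying_task make_task()
{
    co_return; // the interesting part is the self-destruction on completion
}

int main()
{
    constexpr std::size_t    count = 100000;
    std::atomic<std::size_t> remaining{count};

    for (std::size_t i = 0; i < count; ++i)
    {
        auto t = make_task();                    // suspended at initial_suspend()
        t.handle.promise().counter = &remaining;
        t.handle.resume();                       // runs, decrements, self-destroys
    }

    std::printf("remaining = %zu\n", remaining.load()); // expected: 0
    return 0;
}

The trade-off is that nothing may touch the handle after resume() returns, because the frame is already gone; libcoro's real implementation layers its own scheduling and bookkeeping on top of this, so the sketch only demonstrates the standard's self-destruction rule.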
Thanks for the quick turnaround on this fix; I really appreciate how promptly you addressed it. I've tested your branch locally and can confirm it resolves the issue in my reproducer. The new decentralized self-destruction approach is much simpler to follow and seems more robust. I'm not a coroutine expert, so I apologize in advance if I missed anything in the review, but from my perspective the changes look great. Thank you again!
No problem, thanks for following up, and thanks for reviewing the code! I've created a new release, v0.13.0, since this was a major flaw in the task_container class.
Hi Josh Baldwin,
I have a question about the behavior of the StressTest function. I expected g_count to be exactly 200,000 after all tasks are scheduled and executed. However, in practice, the value of g_count is less than 200,000.
Could you clarify why this happens? Is it due to tasks being dropped, preemption, or some other issue in the scheduler or coroutine execution?
Thank you for your insight!