schedule stress test · Issue #287 · jbaldwin/libcoro · GitHub

schedule stress test #287

Closed · wolfaherd opened this issue Jan 23, 2025 · 9 comments · Fixed by #288

@wolfaherd
wolfaherd commented Jan 23, 2025

Hi Josh Baldwin,

I have a question about the behavior of the StressTest function. I expected g_count to be exactly 200,000 after all tasks are scheduled and executed. However, in practice, the value of g_count is less than 200,000.

Could you clarify why this happens? Is it due to tasks being dropped, preemption, or some other issue in the scheduler or coroutine execution?

Thank you for your insight!

#include <atomic>
#include <chrono>
#include <thread>

#include <coro/coro.hpp>
#include <spdlog/spdlog.h>

std::atomic_uint32_t g_count = 0;

static void StressTest() {
  auto scheduler = coro::io_scheduler::make_shared(
      coro::io_scheduler::options{.pool = coro::thread_pool::options{.thread_count = 1}});

  // Each task only increments the counter and finishes.
  auto task1 = [&]() -> coro::task<void> {
    g_count++;
    co_return;
  };

  // Fire-and-forget: schedule 200,000 tasks onto the scheduler.
  for (int i = 0; i < 200000; ++i) {
    scheduler->schedule(task1());
  }

  // Wait for the tasks to run, then report the final count.
  std::this_thread::sleep_for(std::chrono::seconds(10));
  spdlog::info("g_count = {}", g_count.load());
}
@breakds
breakds commented Jan 24, 2025

@jbaldwin I can confirm that this is an issue. In a test with 1,000,000 tasks, I consistently see a small number of them either never getting queued or never resuming past co_await scheduler->schedule(). Ultimately, the final count of completed tasks falls short by a few.

Reproduction
A minimal demonstration is in my fork/branch. For example, a typical run ends with output like:

total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
m_handles_to_resume: 0, size = 6
total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
  • Out of 1,000,000 tasks:
    • 1 task is never queued (num_queued = 999,999).
    • All 999,999 queued tasks are eventually popped (num_popped = 999,999).
    • But only 999,994 tasks pass the point after co_await scheduler->schedule(), i.e., only 999,994 increments actually happen.

What I’ve investigated

  1. Scheduling Path
    • Each call to scheduler->schedule(...) goes into the I/O scheduler’s task container.
    • The container creates a “cleanup task” wrapper and calls resume() on it.
    • Inside the cleanup task, we do co_await scheduler->schedule() again, which should enqueue the coroutine handle into the thread_pool.
    • A thread pool worker then pops the handle from m_queue and calls handle.resume().
    • For most tasks, this works fine.
  2. Findings
    • Instrumentation confirms nearly all tasks follow the expected sequence (enqueue → pop → resume); a standalone sketch of this counting approach follows this list.
    • However, a small subset of tasks never seem to resume execution after the thread pool calls handle.resume().
    • In total, that leaves us short of the final target counter by a handful of tasks.
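
A standalone sketch of this style of counter instrumentation. The names below (work_queue, num_queued, etc.) are made up for illustration and are not libcoro internals; the point is simply to count every item at the enqueue, pop, and completion points so any shortfall can be localized to one of those stages:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <stop_token>
#include <thread>

struct work_queue {
  std::atomic<std::uint64_t> num_queued{0};
  std::atomic<std::uint64_t> num_popped{0};
  std::atomic<std::uint64_t> num_completed{0};

  void push(std::function<void()> fn) {
    {
      std::lock_guard lk{m_mutex};
      m_items.push_back(std::move(fn));
    }
    ++num_queued;
    m_cv.notify_one();
  }

  void run_worker(std::stop_token st) {
    while (true) {
      std::function<void()> fn;
      {
        std::unique_lock lk{m_mutex};
        // Returns false only when a stop was requested and no work remains.
        if (!m_cv.wait(lk, st, [&] { return !m_items.empty(); }))
          return;
        fn = std::move(m_items.front());
        m_items.pop_front();
      }
      ++num_popped;
      fn();            // run the unit of work
      ++num_completed; // count it only after the body actually ran
    }
  }

 private:
  std::mutex m_mutex;
  std::condition_variable_any m_cv;
  std::deque<std::function<void()>> m_items;
};

int main() {
  work_queue q;
  std::atomic<std::uint64_t> g_count{0};
  std::jthread worker{[&](std::stop_token st) { q.run_worker(st); }};

  for (int i = 0; i < 1'000'000; ++i)
    q.push([&] { ++g_count; });

  // Wait until everything queued has completed, then compare the counters.
  while (q.num_completed.load() < q.num_queued.load())
    std::this_thread::yield();

  std::cout << "queued=" << q.num_queued.load() << " popped=" << q.num_popped.load()
            << " completed=" << g_count.load() << '\n';
}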

Questions / Assistance

  • Have you seen similar issues with high-volume tasks where a few never resume?
  • Any insight into why handle.resume() might fail to continue coroutine execution, or why a task might never be queued at all?
  • I’ve spent significant time adding instrumentation and debugging steps, but haven’t pinpointed the root cause.

Any help or suggestions would be greatly appreciated! Thank you for taking a look.

@jbaldwin
Owner

Hello, thank you for the report. I have indeed been able to reproduce this with some slightly modified code:

TEST_CASE("issue-287", "[io_scheduler]")
{
    std::atomic_uint32_t g_count = 0;
    auto scheduler = coro::io_scheduler::make_shared(
        coro::io_scheduler::options{.pool = coro::thread_pool::options{.thread_count = 1}});

    auto task1 = [&]() -> coro::task<void> {
        g_count++;
        co_return;
    };

    for (int i = 0; i < 200000; ++i) {
        scheduler->schedule(task1());
    }

    // this is the correct way to wait for all tasks to complete for an io_scheduler
    scheduler->shutdown();

    std::cerr << g_count.load() << std::endl;

    REQUIRE(g_count.load() == 200000);
}

With this solution I do occasionally get 199998 or 199999 for g_count, but I haven't seen lower than that yet. I'll need more time to dig into this to figure out where these tasks are getting dropped from.

@jbaldwin
Owner

I've narrowed it down to task_container<thread_pool>::gc_internal. If I disable gc of completed tasks then I get 200000; I ran it in a loop for about an hour with no failures. With gc on, it fails consistently within about 10 runs, so it reproduces very quickly.

The std::vector<task<void>> in task_container needs to be switched to a std::list<task<void>> to prevent grow() from moving elements around; this is a very obvious bug (oops), but just switching this still hasn't fixed the full issue.

I have some ideas I might try tomorrow around the gc_internal() function but it'll probably add a lot more memory pressure to this class.

@breakds
breakds commented Jan 25, 2025

Thanks for taking time to look into this!

The std::vector<task> in task_container needs to be switched to a std::list<task> to prevent grow() from moving elements around

My understanding is that with m_tasks.resize in grow(), the capacity of the vector will be doubled and the original elements will be "moved" (but their indices in m_tasks should remain the same), which seems to be harmless.

I have some ideas I might try tomorrow around the gc_internal() function but it'll probably add a lot more memory pressure to this class.

Thanks a lot!

@jbaldwin
Owner

I think task_container's usage of the vector is probably currently safe. My concern is the executor usage on those tasks, which could be on multiple threads and already executing tasks that are moved out from under them during a resize.

I think the gc_internal function is just overly complicated; my thought is to go back to basics and just allocate and delete each time to make sure the behavior is correct. Once that's working, possibly reuse completed ones from a free list, allocate a new one if nothing is free, and drop the grow method entirely. As a side note, thinking through this problem, there is no proper hook in the coroutine standard to execute code after the coroutine frame is destroyed as far as I know; final_suspend() happens before that. So we still need some kind of gc to scan completed tasks for done() and delete them. In theory a shared_ptr with a weak ref would tell you it's completely cleaned up, but I'm not sure that kind of overhead is needed; I'd prefer to avoid it if possible.
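
A minimal standalone sketch of that "scan for done() and destroy()" pattern, using only the standard coroutine machinery. fire_and_forget and gc_container are made-up illustrative names, not libcoro's actual task_container; the shape of the gc() scan is the part being discussed:

#include <coroutine>
#include <cstddef>
#include <cstdio>
#include <exception>
#include <vector>

struct fire_and_forget {
  struct promise_type {
    fire_and_forget get_return_object() {
      return {std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_always initial_suspend() noexcept { return {}; }
    // Suspending at the final suspend point keeps the frame alive so that
    // done() can be observed later; something must eventually call destroy().
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
  std::coroutine_handle<promise_type> handle;
};

struct gc_container {
  void start(fire_and_forget t) {
    m_live.push_back(t.handle);
    t.handle.resume(); // a real container would hand this to an executor
  }

  // Scan for completed frames and destroy them; this done() check is the
  // step the thread identified as problematic in libcoro's gc_internal.
  std::size_t gc() {
    std::size_t destroyed = 0;
    std::erase_if(m_live, [&](std::coroutine_handle<> h) {
      if (h.done()) { h.destroy(); ++destroyed; return true; }
      return false;
    });
    return destroyed;
  }

  std::vector<std::coroutine_handle<>> m_live;
};

int main() {
  gc_container c;
  int count = 0;
  for (int i = 0; i < 1000; ++i)
    c.start([](int& n) -> fire_and_forget { ++n; co_return; }(count));
  std::printf("count=%d destroyed=%zu\n", count, c.gc());
}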

@breakds
breakds commented Jan 25, 2025

multiple threads and already executing tasks that are moved out from under them during a resize.

I see your point. I am not sure what the behavior is when a task is simultaneously being executed and moved. And yes a list does sound safer in this regard. Thanks for explaining!
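
A concrete single-threaded illustration of the relocation being discussed (generic standard-library behaviour, not libcoro code): growing a std::vector past its capacity moves every element to new storage, so an address captured earlier no longer refers to the live element, whereas std::list nodes never move:

#include <cstdint>
#include <iostream>
#include <list>
#include <vector>

int main() {
  std::vector<int> v;
  v.reserve(2);
  v.push_back(1);
  auto before = reinterpret_cast<std::uintptr_t>(v.data());
  v.resize(1000); // exceeds capacity: elements are moved to new storage
  auto after = reinterpret_cast<std::uintptr_t>(v.data());
  std::cout << std::boolalpha
            << "vector storage moved: " << (before != after) << '\n';

  std::list<int> l;
  l.push_back(1);
  auto* node = &l.front();
  for (int i = 0; i < 1000; ++i)
    l.push_back(i); // growth never relocates existing nodes
  std::cout << "list front stable:    " << (node == &l.front()) << '\n';
}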

jbaldwin added a commit that referenced this issue Jan 29, 2025
The coro::task_container::gc_internal function was deleting coroutines
once they were marked as .done(); however, some mechanism would rarely
cause the user_task coroutine to not actually execute. I'm still not
sure exactly why this is the case, but:

1) Simply disabling gc_internal() would stop the problem.
2) Running gc_internal() and moving the coro::task to a 'dead' list
   still caused the issue.

With these in mind I spent time re-reading the specification on
final_suspend and determined that coro::task_container should use an
atomic counter to track the submitted coroutines and have them
self-delete. The self-deletion is now done via a
coro::detail::task_self_destroying coroutine type that takes advantage
of the promise's final_suspend() not suspending. The spec states that if
this doesn't suspend then the coroutine will call destroy() on itself.

Closes #287
@jbaldwin
Owner

Ok, I figured out the root issue and have a PR up (#288). Would you all kindly check it out and run it for yourselves to make sure you agree it is fixed? Code reviews are also welcome.

The root issue seems to be that the .done() check in gc_internal() somehow races with the user_task executing, and very rarely, as we saw, that would cause the body of the user_task to not actually execute. Nothing else in task_container was actually a problem, including the vector usage, but with this fix none of that code exists anymore (yay).

To fix this I made a special coro::detail::task_self_destroying class which uses a part of the spec to self-delete upon completion. This means that coro::task_container no longer tracks the coro::task<void> cleanup tasks; it just fires them into the void and tracks the number of outstanding tasks with an atomic counter that each task decrements. task_self_destroying does this via promise_type::final_suspend() returning std::suspend_never: per the spec, if final_suspend doesn't suspend then the coroutine is defined as calling destroy() on itself.

For testing on my end I edited test/main.cpp to not generate an ssl cert (it takes too long when running a single test) and then ran while true; do ./test/libcoro_test "issue-287"; done for an hour. Before the fix I usually got a failure within about 10 tries.
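
A minimal standalone sketch of the self-destroying pattern described above. It is illustrative only and is not the actual coro::detail::task_self_destroying from the PR (which also has to hand execution to the scheduler); it just shows how final_suspend() returning std::suspend_never makes each frame free itself when its body finishes, leaving only an atomic count of outstanding tasks to wait on:

#include <atomic>
#include <coroutine>
#include <cstdint>
#include <cstdio>
#include <exception>

struct self_destroying {
  struct promise_type {
    self_destroying get_return_object() { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    // Not suspending at the final suspend point means the coroutine frame is
    // destroyed automatically as soon as the body finishes -- no gc needed.
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
};

std::atomic<std::uint64_t> g_outstanding{0};
std::atomic<std::uint64_t> g_completed{0};

self_destroying run_one() {
  ++g_completed;   // the "user task" body
  --g_outstanding; // signal completion; the frame then frees itself
  co_return;
}

int main() {
  constexpr std::uint64_t n = 200'000;
  g_outstanding = n;
  for (std::uint64_t i = 0; i < n; ++i)
    run_one(); // fire and forget: nothing is stored or tracked per task
  // In this single-threaded sketch every frame has already self-destructed by
  // here; a real container would wait until the outstanding counter hits zero.
  std::printf("outstanding=%llu completed=%llu\n",
              static_cast<unsigned long long>(g_outstanding.load()),
              static_cast<unsigned long long>(g_completed.load()));
}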

@jbaldwin jbaldwin self-assigned this Jan 29, 2025
@breakds
breakds commented Jan 29, 2025

Thanks for the quick turnaround on this fix—I really appreciate how promptly you addressed it. I’ve tested your branch locally and can confirm it resolves the issue in my reproducer. The new decentralized self‐destruction approach is much simpler to follow and seems more robust.

I’m not a coroutine expert, so I apologize in advance if I missed anything in the review, but from my perspective, the changes look great. Thank you again!

@jbaldwin
Owner
jbaldwin commented Jan 30, 2025

No problem, thanks for following up, and thanks for reviewing the code!

I've created a new release v0.13.0 since this was a major flaw in the task_container class.
