schedule stress test · Issue #287 · jbaldwin/libcoro · GitHub

schedule stress test #287

Closed · wolfaherd opened this issue Jan 23, 2025 · 9 comments · Fixed by #288

@wolfaherd
wolfaherd commented Jan 23, 2025

Hi Josh Baldwin,

I have a question about the behavior of the StressTest function. I expected g_count to be exactly 200,000 after all tasks are scheduled and executed. However, in practice, the value of g_count is less than 200,000.

Could you clarify why this happens? Is it due to tasks being dropped, preemption, or some other issue in the scheduler or coroutine execution?

Thank you for your insight!

#include <atomic>
#include <chrono>
#include <thread>

#include <coro/coro.hpp>
#include <spdlog/spdlog.h>

std::atomic_uint32_t g_count = 0;

static void StressTest() {
  auto scheduler = coro::io_scheduler::make_shared(
      coro::io_scheduler::options{.pool = coro::thread_pool::options{.thread_count = 1}});

  // Each task only increments the counter and finishes.
  auto task1 = [&]() -> coro::task<void> {
    g_count++;
    co_return;
  };

  // Fire-and-forget: schedule 200,000 tasks onto the scheduler.
  for (int i = 0; i < 200000; ++i) {
    scheduler->schedule(task1());
  }

  // Wait for the tasks to run, then report the final count.
  std::this_thread::sleep_for(std::chrono::seconds(10));
  spdlog::info("g_count = {}", g_count.load());
}
@breakds
breakds commented Jan 24, 2025

@jbaldwin I can confirm that this is an issue. In a test with 1,000,000 tasks, I consistently see a small number of them either never getting queued or never resuming past co_await scheduler->schedule(). Ultimately, the final count of completed tasks falls short by a few.

Reproduction
A minimal demonstration is in my fork/branch. For example, a typical run ends with output like:

total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
m_handles_to_resume: 0, size = 6
total = 999994
thread pool. size = 0, queue_size = 0, num_queued: 999999, num_popped: 999999
container. num cleanup task = 999999, resumed = 999994
  • Out of 1,000,000 tasks:
    • 1 task is never queued (num_queued = 999,999).
    • All 999,999 queued tasks are eventually popped (num_popped = 999,999).
    • But only 999,994 tasks pass the point after co_await scheduler->schedule(), i.e., only 999,994 increments actually happen.

What I’ve investigated

  1. Scheduling Path
    • Each call to scheduler->schedule(...) goes into the I/O scheduler’s task container.
    • The container creates a “cleanup task” wrapper and calls resume() on it.
    • Inside the cleanup task, we do co_await scheduler->schedule() again, which should enqueue the coroutine handle into the thread_pool.
    • A thread pool worker then pops the handle from m_queue and calls handle.resume().
    • For most tasks, this works fine.
  2. Findings
    • Instrumentation confirms nearly all tasks follow the expected sequence (enqueue → pop → resume); a standalone sketch of this counting approach follows this list.
    • However, a small subset of tasks never seem to resume execution after the thread pool calls handle.resume().
    • In total, that leaves us short of the final target counter by a handful of tasks.
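
A standalone sketch of this style of counter instrumentation. The names below (work_queue, num_queued, etc.) are made up for illustration and are not libcoro internals; the point is simply to count every item at the enqueue, pop, and completion points so any shortfall can be localized to one of those stages:

#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <stop_token>
#include <thread>

struct work_queue {
  std::atomic<std::uint64_t> num_queued{0};
  std::atomic<std::uint64_t> num_popped{0};
  std::atomic<std::uint64_t> num_completed{0};

  void push(std::function<void()> fn) {
    {
      std::lock_guard lk{m_mutex};
      m_items.push_back(std::move(fn));
    }
    ++num_queued;
    m_cv.notify_one();
  }

  void run_worker(std::stop_token st) {
    while (true) {
      std::function<void()> fn;
      {
        std::unique_lock lk{m_mutex};
        // Returns false only when a stop was requested and no work remains.
        if (!m_cv.wait(lk, st, [&] { return !m_items.empty(); }))
          return;
        fn = std::move(m_items.front());
        m_items.pop_front();
      }
      ++num_popped;
      fn();            // run the unit of work
      ++num_completed; // count it only after the body actually ran
    }
  }

 private:
  std::mutex m_mutex;
  std::condition_variable_any m_cv;
  std::deque<std::function<void()>> m_items;
};

int main() {
  work_queue q;
  std::atomic<std::uint64_t> g_count{0};
  std::jthread worker{[&](std::stop_token st) { q.run_worker(st); }};

  for (int i = 0; i < 1'000'000; ++i)
    q.push([&] { ++g_count; });

  // Wait until everything queued has completed, then compare the counters.
  while (q.num_completed.load() < q.num_queued.load())
    std::this_thread::yield();

  std::cout << "queued=" << q.num_queued.load() << " popped=" << q.num_popped.load()
            << " completed=" << g_count.load() << '\n';
}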

Questions / Assistance

  • Have you seen similar issues with high-volume tasks where a few never resume?
  • Any insight into why handle.resume() might fail to continue coroutine execution, or why a task might never be queued at all?
  • I’ve spent significant time adding instrumentation and debugging steps, but haven’t pinpointed the root cause.

Any help or suggestions would be greatly appreciated! Thank you for taking a look.

@jbaldwin
Owner

Hello, thank you for the report. I have indeed been able to reproduce this with some slightly modified code:

TEST_CASE("issue-287", "[io_scheduler]")
{
    std::atomic_uint32_t g_count = 0;
    auto scheduler = coro::io_scheduler::make_shared(
        coro::io_scheduler::options{.pool = coro::thread_pool::options{.thread_count = 1}});

    auto task1 = [&]() -> coro::task<void> {
        g_count++;
        co_return;
    };

    for (int i = 0; i < 200000; ++i) {
        scheduler->schedule(task1());
    }

    // this is the correct way to wait for all tasks to complete for an io_scheduler
    scheduler->shutdown();

    std::cerr << g_count.load() << std::endl;

    REQUIRE(g_count.load() == 200000);
}

With this solution I do occasionally get 199998 or 199999 for g_count, but I haven't seen lower than that yet. I'll need more time to dig into this to figure out where these tasks are getting dropped from.

@jbaldwin
Owner

I've narrowed it down to task_container<thread_pool>::gc_internal. If I disable gc of completed tasks then I get 200000; I ran it in a loop for about an hour with no failures. With gc on, it fails consistently within about 10 runs, so it reproduces very quickly.

The std::vector<task<void>> in task_container needs to be switched to a std::list<task<void>> to prevent grow() from moving elements around; this is a very obvious bug (oops), but just switching this still hasn't fixed the full issue.

I have some ideas I might try tomorrow around the gc_internal() function but it'll probably add a lot more memory pressure to this class.

@breakds
breakds commented Jan 25, 2025

Thanks for taking time to look into this!

The std::vector<task> in task_container needs to be switched to a std::list<task> to prevent grow() from moving elements around

My understanding is that with m_tasks.resize in grow(), the capacity of the vector will be doubled and the original elements will be "moved" (but their indices in m_tasks should remain the same), which seems to be harmless.

I have some ideas I might try tomorrow around the gc_internal() function but it'll probably add a lot more memory pressure to this class.

Thanks a lot!

@jbaldwin
Owner

I think task_container's usage of the vector is probably currently safe. My concern is the executor usage on those tasks, which could be on multiple threads and already executing tasks that are moved out from under them during a resize.

I think the gc_internal function is just overly complicated; my thought is to go back to basics and just allocate and delete each time to make sure the behavior is correct. Once that's working, possibly reuse completed ones from a free list, allocate a new one if nothing is free, and drop the grow method entirely. As a side note, thinking through this problem, there is no proper hook in the coroutine standard to execute code after the coroutine frame is destroyed as far as I know; final_suspend() happens before that. So we still need some kind of gc to scan completed tasks for done() and delete them. In theory a shared_ptr with a weak ref would tell you it's completely cleaned up, but I'm not sure that kind of overhead is needed; I'd prefer to avoid it if possible.
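
A minimal standalone sketch of that "scan for done() and destroy()" pattern, using only the standard coroutine machinery. fire_and_forget and gc_container are made-up illustrative names, not libcoro's actual task_container; the shape of the gc() scan is the part being discussed:

#include <coroutine>
#include <cstddef>
#include <cstdio>
#include <exception>
#include <vector>

struct fire_and_forget {
  struct promise_type {
    fire_and_forget get_return_object() {
      return {std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_always initial_suspend() noexcept { return {}; }
    // Suspending at the final suspend point keeps the frame alive so that
    // done() can be observed later; something must eventually call destroy().
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
  std::coroutine_handle<promise_type> handle;
};

struct gc_container {
  void start(fire_and_forget t) {
    m_live.push_back(t.handle);
    t.handle.resume(); // a real container would hand this to an executor
  }

  // Scan for completed frames and destroy them; this done() check is the
  // step the thread identified as problematic in libcoro's gc_internal.
  std::size_t gc() {
    std::size_t destroyed = 0;
    std::erase_if(m_live, [&](std::coroutine_handle<> h) {
      if (h.done()) { h.destroy(); ++destroyed; return true; }
      return false;
    });
    return destroyed;
  }

  std::vector<std::coroutine_handle<>> m_live;
};

int main() {
  gc_container c;
  int count = 0;
  for (int i = 0; i < 1000; ++i)
    c.start([](int& n) -> fire_and_forget { ++n; co_return; }(count));
  std::printf("count=%d destroyed=%zu\n", count, c.gc());
}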

@breakds
breakds commented Jan 25, 2025

multiple threads and already executing tasks that are moved out from under them during a resize.

I see your point. I am not sure what the behavior is when a task is simultaneously being executed and moved. And yes a list does sound safer in this regard. Thanks for explaining!
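
A concrete single-threaded illustration of the relocation being discussed (generic standard-library behaviour, not libcoro code): growing a std::vector past its capacity moves every element to new storage, so an address captured earlier no longer refers to the live element, whereas std::list nodes never move:

#include <cstdint>
#include <iostream>
#include <list>
#include <vector>

int main() {
  std::vector<int> v;
  v.reserve(2);
  v.push_back(1);
  auto before = reinterpret_cast<std::uintptr_t>(v.data());
  v.resize(1000); // exceeds capacity: elements are moved to new storage
  auto after = reinterpret_cast<std::uintptr_t>(v.data());
  std::cout << std::boolalpha
            << "vector storage moved: " << (before != after) << '\n';

  std::list<int> l;
  l.push_back(1);
  auto* node = &l.front();
  for (int i = 0; i < 1000; ++i)
    l.push_back(i); // growth never relocates existing nodes
  std::cout << "list front stable:    " << (node == &l.front()) << '\n';
}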

jbaldwin added a commit that referenced this issue Jan 29, 2025
The coro::task_container::gc_internal function was deleting coroutines
once they were marked as .done(); however, some mechanism would rarely
cause the user_task coroutine to not actually execute. I'm still not
sure exactly why this is the case, but:

1) Simply disabling gc_internal() would stop the problem.
2) Running gc_internal() and moving the coro::task to a 'dead' list
   still caused the issue.

With these in mind I spent time re-reading the specification on
final_suspend and determined that coro::task_container should use an
atomic counter to track the submitted coroutines and have them
self-delete. The self-deletion is now done via a
coro::detail::task_self_destroying coroutine type that takes advantage
of the promise's final_suspend() not suspending. The spec states that if
this doesn't suspend then the coroutine will call destroy() on itself.

Closes #287
@jbaldwin
Owner

Ok, I figured out the root issue and have a PR up (#288). Would you all kindly check it out and run it for yourselves to make sure you agree it is fixed? Code reviews are also welcome.

The root issue seems to be that the .done() check in gc_internal() somehow races with the user_task executing, and very rarely, as we saw, that would cause the body of the user_task to not actually execute. Nothing else in task_container was actually a problem, including the vector usage, but with this fix none of that code exists anymore (yay).

To fix this I made a special coro::detail::task_self_destroying class which uses a part of the spec to self-delete upon completion. This means that coro::task_container no longer tracks the coro::task<void> cleanup tasks; it just fires them into the void and tracks the number of outstanding tasks with an atomic counter that each task decrements. task_self_destroying does this via promise_type::final_suspend() returning std::suspend_never: per the spec, if final_suspend doesn't suspend then the coroutine is defined as calling destroy() on itself.

For testing on my end I edited test/main.cpp to not generate an ssl cert (it takes too long when running a single test) and then ran while true; do ./test/libcoro_test "issue-287"; done for an hour. Before the fix I usually got a failure within about 10 tries.
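
A minimal standalone sketch of the self-destroying pattern described above. It is illustrative only and is not the actual coro::detail::task_self_destroying from the PR (which also has to hand execution to the scheduler); it just shows how final_suspend() returning std::suspend_never makes each frame free itself when its body finishes, leaving only an atomic count of outstanding tasks to wait on:

#include <atomic>
#include <coroutine>
#include <cstdint>
#include <cstdio>
#include <exception>

struct self_destroying {
  struct promise_type {
    self_destroying get_return_object() { return {}; }
    std::suspend_never initial_suspend() noexcept { return {}; }
    // Not suspending at the final suspend point means the coroutine frame is
    // destroyed automatically as soon as the body finishes -- no gc needed.
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { std::terminate(); }
  };
};

std::atomic<std::uint64_t> g_outstanding{0};
std::atomic<std::uint64_t> g_completed{0};

self_destroying run_one() {
  ++g_completed;   // the "user task" body
  --g_outstanding; // signal completion; the frame then frees itself
  co_return;
}

int main() {
  constexpr std::uint64_t n = 200'000;
  g_outstanding = n;
  for (std::uint64_t i = 0; i < n; ++i)
    run_one(); // fire and forget: nothing is stored or tracked per task
  // In this single-threaded sketch every frame has already self-destructed by
  // here; a real container would wait until the outstanding counter hits zero.
  std::printf("outstanding=%llu completed=%llu\n",
              static_cast<unsigned long long>(g_outstanding.load()),
              static_cast<unsigned long long>(g_completed.load()));
}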

@jbaldwin jbaldwin self-assigned this Jan 29, 2025
@breakds
breakds commented Jan 29, 2025

Thanks for the quick turnaround on this fix—I really appreciate how promptly you addressed it. I’ve tested your branch locally and can confirm it resolves the issue in my reproducer. The new decentralized self‐destruction approach is much simpler to follow and seems more robust.

I’m not a coroutine expert, so I apologize in advance if I missed anything in the review, but from my perspective, the changes look great. Thank you again!

@jbaldwin
Owner
jbaldwin commented Jan 30, 2025

No problem, thanks for following up, and thanks for reviewing the code!

I've created a new release v0.13.0 since this was a major flaw in the task_container class.
