8000 Intermittent Test Failure in LedgerReplay_test.cpp: Race Condition in Resource Cleanup · Issue #512 · Xahau/xahaud · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
Intermittent Test Failure in LedgerReplay_test.cpp: Race Condition in Resource Cleanup #512
Open
@sublimator

Description

@sublimator

Description

Issue Summary

The testPeerSetBehavior method in LedgerReplay_test.cpp intermittently fails with the assertion:

BEAST_EXPECT(net.client.countsAsExpected(0, 0, 0));

This suggests a race condition where resources (tasks, skipLists, or deltas) are not properly cleaned up by the time the assertion is checked, despite the main asynchronous operations appearing to have completed.

Analysis

The test flow is as follows:

  1. Set up a test network with two nodes
  2. Request ledger replay from the client to server
  3. Wait for tasks to complete with waitAndCheckStatus()
  4. Wait for ledgers to be received with waitForLedgers()
  5. Call sweep() to clean up resources
  6. Verify that all resources are cleared with countsAsExpected(0, 0, 0)

The issue occurs because sweep() relies on shared pointer reference counting to determine if objects can be removed:

// In LedgerReplayer::sweep():
auto removeCannotLocked = [](auto& subTasks) {
    for (auto it = subTasks.begin(); it != subTasks.end();) {
        if (auto item = it->second.lock(); !item) {
            it = subTasks.erase(it);
        }
        else
            ++it;
    }
};
removeCannotLocked(skipLists_);
removeCannotLocked(deltas_);

This only removes items if their weak pointers can't be locked, meaning no more strong references exist. Since the code runs in a multi-threaded environment, background operations may still be holding references even after the primary tasks appear to be complete.

Occurrence Pattern

  • Occurs intermittently in CI but difficult to reproduce locally
  • Likely relates to timing of thread operations and network I/O
  • May be more frequent on slower or more heavily loaded systems

Proposed Solutions

  1. Short-term Fix: Add a retry mechanism to the test:

    // Add multiple sweep attempts with a delay
    for (int attempt = 0; attempt < maxSweepAttempts; ++attempt) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        net.client.replayer.sweep();
        if (net.client.countsAsExpected(0, 0, 0))
            break;
    }
  2. Long-term Fix: Improve the sweep() method in LedgerReplayer to:

    • Track which skipLists and deltas belong to which tasks
    • Aggressively clean up resources belonging to completed tasks, even if references exist
    • Add explicit cancelation of any pending operations before cleanup
  3. Diagnostic Enhancement: Add logging to capture reference counts when cleanup fails:

    if (!countsCorrect) {
        // Log reference counts for diagnostics
        for (auto const& [hash, weakPtr] : net.client.replayer.skipLists_) {
            if (auto ptr = weakPtr.lock()) {
                JLOG(j_.debug()) << "SkipList " << hash << " has " 
                                << ptr.use_count() << " references";
            }
        }
    }

Related Upstream Changes

The upstream rippled repository recently made changes to improve test network isolation with dynamic port allocation, which addresses a similar class of issues but doesn't directly fix this resource cleanup race condition.

@see XRPLF/rippled#5196

Bad Runs

https://github.com/Xahau/xahaud/actions/runs/15034919900/job/42254778083

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0