Intermittent Test Failure in LedgerReplay_test.cpp: Race Condition in Resource Cleanup

@see

Description

Issue Summary

The testPeerSetBehavior method in LedgerReplay_test.cpp intermittently fails with the assertion:

BEAST_EXPECT(net.client.countsAsExpected(0, 0, 0));

This suggests a race condition where resources (tasks, skipLists, or deltas) are not properly cleaned up by the time the assertion is checked, despite the main asynchronous operations appearing to have completed.

Analysis

The test flow is as follows:

Set up a test network with two nodes
Request ledger replay from the client to server
Wait for tasks to complete with waitAndCheckStatus()
Wait for ledgers to be received with waitForLedgers()
Call sweep() to clean up resources
Verify that all resources are cleared with countsAsExpected(0, 0, 0)

The issue occurs because sweep() relies on shared pointer reference counting to determine if objects can be removed:

// In LedgerReplayer::sweep():
auto removeCannotLocked = [](auto& subTasks) {
    for (auto it = subTasks.begin(); it != subTasks.end();) {
        if (auto item = it->second.lock(); !item) {
            it = subTasks.erase(it);
        }
        else
            ++it;
    }
};
removeCannotLocked(skipLists_);
removeCannotLocked(deltas_);

This only removes items if their weak pointers can't be locked, meaning no more strong references exist. Since the code runs in a multi-threaded environment, background operations may still be holding references even after the primary tasks appear to be complete.

Occurrence Pattern

Occurs intermittently in CI but difficult to reproduce locally
Likely relates to timing of thread operations and network I/O
May be more frequent on slower or more heavily loaded systems

Proposed Solutions

Short-term Fix: Add a retry mechanism to the test:

// Add multiple sweep attempts with a delay
for (int attempt = 0; attempt < maxSweepAttempts; ++attempt) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    net.client.replayer.sweep();
    if (net.client.countsAsExpected(0, 0, 0))
        break;
}

Long-term Fix: Improve the sweep() method in LedgerReplayer to:
- Track which skipLists and deltas belong to which tasks
- Aggressively clean up resources belonging to completed tasks, even if references exist
- Add explicit cancelation of any pending operations before cleanup

Diagnostic Enhancement: Add logging to capture reference counts when cleanup fails:

if (!countsCorrect) {
    // Log reference counts for diagnostics
    for (auto const& [hash, weakPtr] : net.client.replayer.skipLists_) {
        if (auto ptr = weakPtr.lock()) {
            JLOG(j_.debug()) << "SkipList " << hash << " has " 
                            << ptr.use_count() << " references";
        }
    }
}

Related Upstream Changes

The upstream rippled repository recently made changes to improve test network isolation with dynamic port allocation, which addresses a similar class of issues but doesn't directly fix this resource cleanup race condition.

@see XRPLF/rippled#5196

Bad Runs

https://github.com/Xahau/xahaud/actions/runs/15034919900/job/42254778083

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Description

Issue Summary

Analysis

Occurrence Pattern

Proposed Solutions

Related Upstream Changes

Bad Runs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Description

Description

Issue Summary

Analysis

Occurrence Pattern

Proposed Solutions

Related Upstream Changes

Bad Runs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions