Description
Description
Issue Summary
The testPeerSetBehavior
method in LedgerReplay_test.cpp
intermittently fails with the assertion:
BEAST_EXPECT(net.client.countsAsExpected(0, 0, 0));
This suggests a race condition where resources (tasks, skipLists, or deltas) are not properly cleaned up by the time the assertion is checked, despite the main asynchronous operations appearing to have completed.
Analysis
The test flow is as follows:
- Set up a test network with two nodes
- Request ledger replay from the client to server
- Wait for tasks to complete with
waitAndCheckStatus()
- Wait for ledgers to be received with
waitForLedgers()
- Call
sweep()
to clean up resources - Verify that all resources are cleared with
countsAsExpected(0, 0, 0)
The issue occurs because sweep()
relies on shared pointer reference counting to determine if objects can be removed:
// In LedgerReplayer::sweep():
auto removeCannotLocked = [](auto& subTasks) {
for (auto it = subTasks.begin(); it != subTasks.end();) {
if (auto item = it->second.lock(); !item) {
it = subTasks.erase(it);
}
else
++it;
}
};
removeCannotLocked(skipLists_);
removeCannotLocked(deltas_);
This only removes items if their weak pointers can't be locked, meaning no more strong references exist. Since the code runs in a multi-threaded environment, background operations may still be holding references even after the primary tasks appear to be complete.
Occurrence Pattern
- Occurs intermittently in CI but difficult to reproduce locally
- Likely relates to timing of thread operations and network I/O
- May be more frequent on slower or more heavily loaded systems
Proposed Solutions
-
Short-term Fix: Add a retry mechanism to the test:
// Add multiple sweep attempts with a delay for (int attempt = 0; attempt < maxSweepAttempts; ++attempt) { std::this_thread::sleep_for(std::chrono::milliseconds(50)); net.client.replayer.sweep(); if (net.client.countsAsExpected(0, 0, 0)) break; }
-
Long-term Fix: Improve the
sweep()
method inLedgerReplayer
to:- Track which skipLists and deltas belong to which tasks
- Aggressively clean up resources belonging to completed tasks, even if references exist
- Add explicit cancelation of any pending operations before cleanup
-
Diagnostic Enhancement: Add logging to capture reference counts when cleanup fails:
if (!countsCorrect) { // Log reference counts for diagnostics for (auto const& [hash, weakPtr] : net.client.replayer.skipLists_) { if (auto ptr = weakPtr.lock()) { JLOG(j_.debug()) << "SkipList " << hash << " has " << ptr.use_count() << " references"; } } }
Related Upstream Changes
The upstream rippled repository recently made changes to improve test network isolation with dynamic port allocation, which addresses a similar class of issues but doesn't directly fix this resource cleanup race condition.
Bad Runs
https://github.com/Xahau/xahaud/actions/runs/15034919900/job/42254778083