Description
Problem
Version 1.17.20, tested 10-12 Feb 2024.
I was curious about a mainnet-beta (mb) validator's ability to recover, so I used a spare non-voting mb validator to see whether it could recover from an abrupt sudo reboot.
I ran the default validator settings, so incremental snapshots are taken every minute and full snapshots every 3 hours.
I tried 2 different validator startup scripts:
Script A: included --use-snapshot-archives-at-startup when-newest
Script B: the same script with that flag removed
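For anyone trying to reproduce this, the difference between the two scripts can be sketched as below. This is only an illustration: the issue does not list the full command line, so the base arguments (ledger path, --no-voting) are hypothetical placeholders, and the only detail taken from the report is the --use-snapshot-archives-at-startup when-newest flag.

```python
# Hypothetical base command: the real startup scripts are not shown in this issue,
# so the ledger path and other flags here are placeholders.
BASE_ARGS = ["solana-validator", "--ledger", "/path/to/ledger", "--no-voting"]


def startup_args(use_when_newest: bool) -> list:
    """Build the validator argv; Script A adds the flag, Script B omits it."""
    args = list(BASE_ARGS)
    if use_when_newest:
        args += ["--use-snapshot-archives-at-startup", "when-newest"]
    return args


script_a = startup_args(use_when_newest=True)   # Script A
script_b = startup_args(use_when_newest=False)  # Script B

if __name__ == "__main__":
    print(" ".join(script_a))
    print(" ".join(script_b))
```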
Test Methodology 1:
sudo reboot when incremental snapshots are available and less than 1 minute old:
Script A: 3 reboots, 3 successful recoveries each in approx 13 mins
Script B: 3 reboots, 3 successful recoveries each in approx 15 mins
So far so good!
Test Methodology 2:
However, approximately 10 mins before the 3-hour full snapshot is due, the validator stops creating minute-by-minute incrementals and only works on creating the next full snapshot. This means the last incremental gets up to ~15 mins old. (At least, that is my interpretation of what it appears to be doing.)
sudo reboot at various times with an old/aging incremental during full snapshot creation (for clarity, this is approximately a 15-minute window every 3 hours):
Script A:
Incremental 8 mins old: Failed
Incremental 6 mins old: Failed
Incremental 5 mins old: Failed
Script B:
Incremental 7 mins old: Success, took 19 mins
Incremental 9 mins old: Success, took 20 mins
Incremental 13 mins old: Success, took 23 mins
For Test Methodology 2 using Script A, every attempt failed with this error:
ERROR solana_ledger::bank_forks_utils] Failed to load bank: AccountsFile error: AppendVecError: incorrect layout/length/data in the appendvec at path /mnt/solana-accounts/run/247471897.30603
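The incremental ages quoted in Test Methodology 2 could be measured just before each reboot with something like the sketch below. It rests on a few assumptions that are not stated in this issue: that incremental archives are named incremental-snapshot-<base_slot>-<slot>-<hash>.tar.<ext>, that the file modification time is a reasonable proxy for when the archive was written, and that the snapshot directory path (hypothetical here) is known.

```python
import re
import time
from pathlib import Path
from typing import Optional

# Assumed naming scheme: incremental-snapshot-<base_slot>-<slot>-<hash>.tar.<ext>
INCREMENTAL_RE = re.compile(r"^incremental-snapshot-(\d+)-(\d+)-\w+\.tar\.\w+$")


def newest_incremental_age_minutes(
    snapshot_dir: str, now: Optional[float] = None
) -> Optional[float]:
    """Return the age in minutes of the highest-slot incremental archive.

    Returns None when the directory holds no incremental archives.
    """
    now = time.time() if now is None else now
    newest = None  # (slot, mtime) of the highest-slot incremental seen so far
    for path in Path(snapshot_dir).iterdir():
        match = INCREMENTAL_RE.match(path.name)
        if match is None:
            continue  # full snapshots and unrelated files are ignored
        slot = int(match.group(2))
        if newest is None or slot > newest[0]:
            newest = (slot, path.stat().st_mtime)
    if newest is None:
        return None
    return (now - newest[1]) / 60.0


if __name__ == "__main__":
    # Hypothetical path; substitute the validator's actual snapshot directory.
    age = newest_incremental_age_minutes("/path/to/snapshots")
    print("no incremental snapshot found" if age is None else f"{age:.1f} mins old")
```

Recording this value alongside each reboot would make it easier to confirm whether the Script A failures line up with the window in which no new incrementals are being written.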
Proposed Solution
On the advice of Brooks in the Discord, I am opening this issue to address the Script A failures under Test Methodology 2.