Description
When an SST is deleted from a bucket before an online restore (OR) download job is able to download it, we expect the download job to fail.
However, under certain circumstances the download job completes successfully despite the missing SST, namely when the deleted SST covers only a system table. We are seeing this in TestFullClusterOnlineRestoreRecovery.
The issue seems to occur only intermittently, and only during stress race runs in CI. However, it appears to reproduce reliably when a sleep is added between the link phase and the download phase.
Given the method of reproduction, GC appears to affect the result of the DownloadSpanRequest in sendDownloadSpan. Since system tables are always dropped after the link phase, it is possible that once such a table has been GC'd, a DownloadSpanRequest on its span no-ops instead of returning an error. That said, these DownloadSpanRequests are sent directly to the storage layer, so this behavior needs more investigation.
Jira issue: CRDB-51604