restore: investigate succeeding OR download jobs after corruption · Issue #148408 · cockroachdb/cockroach · GitHub
restore: investigate succeeding OR download jobs after corruption #148408
Closed
@kev-cao

Description

When an SST is deleted from a bucket before an online restore (OR) download job can download it, we expect the download job to fail.

However, under certain circumstances the download job completes successfully despite the missing SST, namely when the deleted SST covers only a system table. TestFullClusterOnlineRestoreRecovery is failing because of this.

The issue seems to occur only intermittently, and only during stress race runs in CI. However, adding a sleep between the link phase and the download phase seems to reproduce it reliably.


Given the method of reproduction, it seems that GC affects the result of the DownloadSpanRequest in sendDownloadSpan. Since system tables are always dropped after the link phase, it is possible that once such a table has been GC'd, a DownloadSpanRequest on its span no-ops instead of returning an error. That said, these DownloadSpanRequests are sent directly to the storage layer, so the behavior here needs more investigation.

Jira issue: CRDB-51604
