8000 Defer first full snapshot on etcd startup by shreyas-s-rao · Pull Request #154 · gardener/etcd-backup-restore · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Defer first full snapshot on etcd startup #154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 8, 2019

Conversation

shreyas-s-rao
Copy link
Collaborator
@shreyas-s-rao shreyas-s-rao commented May 2, 2019

Signed-off-by: Shreyas Rao shreyas.sriganesh.rao@sap.com

What this PR does / why we need it:
This PR defers the first full snapshot on etcd startup, by instead taking a delta snapshot initially and then taking a full snapshot. This helps speeding up setting the readiness probe of etcd to active. If no previous full snapshot exists in the backup or it has been more than 24 hours since the last full snapshot, a full snapshot will be taken on startup, followed by delta snapshots, as was the case earlier.

Which issue(s) this PR fixes:
Fixes #150

Special notes for your reviewer:
Snapshotter Run() has been modified with new delaySeconds parameter to allow setting deferred snapshot timers. Also, for the moment, I have hard-coded 24 hours for checking if previous snapshot is too old, because to make it generic it requires larger effort (to convert any given cron schedule into a uniform time interval). Do let me know if there is an easy/quick way to solve it, or else I can work on it in a separate PR later.
I have tested the PR to the best of my ability. I will also run memory profiling. Please run intensive memory and CPU profiling from your side as well. Thanks.

Release note:

Full snapshot on etcd startup will now be deferred in favour of an initial delta snapshot, followed by a full snapshot and subsequent delta snapshots.

@shreyas-s-rao shreyas-s-rao added priority/critical Needs to be resolved soon, because it impacts users negatively needs/review Needs review component/etcd-backup-restore ETCD Backup & Restore platform/all kind/api-change API change with impact on API users needs/lgtm Needs approval for merging area/backup Backup related area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) area/high-availability High availability related labels May 2, 2019
@shreyas-s-rao shreyas-s-rao added this to the 1904b milestone May 2, 2019
@shreyas-s-rao shreyas-s-rao requested a review from swapnilgm May 2, 2019 16:17
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 2, 2019
@shreyas-s-rao shreyas-s-rao requested a review from amshuman-kr May 3, 2019 06:29
@shreyas-s-rao shreyas-s-rao force-pushed the defer-first-full-snapshot branch from 6169ad8 to 0307725 Compare May 3, 2019 17:55
@shreyas-s-rao
Copy link
Collaborator Author
shreyas-s-rao commented May 3, 2019

@swapnilgm I have updated the PR as per the discussion we had. I am still doubtful about the case where it has been more than 24 hours since previous full snapshot, because then the full snapshot will take time to upload and in the meantime, readiness probe will fail as it /healthz endpoint is still not active. Is it feasible to set the readiness probe to active already before taking full snapshot in the case of previous snapshot being older than 24 hours?

Copy link
Collaborator
@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we handle stopCh during a full, delta or first-full/delta snapshot?

ssr.logger.Infof("Applied watch on etcd from revision: %d", ssr.prevSnapshot.LastRevision+1)

for {
wr, ok := <-ssr.watchCh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do a select with stopCh and/or watchCtx.Done() here?

Do we have the same problem in regular full and delta snapshots?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I've addressed this now.

}
}

return nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assuming that CollectEventsSincePrevSnapshot executed before ssr.snapshotEventHandler is ever executed in after a restart. Can we document this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@shreyas-s-rao shreyas-s-rao force-pushed the defer-first-full-snapshot branch from 0307725 to b6e40f3 Compare May 7, 2019 11:12
Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>
@shreyas-s-rao shreyas-s-rao force-pushed the defer-first-full-snapshot branch from b6e40f3 to 6f2d514 Compare May 7, 2019 11:33
Copy link
Contributor
@swapnilgm swapnilgm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@swapnilgm swapnilgm added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/lgtm Needs approval for merging needs/review Needs review labels May 8, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 8, 2019
@swapnilgm swapnilgm merged commit 9dc6b78 into gardener:master May 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backup Backup related area/high-availability High availability related area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related component/etcd-backup-restore ETCD Backup & Restore kind/api-change API change with impact on API users needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) platform/all priority/critical Needs to be resolved soon, because it impacts users negatively
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid Full Backup on Startup
4 participants
0