fix: change HPA minReplicas to 2 to resolve PDB issues #25873
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with @googlebot I signed it.
What to do if you already signed the CLA:
- Individual signers
- Corporate signers
ℹ️ Googlers: Go here for more info. |
Hi @agilgur5. Thanks for your PR. I'm waiting for an istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@googlebot I signed it |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
/ok-to-test |
Need to add validation logic to resolve the issue, not only change the default values for the PDB and HPA.
End users may configure an HPA minReplicas value that is incompatible with the PDB, which prevents pods from restarting and blocks rescheduling, upgrades, etc.
Please read through #12602 - there is a pretty good analysis of the problem. |
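For illustration, a minimal sketch of the kind of incompatible combination described above; the resource names, namespace, and API versions are assumptions for the example, not taken from this PR's diff:

```yaml
# A PDB that requires at least one pod to stay up at all times...
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod
  namespace: istio-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: istiod
---
# ...paired with an HPA (and/or replicaCount) whose floor is also 1.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istiod
  namespace: istio-system
spec:
  minReplicas: 1   # with a single pod running, evicting it would drop
  maxReplicas: 5   # availability below minAvailable, so voluntary
  scaleTargetRef:  # evictions (e.g. during a node drain) are refused
    apiVersion: apps/v1
    kind: Deployment
    name: istiod
```

With this combination, any voluntary disruption is blocked until a second replica exists, which is the situation the linked issues describe.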
As an initial PR, I think it is fine not to validate the defaults. More important is to fix this problem and backport it to 1.7.
I don't understand why the HPA spec minReplicas is increased to 2. Seems like it would be sufficient to keep the HPA spec at 1/5, and then increase the replicas to 2.
@agilgur5 nice first PR! Thank you for the contribution. We have struggled and struggled as a project with what to do about HPAs. The default for Istio is to install HPAs; therefore, the defaults should account for HPAs being installed. After reading the latest K8s docs on the ratios used to determine an HPA's replica count, I am not clear this PR is correct, but you are on the right track. Cheers, |
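For reference, the ratio from the K8s HPA docs alluded to here is (paraphrased):

$$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil$$

In other words, the HPA only adds replicas when the observed metric exceeds its target; it never adds headroom just because a PDB needs it, which is why the floor (minReplicas) has to carry that concern.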
I don't feel comfortable making the min replicas==2 for the default profile, as this will force all users, no matter how small their cluster, to use double the resources, which is often a waste. It's also harder to debug, etc., with 2 replicas. I would rather have:
- no PDB and have a "HA" profile with PDB+min replicas=2+
- Kubernetes change to allow scaling up to meet the constraints of a PDB
Agree. It is a shame min replicas cannot be zero unless alpha gates are turned on in K8s 1.18. That would be an easier answer than adding a profile. We do need to take care not to introduce profile proliferation. 👍 But turning the suggestion around, wouldn't it be better for the default profile to include HA, and a different profile to disable PDBs? We could add a notice to the profile - displayed by istioctl - that the default profile includes Xyz features, including HA capability. @esnible thoughts? Cheers, |
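As a rough sketch of what such a conditional "HA" set of flags could look like as an IstioOperator overlay; the value paths are assumptions based on the Helm values of this era, not something this PR adds:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      defaultPodDisruptionBudget:
        enabled: true     # keep the PDBs...
    pilot:
      autoscaleMin: 2     # ...but only together with a floor of 2 replicas
      autoscaleMax: 5
    gateways:
      istio-ingressgateway:
        autoscaleMin: 2
        autoscaleMax: 5
```

A non-HA or PoC install would instead disable the default PDB and leave the floors at 1, which matches the opt-in/opt-out trade-off discussed below.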
Throwing in my two cents, my response to both of these would be that I generally think of security and availability matters as things you have turned on by default -- best practices for production use turned on by default. One could also describe that as "making the hard things easy": there is then little effort in achieving the best practice. Then, if there is a need, have a very explicit opt-out capability for PoCs, smaller clusters, etc. An explicit opt-out means one has to actively acknowledge the ramifications of that choice. Perhaps worthwhile to note, the decision also impacts the way one thinks of things, either as a maintainer or as a user. With opt-out, HA is the default and "forefront" with non-HA the "exception", whereas with opt-in, HA is the "exception" and non-HA is forefront.
I think this is relevant too, as the current defaults already include a PDB together with a misconfigured default HPA (that this PR attempts to fix). Also, there is the existing
Y'all obviously understand Istio significantly better than me, but thought I'd throw in my two cents too as a k8s admin and user (as well as a library author and contributor myself). I occasionally find myself searching for HA and secure configuration in various charts / components and finding them missing because the defaults lack them -- adding them is considerable work, and even if they exist they may not be as well-tested as the defaults. But best practices shouldn't be afterthoughts (not that the Istio team treats them as such, just saying generally), especially for components that have a wide impact on the cluster. |
@morvencao would be happy to try adding that (perhaps in a separate PR so as to not block this) if you could point me in the right direction of where that might go. Per my review notes, I don't really know where to look for something like that or tests.
@sdake The issue(s) pointed to changing this to 2, but I do think it is necessary myself as well. If only 1 replica is available (whether that's set by the HPA or the default replicaCount isn't relevant, as I understand), the PDB will result in an error response to an eviction request. The HPA only scales up in response to utilization, not a pending eviction. But correct me if I'm wrong, I'm certainly not an expert in this. |
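For context on the eviction behaviour described in this comment: a node drain goes through the Eviction API, and the API server simply refuses the request when the PDB would be violated; it does not ask the HPA for more replicas. A hedged sketch (pod name and namespace are made up):

```yaml
# What `kubectl drain` effectively submits for each pod on the node,
# as a POST to /api/v1/namespaces/istio-system/pods/<pod-name>/eviction:
apiVersion: policy/v1
kind: Eviction
metadata:
  name: istiod-5d9c8b7f6d-abcde   # hypothetical pod name
  namespace: istio-system
# With minAvailable: 1 and only one ready replica, this request is rejected
# (HTTP 429) until another replica becomes ready -- which the HPA will not
# create on its own, since it reacts to utilization, not to pending evictions.
```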
I opened kubernetes/kubernetes#93476 about this on the k8s side. I don't think we should just wait on k8s, but I am pretty convinced this is poor (but expected) behavior on the k8s side. |
On the k8s UX side, I actually thought a rolling update strategy might apply to the PDB case too (since it similarly means a user is OK with some level of scale-up), but that only covers the resource's own rolling update, not the case where the underlying nodes go through a rolling update. All of these relate to each other, but on the k8s side they are independent and unaware of each other, I guess. |
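For reference, a sketch of the rolling-update knobs being discussed; the Deployment name, labels, and image tag are placeholders. As the comment says, maxSurge/maxUnavailable only govern the Deployment's own rollout, not evictions caused by draining nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod               # placeholder name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra pods allowed during this Deployment's own rollout
      maxUnavailable: 25%    # pods that may be down during this Deployment's own rollout
  selector:
    matchLabels:
      app: istiod
  template:
    metadata:
      labels:
        app: istiod
    spec:
      containers:
      - name: discovery
        image: docker.io/istio/pilot:1.7.0   # placeholder tag
# A node drain never consults maxSurge; each eviction is gated only by the PDB.
```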
@agilgur5: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I wonder if a better solution is to configure |
@rcernich Reading https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge - I don't immediately understand your proposal. The default (according to that page) is 25%. This means the max surge during a rolling update with our defaults of 5 max pods = 6 pods. This should enable 1 extra pod for a rolling update. Not sure why we don't see this behavior automatically? I still believe we have two choices:
I think default-ha seems like a reasonable approach, although as a project we do not want an HA profile version for every profile in the system. An entirely different option is to have a conditional set of HA flags such as
@ostromart @costinm @esnible @howardjohn can you comment on the above? I think we have punted on this problem for about three years, and it's time to take a hard look at solving it. Cheers, |
I 100% agree but I don't think k8s does this: kubernetes/kubernetes#93476. If the issue is invalid and we can do it, please let us know, that would be perfect. I don't see why we need a profile to configure a single flag. You can just set |
- PDBs for Istio services have a minAvailable of 1, and HPAs had a minReplicas of 1 (as well as default replicaCount of 1), meaning that the defaults would cause failures for things like kops rolling-update
- bump minReplicas to 2 so that at least 1 pod can go down by default
- and bump autoScaleMin to 2 as well, which seems to correspond to HPA minReplicas as well
- and bump default replicaCount to 2 where it was previously 1
Force-pushed from 9fe5b31 to b06be67.
Yes, per my previous comment (#25873 (comment)), it doesn't currently do this. There does not seem to be much bite on the k8s issue filed, kubernetes/kubernetes#93476. To me, that too seems to perhaps be out-of-scope; the HPA would then be scaling not for load, as it is purposed, but to meet PDB requirements, which also couples the two resources together.
Per the diff, it's for several services -- personally, I experienced the PDB error with
Per my previous comment (#25873 (comment)), if the goal is to cater to people "trying Istio out", the existing In any case, the defaults are still currently incorrect and a confusing issue that lots of users run into (including those trying it out). I've fixed the one review comment about |
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2020-09-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager. |
I do think this still needs attention, but I can't re-open this PR without making a new one and haven't gotten responses 😕 |
hit this bug again over a year after my original PR, now with Istio 1.10, on a different cluster at a different job... so this is still a bug... 😕 |
I still have |
Description
- PDBs for Istio services have a minAvailable of 1, and HPAs had a minReplicas of 1 (as well as a default replicaCount of 1), meaning that the defaults would cause failures for things like kops rolling-update
- bump minReplicas to 2 so that at least 1 pod can go down by default
- and bump autoScaleMin to 2 as well, which seems to correspond to HPA minReplicas as well
- also bump maxReplicas to 10 to account for the doubling of minReplicas, and autoScaleMax respectively (EDIT: removed according to review)
Tags
Fixes #24000. EDIT: Also fixes #12602, which duplicates it. Or perhaps the former isn't fully resolved without adding a validation piece.
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
This changes the defaults, so maybe it affects "Installation" too? And it does change (only) Configuration, but not "Configuration Infrastructure". Let me know if I should check more, I was not really sure and these areas weren't described in the contributing guidelines.
Review Notes
Hi there, first-time contributor here 👋 I was affected by #24000 when performing rolling-updates and saw it was marked as "good first issue", so thought I would try my hand at fixing it. Please let me know if any changes are necessary and offer any guidance 🙏
Things I wasn't totally sure of:
- Per the description, I added autoScaleMin as I saw it was used in several places and minReplicas was set to it, but let me know if that's wrong.
- I doubled maxReplicas / autoScaleMax to correspond to the doubled minimum, similar to some sample code, but if this should be left unchanged, can undo it. (EDIT: removed according to review)
- I think replicaCount needs to be set to 2 in more places and perhaps in some places. I only changed the ones that were previously set for safety, but can add more. May need guidance on that.
- Not really sure how to test this. Ideally would compare PDB minAvailable to default replicaCount and minReplicas, but those can be in lots of different places and I make no claims I understand how the multitudes of configs here work 😅 If someone could tell me where to look / guide me in the right direction, that would be helpful!
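To make the intended relationship concrete, here is a hedged sketch of what the rendered defaults would look like after this change, along with the invariant a future test could assert (the HPA's minReplicas, and replicaCount, strictly greater than the PDB's minAvailable). Names and API versions are assumptions, not copied from the charts:

```yaml
# PDB rendered by the charts (unchanged by this PR)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istio-ingressgateway   # assumed name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: istio-ingressgateway
---
# HPA after this PR: a floor of 2, so one pod can always be evicted
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istio-ingressgateway
spec:
  minReplicas: 2   # must stay strictly greater than the PDB's minAvailable
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istio-ingressgateway
```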