-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Add wait conditions to combat flakiness #11368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
Ah, the irony of hitting a different test flakiness. For the record:
I will rerun the integration tests and investigate that flakiness separately. |
let endpoints_ready = |obj: Option<&k8s::Endpoints>| -> bool { | ||
if let Some(ep) = obj { | ||
return ep.subsets.iter().flatten().count() > 0; | ||
} | ||
false | ||
}; | ||
await_condition(&client, &ns, "web", endpoints_ready).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/TIOLI: is there a reason endpoints_ready
has to be repeated in all these tests, or can it be a helper in linkerd_policy_test
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me! i commented on a couple of nits that you're welcome to address or ignore :)
if let Some(ep) = obj { | ||
return ep.subsets.iter().flatten().count() > 0; | ||
} | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, take it or leave it: could also be represented as
if let Some(ep) = obj { | |
return ep.subsets.iter().flatten().count() > 0; | |
} | |
false | |
obj.into_iter().flat_map(|ep| ep.subsets.iter().flatten()).count() > 0 |
although 🤷♀️ if that's any clearer,
@@ -168,6 +168,14 @@ pub async fn await_route_status( | |||
.inner | |||
} | |||
|
|||
// Wait for the endpoints controller to populate the Endpoints resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, not a big deal:
// Wait for the endpoints controller to populate the Endpoints resource. | |
/// Returns `true` if the endpoints controller has populated an Endpoints resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could leave the "// wait for..." comment in the places where this is actually used with await_condition
, if you want, but not a big deal either way.
We intermittently see flaky policy integration test failures like: ``` failures: either thread 'either' panicked at 'assertion failed: `(left == right)` left: `7`, right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9 ``` This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0. This exit code indicates that curl failed to establish a connection. https://everything.curl.dev/usingcurl/returns It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally. However, by looking at the logic of the integration test, we can see that the integration test creates the `web` Service and the `web` Pod and waits for that pod to become ready before unblocking the curl from executing. This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller. As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for the `web` Service and at the same time, our test will unblock the curl command. If the test wins this race, it is possible that curl will run before the endpoints resource has been updated. We add an additional wait condition to the test to wait until the endpoints resource has an endpoint before unblocking curl. Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it. Signed-off-by: Alex Leong <alex@buoyant.io>
We intermittently see flaky policy integration test failures like: ``` failures: either thread 'either' panicked at 'assertion failed: `(left == right)` left: `7`, right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9 ``` This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0. This exit code indicates that curl failed to establish a connection. https://everything.curl.dev/usingcurl/returns It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally. However, by looking at the logic of the integration test, we can see that the integration test creates the `web` Service and the `web` Pod and waits for that pod to become ready before unblocking the curl from executing. This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller. As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for the `web` Service and at the same time, our test will unblock the curl command. If the test wins this race, it is possible that curl will run before the endpoints resource has been updated. We add an additional wait con 9664 dition to the test to wait until the endpoints resource has an endpoint before unblocking curl. Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it. Signed-off-by: Alex Leong <alex@buoyant.io> Signed-off-by: Adam Shaw <adam.shaw@vipps.no>
#11368 added a step to certain e2e integration tests where we await the endpoints becoming ready before attempting to send traffic to them. This was done to combat flakyness on those tests. We have observed flakyness in other similar tests, `targets_route` in particular. We add the same await step to that test and to all other tests in that form. Given the nature of flaky tests, it's difficult to confirm that this fixes the flakyness. Signed-off-by: Alex Leong <alex@buoyant.io>
We intermittently see flaky policy integration test failures like:
This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0. This exit code indicates that curl failed to establish a connection. https://everything.curl.dev/usingcurl/returns
It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally.
However, by looking at the logic of the integration test, we can see that the integration test creates the
web
Service and theweb
Pod and waits for that pod to become ready before unblocking the curl from executing. This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller. As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for theweb
Service and at the same time, our test will unblock the curl command. If the test wins this race, it is possible that curl will run before the endpoints resource has been updated.We add an additional wait condition to the test to wait until the endpoints resource has an endpoint before unblocking curl.
Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it.