Add wait conditions to combat flakiness #11368

adleong · 2023-09-14T00:20:07Z

We intermittently see flaky policy integration test failures like:

 failures:
    either

thread 'either' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9

This failure means that the curl process returned exit code 7 instead of the expected exit code 0. Exit code 7 indicates that curl failed to connect to the host. https://everything.curl.dev/usingcurl/returns
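For readers unfamiliar with curl's exit codes, here is a minimal sketch of how a few of them read. `describe_curl_exit` is a hypothetical helper written for illustration and is not part of the test suite; it covers only a small subset of the codes listed on the page linked above.

```rust
/// Hypothetical helper: describes a handful of curl exit codes.
/// Only codes 0, 6, 7 and 28 are covered; everything else falls through.
fn describe_curl_exit(code: i32) -> &'static str {
    match code {
        0 => "success",
        6 => "could not resolve host",
        // The code seen in the flaky test: the TCP connection was never established.
        7 => "failed to connect to host",
        28 => "operation timed out",
        _ => "other error (see curl's list of exit codes)",
    }
}
```

Code 7 points at the connection itself rather than at an HTTP-level rejection, which is what makes a missing endpoint a plausible suspect.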

It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally.

However, the logic of the integration test suggests a possible cause. The test creates the web Service and the web Pod, then waits for that pod to become ready before unblocking curl. This means that, theoretically, there could be a race condition between the test and the Kubernetes endpoints controller. As soon as the web pod becomes ready, the endpoints controller updates the Endpoints resource for the web Service and, at the same time, our test unblocks the curl command. If the test wins this race, curl may run before the Endpoints resource has been updated.

We add a wait condition to the test so that it waits until the Endpoints resource has an endpoint before unblocking curl.
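The general shape of such a wait condition can be sketched as a polling loop. This is only an illustration under stated assumptions: the tests in this PR use `await_condition` against the Kubernetes API, whereas `wait_until` below is a hypothetical, synchronous, std-only stand-in that takes an arbitrary predicate.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a wait condition: polls `check` until it
/// returns true, or gives up once `timeout` has elapsed.
/// Returns whether the condition was observed to hold.
fn wait_until(mut check: impl FnMut() -> bool, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Back off briefly before checking again.
        sleep(interval);
    }
}
```

In the test, the predicate would be "the Endpoints resource for `web` is populated", and curl would only be unblocked once the wait returns successfully.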

Since I could not reproduce the test failure locally, I cannot say whether this is actually the cause of the flakiness or whether this change fixes it.

Signed-off-by: Alex Leong <alex@buoyant.io>

adleong · 2023-09-14T00:47:37Z

Ah, the irony of hitting a different test flakiness. For the record:

thread 'inbound_accepted_reconcile_parent_delete' panicked at 'assertion failed: `(left == right)`
  left: `"False"`,
 right: `"True"`', policy-test/tests/inbound_http_route_status.rs:278:9

I will rerun the integration tests and investigate that flakiness separately.

hawkw · 2023-09-14T01:00:18Z

policy-test/tests/e2e_authorization_policy.rs

+        let endpoints_ready = |obj: Option<&k8s::Endpoints>| -> bool {
+            if let Some(ep) = obj {
+                return ep.subsets.iter().flatten().count() > 0;
+            }
+            false
+        };
+        await_condition(&client, &ns, "web", endpoints_ready).await;


nit/TIOLI: is there a reason endpoints_ready has to be repeated in all these tests, or can it be a helper in linkerd_policy_test?

hawkw

looks good to me! i commented on a couple of nits that you're welcome to address or ignore :)

hawkw · 2023-09-14T22:48:51Z

policy-test/src/lib.rs

+    if let Some(ep) = obj {
+        return ep.subsets.iter().flatten().count() > 0;
+    }
+    false


nit, take it or leave it: could also be represented as

Suggested change

-    if let Some(ep) = obj {
-        return ep.subsets.iter().flatten().count() > 0;
-    }
-    false
+    obj.into_iter().flat_map(|ep| ep.subsets.iter().flatten()).count() > 0

although 🤷‍♀️ I'm not sure that's any clearer.
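For reference, a small sketch comparing the two forms of the predicate. The `Endpoints` and `EndpointSubset` structs below are stand-ins written for this example, and they assume `subsets` is an `Option<Vec<_>>`, which is what the `.iter().flatten()` calls in the diff imply; the function names are hypothetical as well.

```rust
// Stand-ins for the Kubernetes types; only the field the predicate reads is modeled.
struct EndpointSubset;

struct Endpoints {
    subsets: Option<Vec<EndpointSubset>>,
}

/// Early-return form, as in the original change.
fn endpoints_ready_if_let(obj: Option<&Endpoints>) -> bool {
    if let Some(ep) = obj {
        return ep.subsets.iter().flatten().count() > 0;
    }
    false
}

/// Iterator-chain form, as in the suggestion.
fn endpoints_ready_chain(obj: Option<&Endpoints>) -> bool {
    obj.into_iter()
        .flat_map(|ep| ep.subsets.iter().flatten())
        .count()
        > 0
}
```

Both return `false` when the resource is missing, when `subsets` is `None`, and when `subsets` is an empty list; they return `true` only when at least one subset is present.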

hawkw · 2023-09-14T22:49:23Z

policy-test/src/lib.rs

@@ -168,6 +168,14 @@ pub async fn await_route_status(
    .inner
 }

+// Wait for the endpoints controller to populate the Endpoints resource.


nit, not a big deal:

Suggested change

-// Wait for the endpoints controller to populate the Endpoints resource.
+/// Returns `true` if the endpoints controller has populated an Endpoints resource.

we could leave the "// wait for..." comment in the places where this is actually used with await_condition, if you want, but not a big deal either way.

#11368 added a step to certain e2e integration tests where we await the endpoints becoming ready before attempting to send traffic to them. This was done to combat flakiness in those tests.

We have observed flakiness in other similar tests, `targets_route` in particular. We add the same await step to that test and to all other tests of that form. Given the nature of flaky tests, it's difficult to confirm that this fixes the flakiness.

Signed-off-by: Alex Leong <alex@buoyant.io>

adleong added 2 commits September 13, 2023 23:53

Add wait conditions to help test flakiness.

42869cb

Signed-off-by: Alex Leong <alex@buoyant.io>

Add wait conditions to more places

c0ca504

Signed-off-by: Alex Leong <alex@buoyant.io>

adleong requested a review from a team as a code owner September 14, 2023 00:20

hawkw approved these changes Sep 14, 2023

DRY

7331095

Signed-off-by: Alex Leong <alex@buoyant.io>

hawkw approved these changes Sep 14, 2023

hawkw requested a review from a team September 14, 2023 22:50

alpeb approved these changes Sep 15, 2023

adleong merged commit 8579c10 into main Sep 16, 2023

adleong deleted the alex/flake branch September 16, 2023 00:05

adleong mentioned this pull request Aug 8, 2024

await endpoints ready in more e2e tests to combat flakyness #12947

Merged
