8000 Add wait conditions to combat flakiness by adleong · Pull Request #11368 · linkerd/linkerd2 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add wait conditions to combat flakiness #11368

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 16, 2023
Merged

Add wait conditions to combat flakiness #11368

merged 3 commits into from
Sep 16, 2023

Conversation

adleong
Copy link
Member
@adleong adleong commented Sep 14, 2023

We intermittently see flaky policy integration test failures like:

 failures:
    either

thread 'either' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9

This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0. This exit code indicates that curl failed to establish a connection. https://everything.curl.dev/usingcurl/returns

It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally.

However, by looking at the logic of the integration test, we can see that the integration test creates the web Service and the web Pod and waits for that pod to become ready before unblocking the curl from executing. This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller. As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for the web Service and at the same time, our test will unblock the curl command. If the test wins this race, it is possible that curl will run before the endpoints resource has been updated.

We add an additional wait condition to the test to wait until the endpoints resource has an endpoint before unblocking curl.

Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it.

Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@adleong adleong requested a review from a team as a code owner September 14, 2023 00:20
@adleong
Copy link
Member Author
adleong commented Sep 14, 2023

Ah, the irony of hitting a different test flakiness. For the record:

thread 'inbound_accepted_reconcile_parent_delete' panicked at 'assertion failed: `(left == right)`
  left: `"False"`,
 right: `"True"`', policy-test/tests/inbound_http_route_status.rs:278:9

I will rerun the integration tests and investigate that flakiness separately.

Comment on lines 300 to 306
let endpoints_ready = |obj: Option<&k8s::Endpoints>| -> bool {
if let Some(ep) = obj {
return ep.subsets.iter().flatten().count() > 0;
}
false
};
await_condition(&client, &ns, "web", endpoints_ready).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/TIOLI: is there a reason endpoints_ready has to be repeated in all these tests, or can it be a helper in linkerd_policy_test?

Signed-off-by: Alex Leong <alex@buoyant.io>
Copy link
Contributor
@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me! i commented on a couple of nits that you're welcome to address or ignore :)

Comment on lines +173 to +176
if let Some(ep) = obj {
return ep.subsets.iter().flatten().count() > 0;
}
false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, take it or leave it: could also be represented as

Suggested change
if let Some(ep) = obj {
return ep.subsets.iter().flatten().count() > 0;
}
false
obj.into_iter().flat_map(|ep| ep.subsets.iter().flatten()).count() > 0

although 🤷‍♀️ if that's any clearer,

@@ -168,6 +168,14 @@ pub async fn await_route_status(
.inner
}

// Wait for the endpoints controller to populate the Endpoints resource.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, not a big deal:

Suggested change
// Wait for the endpoints controller to populate the Endpoints resource.
/// Returns `true` if the endpoints controller has populated an Endpoints resource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could leave the "// wait for..." comment in the places where this is actually used with await_condition, if you want, but not a big deal either way.

@hawkw hawkw requested a review from a team September 14, 2023 22:50
@adleong adleong merged commit 8579c10 into main Sep 16, 2023
@adleong adleong deleted the alex/flake branch September 16, 2023 00:05
adamshawvipps pushed a commit to adamshawvipps/linkerd2 that referenced this pull request Sep 18, 2023
We intermittently see flaky policy integration test failures like:

```
 failures:
    either

thread 'either' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9
```

This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0.  This exit code indicates that curl failed to establish a connection.  https://everything.curl.dev/usingcurl/returns

It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally.

However, by looking at the logic of the integration test, we can see that the integration test creates the `web` Service and the `web` Pod and waits for that pod to become ready before unblocking the curl from executing.  This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller.  As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for the `web` Service and at the same time, our test will unblock the curl command.  If the test wins this race, it is possible that curl will run before the endpoints resource has been updated.

We add an additional wait condition to the test to wait until the endpoints resource has an endpoint before unblocking curl.

Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it.

Signed-off-by: Alex Leong <alex@buoyant.io>
adamshawvipps pushed a commit to adamshawvipps/linkerd2 that referenced this pull request Sep 18, 2023
We intermittently see flaky policy integration test failures like:

```
 failures:
    either

thread 'either' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `0`: blessed uninjected curl must succeed', policy-test/tests/e2e_server_authorization.rs:293:9
```

This test failure is saying that the curl process is returning an exit code of 7 instead of the expected exit code of 0.  This exit code indicates that curl failed to establish a connection.  https://everything.curl.dev/usingcurl/returns

It's unclear why this connection occasionally fails in CI and I have not been able to reproduce this failure locally.

However, by looking at the logic of the integration test, we can see that the integration test creates the `web` Service and the `web` Pod and waits for that pod to become ready before unblocking the curl from executing.  This means that, theoretically, there could be a race condition between the test and the kubernetes endpoints controller.  As soon as the web pod becomes ready, the endpoints controller will update the endpoints resource for the `web` Service and at the same time, our test will unblock the curl command.  If the test wins this race, it is possible that curl will run before the endpoints resource has been updated.

We add an additional wait con
9664
dition to the test to wait until the endpoints resource has an endpoint before unblocking curl.

Since I could not reproduce the test failure locally, it is impossible to say if this is actually the cause of the flakiness or if this change fixes it.

Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Adam Shaw <adam.shaw@vipps.no>
adleong added a commit that referenced this pull request Aug 22, 2024
#11368 added a step to certain e2e integration tests where we await the endpoints becoming ready before attempting to send traffic to them.  This was done to combat flakyness on those tests.

We have observed flakyness in other similar tests, `targets_route` in particular.  We add the same await step to that test and to all other tests in that form.

Given the nature of flaky tests, it's difficult to confirm that this fixes the flakyness.

Signed-off-by: Alex Leong <alex@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0