
Troubleshooting Istio

Ambient mode

See here

Sidecar Injection

If your pods are failing to start, look into the MutatingAdmissionWebhook istio-sidecar-injector. When a pod is created, the Kubernetes api-server will call the sidecar injector service (Istiod). Errors during injection, or failure to connect to the service, can result in pods not being created.

These errors may look something like failed calling webhook "sidecar-injector.istio.io": Post https://istiod.istio-system.svc:443/inject?timeout=30s: context deadline exceeded.
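
If the error names the webhook, it can help to confirm the webhook configuration exists and see where it points. A quick check (assuming the default, non-revisioned install, where the configuration is named istio-sidecar-injector):

$ kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml

In the output, the clientConfig field shows which Service and port the API server calls, and failurePolicy determines whether pods are still created when that call fails.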

Collecting Information

The replica set will generally contain any error messages. Gather this information with kubectl describe replicaset REPLICA_SET > replicaset.txt.

To get logs from Istiod, run: kubectl logs -n istio-system -l app=istiod --tail=100000000 > istiod.log.

To get the injection template: kubectl -n istio-system get configmap istio-sidecar-injector -o jsonpath={.data.config} > template.yaml

Diagnostics

Injection works by the API server connecting to the webhook deployment (Istiod). Issues can arise if this call is blocked, for example by firewalls. Depending on the Kubernetes configuration, this may require a firewall rule allowing port 443 or port 15017; instructions for doing so on GKE can be found here.
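
To see which port the webhook Service exposes and which container port on Istiod it targets (assuming the default Service name istiod in istio-system):

$ kubectl -n istio-system get svc istiod -o jsonpath='{.spec.ports}'

The https-webhook entry is normally port 443 on the Service with targetPort 15017 on the pod.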

To check if the API server can access the pod, we can send a request proxied through the API server:

An example of a request that succeeds (the no body found message is returned from the injection service itself and indicates we do have connectivity):

$ kubectl get --raw /api/v1/namespaces/istio-system/services/https:istiod:https-webhook/proxy/inject -v4
I0618 07:39:46.663871   36880 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server rejected our request for an unknown reason",
  "reason": "BadRequest",
  "details": {
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "no body found"
      }
    ]
  },
  "code": 400
}]
F0618 07:39:46.663940   36880 helpers.go:115] Error from server (BadRequest): the server rejected our request for an unknown reason

Similarly, we can send a request from another pod:

$ curl https://istiod.istio-system:443/inject -k
no body found

And from the istiod pod directly (note: the port here is 15017, as this is the targetPort for the Service):

$ curl https://localhost:15017/inject -k
no body found

With this information you should be able to isolate where the breakage occurs.
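
To rule out problems in the injection template itself, rather than connectivity, injection can also be rendered locally with istioctl, bypassing the webhook entirely (a sketch; deployment.yaml stands in for your workload manifest):

$ istioctl kube-inject -f deployment.yaml > injected.yaml

If this fails or produces an unexpected manifest, the problem is likely in the template or mesh config rather than in the API server's connectivity to Istiod.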

Istiod

Collecting information

To capture logs: kubectl logs -n istio-system -l app=istiod --tail=100000000 -c discovery > istiod.log.

To capture mesh config: kubectl get configmap -n istio-system -o jsonpath={.data.mesh} istio > meshconfig.yaml

To capture a proxy config dump from Istiod's perspective: kubectl exec ISTIOD_POD -- curl 'localhost:8080/debug/config_dump?proxyID=POD_NAME.POD_NAMESPACE'
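
It can also be useful to capture the sync status between Istiod and the proxies it manages:

$ istioctl proxy-status > proxy-status.txt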

Capture a snapshot of the Istio Control Plane dashboard. Prefer this to a screenshot if possible, as it allows zooming, etc.

Performance Issues

If you are experiencing performance issues with Istiod, such as excessive CPU or memory usage or memory leaks, it is helpful to capture profiles. Please see this page for help.
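
Before capturing profiles, a quick look at current resource usage can confirm the symptom (requires metrics-server in the cluster):

$ kubectl top pod -n istio-system -l app=istiod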

Sidecar/Gateway problems

Collecting Information

To get configuration and stats from a proxy (gateway or sidecar):

  • Stats: kubectl exec $POD -c istio-proxy -- curl 'localhost:15000/stats' > stats
  • Config Dump: kubectl exec $POD -c istio-proxy -- curl 'localhost:15000/config_dump' > config_dump.json OR istioctl proxy-config all $POD -ojson > config_dump.json
  • Clusters Dump: kubectl exec $POD -c istio-proxy -- curl 'localhost:15000/clusters' > clusters
  • Logs: kubectl logs $POD -c istio-proxy > proxy.log
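
As a shortcut, istioctl bug-report can collect much of this (proxy logs, config dumps, and cluster state) into a single archive suitable for attaching to an issue:

$ istioctl bug-report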

To enable debug logging, which may be useful if the default log does not provide enough information:

  • At runtime: istioctl proxy-config log POD --level=debug
  • For a pod, set annotation: sidecar.istio.io/logLevel: "debug" (see the example after this list)
  • For the whole mesh, install with --set values.global.proxy.logLevel=debug
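
For example, the per-pod annotation can be applied to an existing workload by patching its pod template, which rolls the pod and picks up the new level (a sketch; httpbin is a placeholder Deployment name):

$ kubectl patch deployment httpbin -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/logLevel":"debug"}}}}}'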

To enable access logging, which may be useful to debug traffic, see here. More info about access log format can be found in Envoy docs.
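
For example, one documented way to enable access logging to stdout mesh-wide is at install time:

$ istioctl install --set meshConfig.accessLogFile=/dev/stdout

The access logs then appear in the istio-proxy container output (kubectl logs $POD -c istio-proxy).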

Core Dumps

If Envoy crashes, a core dump may be generated. If so, you will get a log line like: Envoy exited with error: signal: segmentation fault (core dumped).

It is a best practice to treat crashes as potential security issues and report these privately.

When doing so, including the core file is helpful. The core will be dumped to /var/lib/istio/data/core.proxy.XXX, where XXX may change:

$ kubectl exec shell-5b7cf9f6c4-x8krc -c istio-proxy -- ls /var/lib/istio/data
core.proxy.19
$ kubectl cp -c istio-proxy shell-5b7cf9f6c4-x8krc:/var/lib/istio/data/core.proxy.19 core

You may need to enable core dumps on your cluster first.

Performance Issues

See Analyzing Istio Performance

High Istiod CPU usage / excessive pushes

Istiod pushes updates to proxies in response to Kubernetes objects changing (Services, Pods, Istio configs, etc). The size/cost of these updates is correlated with the amount of configuration and the number of proxies that need updates. When experiencing high Istiod CPU usage, the cause is typically updates that are too frequent or too large.

Both of these can often be addressed by scoping down dependencies through various methods.
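
For example, a default Sidecar resource in the root namespace limits every proxy's configuration to its own namespace plus istio-system, which usually shrinks both the size and frequency of pushes (a sketch; widen the egress hosts to match your actual cross-namespace traffic):

$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF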

Another common issue is constant updates to objects, typically done by controllers constantly updating some field such as an annotation. To diagnose what is causing these pushes, it is best to look at Istiod logs:

info    ads     Push debounce stable[21] 2 for config ServiceEntry/echo/vm.echo.svc.cluster.local: 100.614843ms since last change, 110.109891ms since last push, full=true

This log line indicates the start of a push. Included is the configuration that caused the update (in this case, the vm Service in namespace echo), as well as the type of push (full=true). Updates with full=false are fairly cheap and are triggered by endpoint updates; full=true pushes are generally the problem.

If there are many changes at once, not all objects will be shown. If this prevents diagnosis, there is an additional log that can be turned on (istioctl admin log --level fullpush=debug at runtime, or --log_output_level=fullpush:debug at startup).

To diagnose these, look for what configurations are being updated and inspect the Kubernetes objects for changes. To watch for diffs, the kubectl-grep tool can be used.
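
If kubectl-grep is not available, plain kubectl can also watch a suspect object and print it on every change, which makes a controller rewriting a field easy to spot (a sketch, using the vm Service in namespace echo from the log line above):

$ kubectl get service vm -n echo -o yaml --watch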

Note: ServiceEntry is used for updates to Service as well.

Common Issues

  • gRPC config stream closed: 13 or gRPC config stream closed: 0 in proxy logs, every 30 minutes. This error message is expected, as the connection to Pilot is intentionally closed every 30 minutes.
  • gRPC config stream closed: 14 in proxy logs. If this occurs repeatedly it may indicate problems connecting to Pilot. However, a single occurrence of this is typical when Envoy is starting or restarting.

Mutual TLS troubleshooting

If you suspect auto mTLS is not working as expected, please first read the documentation.

Auto mTLS has a known limitation with workload-level peer authentication. When workload-level peer authentication is used, you need to configure a corresponding destination rule so that the client sidecar sends mTLS or plaintext traffic correctly.
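
For example, if a workload-level PeerAuthentication disables mTLS for a workload, clients need a matching DestinationRule so their sidecars send plaintext instead of mTLS (a sketch, assuming an httpbin workload in the default namespace):

$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin-plaintext
  namespace: default
spec:
  host: httpbin.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
EOF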

If you still think there's an issue, follow the instructions below for investigation.

  1. Get the client Envoy cluster configuration to ensure auto mTLS is configured. You should see transportSocketMatches configured for the given Envoy cluster.

    $ istioctl proxy-config clusters ${POD}.foo -ojson | grep 'name.*outbound.*httpbin' -A130
    "name": "outbound|8000||httpbin.default.svc.cluster.local",
    ...
    "transportSocketMatches": [
            {
                "name": "tlsMode-istio",
                "match": {
                    "tlsMode": "istio"
                },
                "transportSocket": {
                    "name": "envoy.transport_sockets.tls",
                    "typedConfig": {
                        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
                        "commonTlsContext": ...
            }
            {
                "name": "tlsMode-disabled",
                "match": {},
                "transportSocket": {
                    "name": "envoy.transport_sockets.raw_buffer"
                }
            }
  2. Check the destination pod to see if the destination endpoint has the tlsMode label set to istio. For example, in the EDS response received by the sleep pod, we see that the httpbin pod endpoint has the tlsMode label.

    $ IP=$(kubectl get pod -lapp=sleep -o jsonpath="{.items[*].status.podIP}")
    $ kubectl exec -it ${POD} -c istio-proxy -- curl 'localhost:15000/config_dump?include_eds=true' | grep ${IP} -A15 -B5
      "endpoint": {
        "address": {
        "socket_address": {
            "address": "10.32.28.136",
            "port_value": 80
        }
        ...
        "metadata": {
          "filter_metadata": {
            "envoy.transport_socket_match": {
              "tlsMode": "istio"
            }
        },
  3. Check the Envoy upstream debug log to confirm which transport socket configuration is being used. This requires changing the Envoy upstream logger to debug level.

    $ istioctl pc log ${POD} --level debug
    $ kubectl logs -f ${POD} -c istio-proxy
    2021-04-08T22:35:22.650478Z     debug   envoy upstream  transport socket match, socket tlsMode-istio selected for host with address 10.32.28.136:80

    The log line above indicates that the mTLS socket was selected for the connection to the corresponding host. Note that the connection may be created in advance of the actual request; if you don't see such a log line, try recreating the destination-side endpoint.
