how to do endpoints for non-webapp services · Issue #71 · mozilla-services/Dockerflow · GitHub

how to do endpoints for non-webapp services #71

Open
willkg opened this issue Dec 6, 2023 · 1 comment

Comments

@willkg
willkg commented Dec 6, 2023

For services that are not webapps, what does Dockerflow recommend we do for healthcheck endpoints?

For example, the Socorro processor is not a webapp and has nothing that responds to HTTP, so there's nothing to implement healthchecks with.

Is it the case that all services must implement a webapp to handle Dockerflow healthcheck endpoints? Should we have something else for non-webapp services?

@jwhitlock
Member

On https://github.com/mozilla/fx-private-relay, we implemented a pair of Django management commands that provide a liveness check for detecting stalled processes.

process_emails_from_sqs.py is a long-running management command that loops, polling an AWS SQS queue and processing any emails. It periodically writes a healthcheck file to disk with a timestamp and some data. Email is unpredictable, and the standard library's email parsing expects spec-compliant messages, so uncaught exceptions can crash the process. The AWS client library also has some built-in retry logic, so connection issues can appear as a stuck process.
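
Roughly, the write side of that pattern looks like this (a simplified sketch; the file path, field names, and loop shape here are illustrative assumptions, not our actual code):

# Sketch of the healthcheck-file pattern: the poll loop refreshes a small JSON
# file on every iteration so a separate command can detect when it has stalled.
# HEALTHCHECK_PATH and the recorded fields are illustrative assumptions.
import json
import time
from datetime import datetime, timezone
from pathlib import Path

HEALTHCHECK_PATH = Path("/tmp/healthcheck.json")

def write_healthcheck(cycles, last_error=None):
    """Record a timestamp plus some loop statistics for the liveness check."""
    data = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cycles": cycles,
        "last_error": last_error,
    }
    HEALTHCHECK_PATH.write_text(json.dumps(data))

def poll_forever(poll_queue_once):
    """Long-poll loop: process any messages, then refresh the healthcheck file."""
    cycles = 0
    while True:
        poll_queue_once()  # receive from SQS and process any emails
        cycles += 1
        write_healthcheck(cycles)
        time.sleep(1)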

check_health.py is a second management command that attempts to read the healthcheck file. If the file doesn't exist, or there is an issue such as the embedded timestamp being too old, it exits with a non-zero error code. If everything is copacetic, it exits with code 0 for success.
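
The check side could look something like this (again a sketch, not our actual check_health code, assuming the same illustrative file path; a Kubernetes exec probe only cares about the exit code):

# Sketch of the companion check: exit 0 when the healthcheck file exists and is
# fresh, exit 1 otherwise. MAX_AGE_SECONDS is an illustrative threshold.
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

HEALTHCHECK_PATH = Path("/tmp/healthcheck.json")  # must match the writer
MAX_AGE_SECONDS = 120

def main():
    try:
        data = json.loads(HEALTHCHECK_PATH.read_text())
        written = datetime.fromisoformat(data["timestamp"])
    except (OSError, ValueError, KeyError) as exc:
        print(f"unhealthy: {exc}", file=sys.stderr)
        return 1
    age = (datetime.now(timezone.utc) - written).total_seconds()
    if age > MAX_AGE_SECONDS:
        print(f"unhealthy: healthcheck data is {age:.0f}s old", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())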

The process_email_from_sqs.py command is run as a Kubernetes deployment with several replicas. The check_health.py command runs as a liveness probe. The spec looks something like this:

spec:
  containers:
    - command:
        - python
        - manage.py
        - process_emails_from_sqs
      livenessProbe:
        exec:
          command:
            - python
            - /app/manage.py
            - check_health
        failureThreshold: 5
        initialDelaySeconds: 5
        periodSeconds: 6
        successThreshold: 1
        timeoutSeconds: 5

We have hundreds of liveness probe failures a day according to Sentry, but it takes several failures in a row to terminate a process. It is more common for a process to terminate due to an uncaught exception, but the liveness check does prevent zombie replicas from sticking around until the next deployment.

I'm negative on a webservice for each background process, but we could re-implement this as a webservice that runs process_emails_from_sqs.py in a fork, sends health data over a pipe, and serves that data at /__heartbeat__ with a proper status code for a stalled process. I don't think it would make much sense to expose this webservice to the world; it would just be a way to make a background service look like a web service.
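
A rough sketch of that idea, purely hypothetical and simplified to read the existing healthcheck file rather than fork and pipe, so a background process can answer /__heartbeat__ like a web service:

# Hypothetical internal-only /__heartbeat__ wrapper; illustrative only.
# Serves the health data and maps staleness to the HTTP status code.
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

HEALTHCHECK_PATH = Path("/tmp/healthcheck.json")  # written by the background loop
MAX_AGE_SECONDS = 120

class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/__heartbeat__":
            self.send_error(404)
            return
        try:
            data = json.loads(HEALTHCHECK_PATH.read_text())
            written = datetime.fromisoformat(data["timestamp"])
            age = (datetime.now(timezone.utc) - written).total_seconds()
            status = 200 if age <= MAX_AGE_SECONDS else 500
        except (OSError, ValueError, KeyError):
            data, status = {"error": "no health data"}, 500
        body = json.dumps(data).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Bind to localhost only; this is not meant to be exposed to the world.
    HTTPServer(("127.0.0.1", 8080), HeartbeatHandler).serve_forever()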
