Replies: 4 comments
-
Since your load balancers are already redirecting traffic away from servers running old code versions, I think it makes sense to also stop the RQ workers running on those servers and start them again after the new code is properly deployed. Does this make sense? Having said that, I'm also open to ideas about implementing some kind of "filter" such that certain workers will skip certain jobs, but I'm not sure how we can achieve that in a scalable way.
-
Hey @selwin, thanks for the reply! I misspoke about our load balancers' capabilities. They don't detect old code versions; they only detect servers undergoing deployment. So in a three-server deployment A-B-C, server B will not receive traffic during its "deployment", while A and C will receive traffic while running different code versions. This is typically fine, but with workers, server A could enqueue a job for a function that C doesn't have yet (since it's behind a code version). If server C's workers pick up that job, it fails. The only thing we can come up with so far is not a "filter", but an override:

```python
import time

from rq.worker import SimpleWorker, WorkerStatus


class AssertCodeVersionSimpleWorker(SimpleWorker):
    def check_appropriate_code_version(self):
        # TODO: DETERMINE IF THE CODE VERSION IS WRONG - HOW DO WE DO THIS?
        return result

    def check_for_suspension(self, *args, **kwargs):
        before_state = None
        notified = False
        super().check_for_suspension(*args, **kwargs)

        wrong_code_version = not self.check_appropriate_code_version()
        while wrong_code_version and not self._stop_requested:
            if not notified:
                self.log.info('Worker suspended due to outdated code version')
                before_state = self.get_state()
                self.set_state(WorkerStatus.SUSPENDED)
                notified = True
            time.sleep(5)  # we'll probably use 1 second in production
            wrong_code_version = not self.check_appropriate_code_version()

        if before_state:
            self.set_state(before_state)
```

Do you see any issues with this implementation? It works well in our isolated tests, although the biggest struggle right now is the code for `check_appropriate_code_version`; we hope there is some scalable way of doing that. This implementation overrides `check_for_suspension` since it runs on every work-loop iteration before a job is dequeued.
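One possible shape for `check_appropriate_code_version`, assuming the deploy pipeline stamps a version file onto each server and publishes the fleet-wide target version to a Redis key. Both the file path and the key name `deploy:leader_version` are assumptions of this sketch, not RQ conventions:

```python
# Hypothetical sketch, not RQ API: compare this server's stamped version
# against the fleet-wide "leader" version published at deploy time.

def read_local_version(path="/srv/app/VERSION"):
    """Version stamped onto this server by the deploy pipeline (assumed)."""
    with open(path) as f:
        return f.read().strip()

def is_code_current(local_version, leader_version):
    """The worker should only take jobs when the versions match."""
    return local_version is not None and local_version == leader_version

# Inside the worker subclass it could look like (hedged):
#
#     def check_appropriate_code_version(self):
#         leader = self.connection.get("deploy:leader_version")
#         return is_code_current(read_local_version(),
#                                leader.decode() if leader else None)
```

The pure comparison is kept separate from the Redis lookup so it can be unit-tested without a running Redis.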
-
Maybe you could use a different Queue for the other version? Old workers with "old" queues and new workers with "new" queues, for example.
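For example, a hedged sketch that versions the queue names themselves; the naming scheme is invented here, not an RQ feature:

```python
# Each deploy enqueues to, and listens on, a queue suffixed with its
# own code version, so old workers never see jobs from newer code.

def versioned_queue_name(base: str, code_version: str) -> str:
    """e.g. 'default' + 'v42' -> 'default-v42'."""
    return f"{base}-{code_version}"

# Usage (hedged; CODE_VERSION and my_task are placeholders):
#
#     from redis import Redis
#     from rq import Queue
#
#     q = Queue(versioned_queue_name("default", CODE_VERSION),
#               connection=Redis())
#     q.enqueue(my_task)
#
# and the workers on the same server would listen only on that queue:
#
#     rq worker default-<CODE_VERSION>
```

One downside to watch: jobs left on an old version's queue after its last worker stops would need to be drained or migrated.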
-
This is an interesting issue indeed. However interesting it may be, I do think this will always be very specific to the use case/environment: what a "code version" is will differ per setup. This does seem like a perfect example of how useful custom worker classes can be. I'm moving this to a Discussion.
-
We use django-rq across a multi-server fleet, with rolling deployments that roll out new code one server at a time. This leaves significant time windows where different servers are running different code versions.
Our load balancers properly redirect traffic away from servers running old code versions, but those servers have RQ workers that will continue to pick up jobs from the queues.
We are very interested in any advice on how we could override either the worker class or the queue class in order to verify the code version of the running worker before taking a job out of the queue and running it on that worker.
Basically, a "filter" for which jobs a worker takes: the worker checks its code version against the "leader" code version and passes on any jobs that don't match, until it's updated and restarted. I hope this makes sense!
Thanks to anyone who has devised a solution for this.
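To make the "leader" version idea concrete, here's a rough sketch of the deploy-side half: the pipeline records the fleet-wide target version once the rollout completes. The Redis key name is purely hypothetical:

```python
# Hypothetical deploy-side counterpart to a worker-side version check:
# after the last server is updated, record the fleet-wide target version.

LEADER_KEY = "deploy:leader_version"  # assumed key name, not an RQ convention

def publish_leader_version(redis_conn, version: str) -> None:
    """Called by the deploy pipeline when the rollout completes."""
    redis_conn.set(LEADER_KEY, version)

def get_leader_version(redis_conn):
    """What a worker would compare its local version against."""
    raw = redis_conn.get(LEADER_KEY)
    return raw.decode() if isinstance(raw, bytes) else raw
```

Workers on servers that haven't been updated yet would see a leader version that differs from their own and could skip or defer jobs until restarted.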