Shlink 4.4.0 failing to acquire lock with redis cluster: Cannot use 'SCRIPT' with redis-cluster #2366
Comments
Thanks! You are not the first one reporting this (see #2365). I suspect the root cause is the same as in #2350, although the error message is slightly different. I downgraded the culprit library in v4.4.3, but according to #2365 that didn't fix this particular issue. Can you put together a docker compose configuration where this can be consistently reproduced? Otherwise it's going to be very hard to find the underlying root cause, and I won't be able to report it to the proper library.
Ok, I was able to reproduce the issue with the same simple instructions I originally reported here symfony/symfony#59686, but using So basically, my guess is that the changes they introduced in v7.2.3 were meant to fix the issue reported here, but they ended up introducing the other issue. I would recommend trying Shlink 4.4.2, which uses If that's the case, then I'm afraid you'll have to stay on Shlink 4.3.x until they have released their fix.
Sadly, I have just checked, and this same error is reproducible with the latest changes in
The issue was introduced in
Just reported it there: symfony/symfony#59795
BTW @hezor, a short-term workaround would be to use a single redis instance with Shlink 4.4.3, until a fix has been provided for this.
Thank you very much! In our use case, I think the added redundancy of a Redis cluster is more important than the latest features from 4.4.x, although the option to use TLS for the database connection is a very nice bonus and keeps the infosec people happy. 😁
I'm trying to help pinpoint the issue in the upstream library, but in the meantime I'm downgrading it in Shlink to fix this issue. I need to fix one more bug and then I'll release Shlink v4.4.4.
I have just released Shlink 4.4.4.
Shlink version
4.4.0-roadrunner
PHP version
Official Docker Image
How do you serve Shlink
Docker image
Database engine
MariaDB
Database version
10.11.10
Current behavior
This is going to be a long one as I want to be as thorough as possible, so please bear with me.
We run our Shlink setup in AWS. We have 3 Shlink instances running in ECS Fargate using the official Docker 4.4.0-roadrunner images, and they connect to an RDS-based MariaDB 10.11.10. All 3 Shlink instances use a MemoryDB for Redis (version 7.1) for its lock files. This is a cluster of 3 nodes (1 shard with 2 replicas) that is dedicated entirely to Shlink.
The connection to Redis is made by setting the following envvar:
REDIS_SERVERS=tcp://nodename-redacted-redis-cluster-shlink-0001-001.abcdef.0001.memorydb.eu-north-1.amazonaws.com:6379,tcp://nodename-redacted-redis-cluster-shlink-0001-002.abcdef.0001.memorydb.eu-north-1.amazonaws.com:6379,tcp://nodename-redacted-redis-cluster-shlink-0001-003.abcdef.0001.memorydb.eu-north-1.amazonaws.com:6379
So all of the nodes are separately defined in the server list as the Shlink documentation suggests when using a Redis Cluster.
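Before blaming the cluster itself, it can help to confirm that a comma-separated server list like the one above splits into the entries you expect. The following is a minimal Python sketch for that purpose only. The function name `parse_redis_servers` and the example hostnames are hypothetical, the fallback to port 6379 is an assumption, and this is not how Shlink itself parses the variable.

```python
from urllib.parse import urlsplit


def parse_redis_servers(value):
    """Split a comma-separated server list into (scheme, host, port) tuples.

    Hypothetical helper for sanity-checking a REDIS_SERVERS-style value;
    it does not reproduce Shlink's own parsing.
    """
    servers = []
    for raw in value.split(","):
        raw = raw.strip()
        if not raw:
            continue  # tolerate stray commas and trailing whitespace
        parts = urlsplit(raw)
        if not parts.hostname:
            raise ValueError(f"no host found in entry: {raw!r}")
        # Assumption: fall back to 6379 when an entry omits the port.
        servers.append((parts.scheme, parts.hostname, parts.port or 6379))
    return servers


if __name__ == "__main__":
    # Placeholder hostnames, not the real MemoryDB endpoints.
    example = "tcp://node-1.example.com:6379,tcp://node-2.example.com:6379"
    for scheme, host, port in parse_redis_servers(example):
        print(scheme, host, port)
```

A check like this only validates the shape of the value. It says nothing about whether the nodes are reachable or whether the client library treats them as a cluster.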
When we upgraded from version 4.3.1 to version 4.4.0, we also began encrypting the database connections from Shlink by setting the envvar
DB_USE_ENCRYPTION=true
We originally launched Shlink back in October 2023, beginning with version 3.6.4, and have consistently updated Shlink to the latest version soon after each release. We had zero problems until there was a hiccup in the AWS Stockholm region last Friday, when AWS had serious problems with its networking. This meant that the Redis for Shlink had some connectivity issues, as did the RDS.
Later, after the incident was resolved by AWS, our Shlink admins reported that they could not create new short links, as Shlink Web Client reported "An unknown error occurred". I started digging in and noticed that Shlink was responding with HTTP 500, and quickly diagnosed that new short links can't be created if you create a new tag with them. Without tags, they got created just fine.
Here are the logs of Shlink when I tried to create a new short link with a tag:
Okay, so it seems to be a Redis problem. I try to restart all Shlink Fargate tasks to see if that fixes it. I log the startup sequence and see these errors, and the Shlink Fargate tasks do not start up:
Luckily we run Shlink in three different environments: in addition to production, we have staging and dev environments, which are identical to the production env. The exact same problem occurs in all of the environments, so I dig deeper in the dev env.
Next, I stop the Shlink Fargate tasks. Then I issue a FLUSHALL against the Redis cluster to clear all data. I try to restart Shlink and no dice, the same error appears on startup. Then I restore an RDS snapshot taken before the AWS incident and restart Shlink. Still, the same error at startup. Now I create a completely blank new MariaDB database instance and once again FLUSHALL Redis to get a properly clean slate. Still, the same error occurs on Shlink startup!
The next step for me was to downgrade to Shlink version 4.3.1. This fixed it all! I did the downgrade in staging and production envs too, and all was good once again! Shlink started up nicely without errors, new short links could be created with tags and so on.
So the weird part is that originally I got 4.4.0 running just fine when upgrading from 4.3.1. But any kind of connection error seems to break 4.4.0 permanently (for us at least). Back in 2023, when I did some disaster recovery testing for Shlink, there was basically nothing that broke Shlink completely.
So I have no idea what is going on here. Based on the logs it has something to do with Redis, right? Hopefully you can tell what the issue is. But for now, we are staying on version 4.3.1.
Expected behavior
Shlink recovers from infrastructure-related problems and does not break down permanently.
Minimum steps to reproduce
Unfortunately, I cannot give a definitive way to reproduce the problem, but these steps could lead to the same outcome: