Our product configuration runs Postgres with 2 synchronous replicas (3 Postgres instances in total) and synchronous_mode_strict enabled.
The 3 Postgres pods are spread across 3 nodes (virtual machines) in a Kubernetes cluster.
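For context, the synchronous replication side of this setup corresponds roughly to the following Patroni DCS settings. This is a sketch, not a verbatim copy of our configuration, and the exact way the values are fed in through Spilo may differ:

```yaml
# Sketch of the Patroni DCS settings implied by the setup described above;
# field names follow Patroni's documented configuration, values illustrative.
bootstrap:
  dcs:
    synchronous_mode: true          # commits wait for synchronous standby acknowledgement
    synchronous_mode_strict: true   # never silently fall back to asynchronous replication
    synchronous_node_count: 2       # two synchronous standbys (three Postgres pods in total)
```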
We have configured WAL-G to back up to an internal S3 service backed by SeaweedFS, and WAL files are pushed there every 15 minutes. SeaweedFS consists of several components, which can end up distributed arbitrarily across the 3 cluster nodes.
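The WAL-G/S3 side is wired up through Spilo environment variables along these lines. The bucket name and endpoint below are placeholders for our internal SeaweedFS S3 gateway, and variable names should be checked against the Spilo version in use:

```yaml
# Illustrative Spilo pod environment; values are placeholders, not our real endpoints.
env:
  - name: USE_WALG_BACKUP            # take base backups with WAL-G
    value: "true"
  - name: USE_WALG_RESTORE           # fetch archived WAL with WAL-G (wal-fetch) during recovery
    value: "true"
  - name: WAL_S3_BUCKET              # bucket on the internal SeaweedFS S3 service (placeholder)
    value: "postgres-wal"
  - name: AWS_ENDPOINT               # SeaweedFS S3 gateway (placeholder address)
    value: "http://seaweedfs-s3.storage.svc.cluster.local:8333"
  - name: AWS_S3_FORCE_PATH_STYLE    # path-style addressing for a non-AWS S3 endpoint
    value: "true"
```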
We are testing failover time to recovery (i.e. how long until Postgres is writable again) when a node is lost. In that scenario, we see that promoting a synchronous replica to primary requires calls to S3 (wal-g wal-fetch). If S3 is unresponsive, promotion stalls until it becomes available again and the WAL archives can be read.
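What we observe during failover appears to be the effect of the restore command configured for recovery: it shells out to WAL-G, which has to reach the S3 endpoint. The following is illustrative only, not the exact command Spilo generates (Spilo wraps this in its own scripts), but it shows the shape of the dependency:

```yaml
# Illustrative only: the real restore_command set up by Spilo differs, but it
# ultimately asks WAL-G to fetch the requested WAL segment from S3. When the
# S3 endpoint is unreachable, this call hangs or fails and promotion waits on it.
postgresql:
  recovery_conf:
    restore_command: 'wal-g wal-fetch "%f" "%p"'
```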
The S3 service is not yet properly HA and may be down for several minutes when one of the cluster nodes is lost. Making SeaweedFS HA carries a significant cost (additional pods consuming memory/CPU, and data replication across multiple disks).
We are trying to understand whether there are strong reasons why access to S3 is needed to promote a synchronous replica during failover, and whether this can be avoided by changing how Spilo accesses WAL files during recovery.