Our product configuration runs Postgres with 2 synchronous replicas (3 Postgres instances in total) and synchronous_mode_strict enabled.
The 3 Postgres pods are spread across 3 nodes (virtual machines) in a Kubernetes cluster.
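For context, the synchronous replication side of this setup corresponds roughly to the following Patroni DCS settings. This is a sketch, not a verbatim copy of our configuration, and the exact way the values are fed in through Spilo may differ:

```yaml
# Sketch of the Patroni DCS settings implied by the setup described above;
# field names follow Patroni's documented configuration, values illustrative.
bootstrap:
  dcs:
    synchronous_mode: true          # commits wait for synchronous standby acknowledgement
    synchronous_mode_strict: true   # never silently fall back to asynchronous replication
    synchronous_node_count: 2       # two synchronous standbys (three Postgres pods in total)
```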
We have configured WAL-G to back up to an internal S3 service backed by SeaweedFS, and WAL files are pushed there every 15 minutes. SeaweedFS consists of several components, which can end up distributed arbitrarily across the 3 cluster nodes.
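The WAL-G/S3 side is wired up through Spilo environment variables along these lines. The bucket name and endpoint below are placeholders for our internal SeaweedFS S3 gateway, and variable names should be checked against the Spilo version in use:

```yaml
# Illustrative Spilo pod environment; values are placeholders, not our real endpoints.
env:
  - name: USE_WALG_BACKUP            # take base backups with WAL-G
    value: "true"
  - name: USE_WALG_RESTORE           # fetch archived WAL with WAL-G (wal-fetch) during recovery
    value: "true"
  - name: WAL_S3_BUCKET              # bucket on the internal SeaweedFS S3 service (placeholder)
    value: "postgres-wal"
  - name: AWS_ENDPOINT               # SeaweedFS S3 gateway (placeholder address)
    value: "http://seaweedfs-s3.storage.svc.cluster.local:8333"
  - name: AWS_S3_FORCE_PATH_STYLE    # path-style addressing for a non-AWS S3 endpoint
    value: "true"
```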
We are testing failover time to recovery (i.e. how long until Postgres is writable again) when a node is lost. In that scenario, we see that promoting a synchronous replica to primary requires calls to S3 (wal-g wal-fetch). If S3 is unresponsive, promotion stalls until it becomes available again and the WAL archives can be read.
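What we observe during failover appears to be the effect of the restore command configured for recovery: it shells out to WAL-G, which has to reach the S3 endpoint. The following is illustrative only, not the exact command Spilo generates (Spilo wraps this in its own scripts), but it shows the shape of the dependency:

```yaml
# Illustrative only: the real restore_command set up by Spilo differs, but it
# ultimately asks WAL-G to fetch the requested WAL segment from S3. When the
# S3 endpoint is unreachable, this call hangs or fails and promotion waits on it.
postgresql:
  recovery_conf:
    restore_command: 'wal-g wal-fetch "%f" "%p"'
```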
The S3 service is not yet properly HA and may be down for several minutes when one of the cluster nodes is lost. Making SeaweedFS HA carries a significant cost (additional pods consuming memory/CPU, and data replication across multiple disks).
We are trying to understand whether there are strong reasons why access to S3 is needed to promote a synchronous replica during failover, and whether this can be avoided by changing how Spilo accesses WAL files during recovery.