Problem: Finding replay_lsn with lsn tracking is 7.5 times slower #731

arajkumar · 2024-04-01T10:48:33Z

Currently, we maintain a LSN mapping which tracks the target insert LSN against current DML operation's LSN. This mapping will be used to determine the replay_lsn. The mapping is implemented using LIST and every insert executes pg_current_wal_insert_lsn procedure on target server. When there is a need find current replay_lsn, another procedure pg_current_wal_flush_lsn executed on target and respective source LSN is determinted by iterating the mapping LIST. Additionally, we persist the list into a sqlite table on each sentinel sync leading more delay due to local disk write.

Solution: Use replication origin progress lsn as replay_lsn

We create postgres replication origin on target and use it to track the apply progress which includes source operation LSN. We can simply reuse the origin progress tracking to find the replay_lsn. By doing this, we eliminate a Postgres procedure call on each transaction and bunch of local disk write by sqlite. Removing the above mentioned functionality improves the overall write throughput by ~7.5 times.

Without fix(main):

time pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql
09:48:19.678 112637 INFO   Running pgcopydb version 0.15.38.gc1bc3d9 from "/home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/pgcopydb"
09:48:19.726 112637 INFO   Using work dir "/tmp/pgcopydb"
09:48:19.793 112637 INFO   Setting up previous LSN from replication origin "pgcopydb" progress at 0/0
09:48:19.868 112637 INFO   Replaying changes from file "/home/ubuntu/dimitri/bench.sql"
pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql  9.25s user 11.48s system 7% cpu 4:54.31 total

122987 / 4m54s = 419 records/sec

With fix:

time pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql
10:01:42.701 113419 INFO   Running pgcopydb version 0.15.38.gc1bc3d9 from "/home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/pgcopydb"
10:01:42.750 113419 INFO   Using work dir "/tmp/pgcopydb"
10:01:42.814 113419 INFO   Setting up previous LSN from replication origin "pgcopydb" progress at 0/0
10:01:42.889 113419 INFO   Replaying changes from file "/home/ubuntu/dimitri/bench.sql"
pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql  13.22s user 5.46s system 47% cpu 39.205 total

122987 / 39.2s = 3,137 records/sec

TODO

Remove lsn_tracking functionality

dimitri · 2024-04-02T12:49:35Z

I'm not sure about your work here. The idea is that we are using { "synchronous_commit", "off" } when applying changes on the target, in order to avoid waiting for the disk flush at each transaction commit. Then we can't take the replay_lsn (nor the origin lsn) at face value, because at COMMIT time we did not wait to make sure the replayed transaction was indeed committed to disk yet.

Our code is a little complex around this, the goal is to report a trustworthy replay_lsn to the source system and replication slot, in order to ensure we can still replay any failed transaction later.

Maybe you could try removing all the code around the LSN mapping and using { "synchronous_commit", "on" } in the replay connection and see what are the performance implications of that?

arajkumar · 2024-04-02T14:27:26Z

I'm not sure about your work here. The idea is that we are using { "synchronous_commit", "off" } when applying changes on the target, in order to avoid waiting for the disk flush at each transaction commit. Then we can't take the replay_lsn (nor the origin lsn) at face value, because at COMMIT time we did not wait to make sure the replayed transaction was indeed committed to disk yet.

@dimitri Since we read the origin progress using a separate postgres connection, isn't the origin progress LSN representing a txn which is committed to disk?

dimitri · 2024-04-02T16:36:07Z

@dimitri Since we read the origin progress using a separate postgres connection, isn't the origin progress LSN representing a txn which is committed to disk?

So... the origin LSN is valid in the context of the source database. The commit-to-disk aspect is asked in the context of the target database, where the transaction has its own (different) LSN. The replay transaction is associated with a source LSN and happens on the target database with synchronous_commit set to off, which means that when the COMMIT query is sent and the code receives an answer, we don't know yet if the COMMIT made it safely to disk on the target.

That's why we have this complexity here. We want to make sure to report a replay_lsn to the source database that has already made it safely to disk on the target database, and that needs to be a source lsn, not a target lsn. The only way to bypass this complexity that I can see is to use synchronous_commit set to on, which means that the source LSN being replayed is safe to publish as the replay_lsn to the source system right at COMMIT response time (on the target).

arajkumar · 2024-04-03T06:14:44Z

@dimitri, let's consider a scenario where we're replicating 100 transactions from source to target with SET synchronous_commit=off. While sending those 100 transactions, we also update the origin progress with each transaction's commit LSN (from the source). Suppose that, so far, the target has co 8000 mmitted up to the 50th transaction to disk. If we run SELECT pg_replication_origin_progress('pgcopydb', true) on the target, wouldn't it return the LSN of the 50th transaction, which could then be considered as the replay_lsn?

dimitri · 2024-04-03T12:53:27Z

@dimitri, let's consider a scenario where we're replicating 100 transactions from source to target with SET synchronous_commit=off. While sending those 100 transactions, we also update the origin progress with each transaction's commit LSN (from the source). Suppose that, so far, the target has committed up to the 50th transaction to disk. If we run SELECT pg_replication_origin_progress('pgcopydb', true) on the target, wouldn't it return the LSN of the 50th transaction, which could then be considered as the replay_lsn?

Given Postgres docs for the function, it might work, yes. I am not sure it does though. I do not know how to compare the wording “flushed to disk” with the asynchronous_commit aspects of what we do. What kind of testing did you do around it? Did you study Postgres source code already to better understand the guarantees here?

Returns the replay location for the given replication origin. The parameter flush determines whether the corresponding local transaction will be guaranteed to have been flushed to disk or not.

arajkumar · 2024-04-05T11:25:20Z

@dimitri, I spent sometime with the postgres source, as per my understanding, SELECT pg_replication_origin_progress('pgcopydb', true) would ensure the local txn is flushed before returning the progress LSN.

Please refer this code snippet => https://github.com/postgres/postgres/blob/6f132ed693b6a0be0997f092c42abf14878362ac/src/backend/replication/logical/origin.c#L1044-L1045

Additionally, I wrote a simple SQL test to confirm,

Create replication origin

psql "$PG_URI" -f << EOF
SELECT pg_replication_origin_create('pgcopydb_test');
CREATE TABLE IF NOT EXISTS test(id int8);
EOF

Advance replication origin and verify it is always less than flush lsn

psql "$PG_URI" -f - << EOF
SELECT pg_replication_origin_session_setup('pgcopydb_test');

BEGIN;
INSERT INTO test VALUES(1);
SELECT pg_current_wal_insert_lsn() AS lsn \gset
SELECT pg_replication_origin_xact_setup(:'lsn'::pg_lsn, now());
INSERT INTO test VALUES(2);

SELECT  pg_replication_origin_progress('pgcopydb_test', true) < pg_current_wal_flush_lsn();
COMMIT;
EOF

I had used pg_current_wal_insert_lsn as a origin LSN for the ease of demonstration.

dimitri · 2024-04-05T11:43:14Z

Thanks for checking on that @arajkumar ! Next step would be to get rid of the whole lsn tracking infrastructure in the code then, including the SQLite catalogs entirely. Can you update your PR to get rid of all the extra code that's not needed anymore?

Currently, we maintain a LSN mapping which tracks the target insert LSN against current DML operation's LSN. This mapping will be used to determine the replay_lsn. The mapping is implemented using LIST and every insert executes `pg_current_wal_insert_lsn` procedure on target server. When there is a need find current replay_lsn, another procedure `pg_current_wal_flush_lsn` executed on target and respective source LSN is determinted by iterating the mapping LIST. Additionally, we persist the list into a sqlite table on each sentinel sync leading more delay due to local disk write. Solution: Use replication origin progress lsn as replay_lsn We create postgres replication origin on target and use it to track the apply progress which includes source operation LSN. We can simply reuse the origin progress tracking to find the replay_lsn. By doing this, we eliminate a Postgres procedure call on each transaction and bunch of local disk write by sqlite. Removing the above mentioned functionality improves the overall write throughput by ~7.5 times. Without fix(main): ``` time pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql 09:48:19.678 112637 INFO Running pgcopydb version 0.15.38.gc1bc3d9 from "/home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/pgcopydb" 09:48:19.726 112637 INFO Using work dir "/tmp/pgcopydb" 09:48:19.793 112637 INFO Setting up previous LSN from replication origin "pgcopydb" progress at 0/0 09:48:19.868 112637 INFO Replaying changes from file "/home/ubuntu/dimitri/bench.sql" pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql 9.25s user 11.48s system 7% cpu 4:54.31 total ``` 122987 / 4m54s = 419 records/sec With fix: ``` time pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql 10:01:42.701 113419 INFO Running pgcopydb version 0.15.38.gc1bc3d9 from "/home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/pgcopydb" 10:01:42.750 113419 INFO Using work dir "/tmp/pgcopydb" 10:01:42.814 113419 INFO Setting up previous LSN from replication origin "pgcopydb" progress at 0/0 10:01:42.889 113419 INFO Replaying changes from file "/home/ubuntu/dimitri/bench.sql" pgcopydb stream apply --resume --not-consistent ~/dimitri/bench.sql 13.22s user 5.46s system 47% cpu 39.205 total ``` 122987 / 39.2s = 3,137 records/sec Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>

Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>

arajkumar · 2024-04-05T14:06:12Z

@dimitri, I removed all the unwanted lsn tracking infra.

dimitri · 2024-04-05T14:12:14Z

Thanks again for all the work @arajkumar !

arajkumar force-pushed the arajkumar/origin-progress-as-replay-lsn branch from 0d243ff to 48663f9 Compare April 3, 2024 06:15

arajkumar added 2 commits April 5, 2024 13:49

Remove LSN tracking infra

8fc6a40

Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>

arajkumar force-pushed the arajkumar/origin-progress-as-replay-lsn branch from 48663f9 to 8fc6a40 Compare April 5, 2024 14:00

dimitri approved these changes Apr 5, 2024

View reviewed changes

dimitri assigned arajkumar Apr 5, 2024

dimitri added the enhancement New feature or request label Apr 5, 2024

dimitri added this to the v0.16 milestone Apr 5, 2024

dimitri merged commit 3f07e0b into dimitri:main Apr 5, 2024

dimitri mentioned this pull request Apr 23, 2024

[SQLite 5] database is locked #734

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Problem: Finding replay_lsn with lsn tracking is 7.5 times slower #731

Problem: Finding replay_lsn with lsn tracking is 7.5 times slower #731

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Problem: Finding replay_lsn with lsn tracking is 7.5 times slower #731

Problem: Finding replay_lsn with lsn tracking is 7.5 times slower #731

Uh oh!

Conversation

Uh oh!

Without fix(main):

With fix:

TODO

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Create replication origin

Advance replication origin and verify it is always less than flush lsn

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!