Description
Logically, each row in CockroachDB has a crdb_internal_origin_time and a crdb_internal_mvcc_time. Physically, the crdb_internal_mvcc_time is encoded as part of the Pebble key and the crdb_internal_origin_time is encoded in the MVCC header.
Logical replication uses the crdb_internal_origin_time if it is present and falls back to the crdb_internal_mvcc_time if there is no origin time. As of 25.2, all write operations validate the origin time: a write fails if its incoming origin time is < COALESCE(crdb_internal_origin_time, crdb_internal_mvcc_time).
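The comparison described above can be sketched as follows. This is a minimal model, not CockroachDB's KV code: the function names are hypothetical, and plain integers stand in for HLC timestamps.

```python
from typing import Optional


def effective_time(origin_time: Optional[int], mvcc_time: int) -> int:
    """Model of COALESCE(crdb_internal_origin_time, crdb_internal_mvcc_time).

    origin_time is None when the existing value carries no origin time.
    """
    return mvcc_time if origin_time is None else origin_time


def loses_lww(incoming_origin_time: int,
              existing_origin_time: Optional[int],
              existing_mvcc_time: int) -> bool:
    """Return True if the incoming write fails origin time validation.

    Assumption: the check is a strict '<', as written in the description.
    """
    return incoming_origin_time < effective_time(existing_origin_time,
                                                 existing_mvcc_time)
```

With this model, an existing value that has no origin time is compared using its MVCC time, which is the case that matters for backfilled index keys.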
Validating origin time on all operations runs into an issue with index backfills. Index backfills don't set the origin timestamp on the index. As a result, a replicated write can generate a spurious LWW failure when it attempts to delete the index key.
Consider the following table:
CREATE TABLE replicated (
  id STRING PRIMARY KEY,
  value STRING
);
SELECT *, crdb_internal_origin_time, crdb_internal_mvcc_time FROM replicated;
("foo", "foo-value", t1, t3)
This is encoded in a key that looks something like
/<TableID>/1/"foo":t3 -> t1,"foo-value"
If we add an index on value, it generates a second key:
/<TableID>/2/"foo-value"/"foo":t5 -> {}
If LDR attempts to replicate the row ("foo", "foo-value-2", origin_time: t4), that row would need to delete /<TableID>/2/"foo-value"/"foo":t5. The index key has no origin time, so validation falls back to its MVCC time t5. Since t4 < t5, the delete would fail origin time validation and the write would be dropped as an LWW loss.
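The scenario can be replayed with a small simulation. This is a sketch under stated assumptions: the `Version` type and `passes_origin_check` are hypothetical names, the integers 1, 3, 4, and 5 stand in for t1, t3, t4, and t5, and the store is a plain dictionary rather than Pebble.

```python
from typing import NamedTuple, Optional


class Version(NamedTuple):
    mvcc_time: int
    origin_time: Optional[int]  # None: no origin time in the MVCC header


def passes_origin_check(incoming_origin_time: int, existing: Version) -> bool:
    """Model of the check: fail if incoming < COALESCE(origin, mvcc)."""
    if existing.origin_time is not None:
        effective = existing.origin_time
    else:
        effective = existing.mvcc_time
    return not incoming_origin_time < effective


PRIMARY_KEY = '/<TableID>/1/"foo"'
INDEX_KEY = '/<TableID>/2/"foo-value"/"foo"'

store = {
    # Primary row: written at t3 with origin time t1.
    PRIMARY_KEY: Version(mvcc_time=3, origin_time=1),
    # Backfilled index entry: written at t5 with no origin time.
    INDEX_KEY: Version(mvcc_time=5, origin_time=None),
}

# Replicated write with origin_time t4 touches both keys.
incoming_origin_time = 4
results = {key: passes_origin_check(incoming_origin_time, version)
           for key, version in store.items()}

for key, ok in results.items():
    print(key, "passes" if ok else "fails origin time validation")
```

In this model the primary key passes because t4 is not less than t1, while the index key fails because the comparison falls back to t5.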
Fix Ideas
- We could undo the KV change that validates origin time on all operations. We would need to enhance SQL inserts to issue cputs with origin times.
- We could change schema changes so that index backfills set the origin time in the MVCC header on the index keys they write.
Jira issue: CRDB-50396