Another issue with memcs is memory duplication. In the case of strings, for example, if we use the developed generic build, then the new index will have duplicate values for string columns, because we get PK rows as tuples and then parse these tuples back to insert into the new index. I guess we need to work out some interface to be able to pass rows without serialization/deserialization to tuples when iterating and inserting, so that we can use the generic index build in memcs. Currently memcs strings are duplicated in indexes, but we are going to fix this (https://github.com/tarantool/tarantool-ee/issues/1102). This issue is similar to the one mentioned in https://github.com/orgs/tarantool/discussions/11018#discussioncomment-12642403, yet I wanted to write it down.
This approach suffers from one annoying issue, which lies in how this RFC proposes to track duplicates: it cannot prevent the occurrence of duplicates when memtx/vinyl MVCC is enabled. In MVCC, duplicate checking is also not performed in a single step. For the inserted tuple, a separate story is created and placed in the corresponding chains for each index. Then a check is done for each "in progress" story. Thus, this approach, at the very least, requires the implementation of more complex mechanisms for the building index, similar to those used in MVCC.
Reviewers
Tickets
Changelog
Renamed `build=<mode>` to `wait=<bool>`. Wrapped the rejected alternatives into drop-downs, to make the document smaller. Added a new open question about the `wait` name. The `_index_local` state is dropped in case of a successful build; temporary indexes are explicitly mentioned. The build goes via the `space::on_replace()` trigger now.

Summary
There are 3 large problems right now. The document tries to address them all in a single solution.
Problem 1 - replication
When a space is big enough, building a new index on it can take quite long - minutes or hours, depending on the space size. The same applies to index alter - it might require an index rebuild, a space fullscan. That isn't a big deal locally on the instance, because the build is asynchronous - transactions can still be processed, even on the changing space.
But it gets complicated in a cluster, for the following reasons.
Replication gets stuck in a replicated cluster. Yes, the index build is async fiber-wise, but it blocks the current fiber. The blockage happens on-replace into the `_index` space, not on-commit. Because of that, the applier's feature of committing txns asynchronously doesn't help - the longest part happens before the commit.

The replica's lag will grow; it won't receive any new data until the build is finished. But the replication is still alive, and at least it doesn't block transaction processing on the master when the replication is asynchronous. Unlike the next problem.
Master transaction processing gets stuck in a synchronously replicated cluster, because the index build transaction on the master blocks the limbo until the appliers also apply it and write it to their WALs. And that will last until a quorum of replicas has finished the index build.
Essentially, in a synchro cluster with large spaces it becomes impossible to create new indexes. It requires hacks, like creating a new space with all the needed indexes and the same format, then slowly copying the data from the old space in multiple small transactions, then deleting the old space. It doesn't sound complex, really, but it requires the user to change their code to maintain this "migration" process by writing into both the old and new spaces while the copying is in progress.
Problem 2 - concurrent rollbacks
The idea behind the yielding index build (it yields every few thousand tuples inserted, to let other fibers work) was that it would allow building large indexes without stalling the entire TX thread.
But it adds concurrency problems. Like, what happens if the building index took a tuple which was rolled back later? Some of those cases are already handled in the code. Normally such tuples are also deleted from the index. However, in a few situations this won't happen: #10929, #10930, #10931.
Those all seem to be united by a shared fundamental issue: the building index is invisible to transactions. They implicitly add tuples into it (via the `space:on_replace` trigger), but they can't delete tuples from it (because there is no such trigger as `space:on_rollback` or something, and `:on_replace` isn't called again for rolled back statements, which would be wrong to do anyway).

This brings in all sorts of nasty bugs, like when a txn was started during an index build, its commit was started, then the index build was committed, and then the txn got rolled back due to a WAL write error or because the synchronous replication leader was changed and cancelled this txn. Such events have access to normal indexes via the `struct space::index` array and can clean their garbage out. The building indexes OTOH are invisible and sometimes keep the trash tuples in.

Problem 3 - nontrivial index alter is unpredictable for users
Imagine an index had 2 key parts. The user's code was using `index:get/delete/update` based on these 2 parts. The types don't even matter.

Now imagine the user wants to make the index have 3 parts. The user would do `index:alter({parts = new_3_parts})`.
In both async and sync replication this transaction is going to become visible on the master earlier than on the replicas. It means that for a while the nodes in the replicaset have different definitions of the same index. Code working on one node, like `index:get{part1, part2}`, won't work on another node. And then, at a random point in time, it would stop working on the first node too.

The user can't switch their code to the new key parts at once or in any other controllable way, besides adding checks right in the code for how many parts an index has, and acting depending on that.
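For illustration, assuming a hypothetical space `users` whose index `pk` is being extended from 2 to 3 parts (the error message is the one Tarantool produces for an exact-match lookup with an incomplete key):

```lua
-- On a node where the index still has 2 parts, this lookup works:
box.space.users.index.pk:get({1, 'a'})
-- On a node where the 3-part definition has already arrived, the same
-- call fails: "Invalid key part count in an exact match (expected 3, got 2)".
```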
This is very unhandy - to the point that it doesn't seem like anybody is using such backward-incompatible alterations in prod.
OTOH, alters of the sort of an `integer -> number` type change or `unique -> non-unique` are fine. They get executed immediately, because they don't require an index rebuild or re-check, and the old code works with the new definitions too (besides `index:get()` when uniqueness is lost, but it can be replaced with `:select()` before the alter).

✅ Solution: non-blocking background index build
Let's see what can be done at a high level.
Consider the first problem - a long index build blocks replication, because the transaction can't be committed until the index is built. The solution is right here - allow committing the transaction before the index is complete. Build the index lazily, in the "background" (blocking neither the current fiber nor the transaction), after the commit. The index entry is added to `_index` instantly, and the txn gets committed as fast as if the space was empty.

Coincidentally, this also contains the key to the solution of the second problem - the index becomes visible in `_index` and the `struct space::index` array instantly, before its build is started. It means that if the index starts building after the commit into `_index`, it would automatically 1) apply all new transactional statements to the already inserted keys, 2) apply rollbacks for the already inserted keys from all the rolled back transactional statements. Because all of that is just DML, and DML always iterates over the entire `struct space::index`, both on transaction statement creation and on rollback. See `memtx_space_replace_all_keys()` for Memtx, and `vy_replace()` and the other similar ones for Vinyl.

The presence of alter directly affects how complicated this idea is going to be in the details. If alter exists and an index alter/create is instantly visible, then each index would have 2 states: actual and target (build in progress), and both would have to be maintained at least in the system spaces and in the internal objects like `struct index`. Besides, no matter how this is designed, the original issue of nontrivial alter being unusable in a replicaset won't go away. So the proposal is to forbid nontrivial index alter starting from the 4.x version. This simplifies the rest of the proposal and won't allow the users to shoot their own legs off.

Since the non-trivial alter support is being deleted entirely, the users can't upgrade to 4.x if their xlogs contain such alters. Thus the users potentially having alters in their xlogs must stop doing more of such operations, make a snapshot, and only then upgrade to 4.x. In order for that not to be a surprise, Tarantool 3.x will start printing a warning when a non-trivial alter happens, saying that such an operation won't be possible in 4.x.
The same problems exist for space alteration (it might require validating all the data, which makes such a transaction long and blocks the replication), but they are ignored for now. They are not considered as important as the index creation issues.
Regarding trivial alter - it includes not only type extensions and uniqueness drop, but also engine-specific settings which don't require a rebuild or validation of the entire index. For Vinyl those would be settings like the page size. For Memtx - hints in the tree index. Changing the hints is going to be allowed, but the change will only take effect after a snapshot + restart, because turning hints on/off requires a rebuild, which won't be possible on the fly in 4.x.
The remaining part of the document tries to elaborate this new way of building the indexes in greater detail.
Build state
Imagine that an index is being built after a commit into `_index`. This raises several questions - when the index becomes usable, what happens on restart, how a build error surfaces.

Assume the replication doesn't exist yet. It is a good idea to solve those issues as conveniently as possible in the scope of one instance, and then try to scale the best solution across the replicaset.
Let's introduce some sort of index build state. Among other things, it must be in place before `box.cfg()` completes, and it shouldn't force anyone to touch `_index` directly. It would also be nice not to break `_index` too much.

The proposal is to create a new replica-local space `_index_local`. It has the following format: `[space_id, index_id, state]`.
The `state` field describes the index's local state when it doesn't exactly match the row in `_index`. Currently it would only contain the `build` status (`in progress` or `fail`) and, on failure, an `error` object. It is extensible, since this is a map, and it can be indexed using JSON indexes if needed in the future. For example, it might be handy in the future to support nontrivial alter again.
When an index is created, it is instantly committed into the `_index` space as it is now, and into the `_index_local` space with the `in progress` build state. The goal of the instance is to build the index so that it eventually reaches the fail or success state. If it fails, the `state.build` automatically becomes `fail` and the `state.error` is filled with an error object encoded in MessagePack. In case of success the row is just deleted from `_index_local`.
The `_index` space format and the meaning of all its fields remain as is. The only difference is behavioral - the commit happens instantly, but the index itself isn't usable until it is fully built later.

An example: imagine there is a space with an ID `512`, and a new index is created for it. After a while it gets built, and the tuple is dropped from `_index_local`. Now imagine the parts are changed to `{{2, 'number'}}`. Note how `_index` is just changed right away, and `_index_local` wasn't affected, because a rebuild wasn't needed and becomes not possible anyway. So the local and global states of the index match, and the local one is then just dropped. A sketch of how these stages could look in the tuples follows below.
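The three stages, reconstructed hypothetically (the `_index` tuples follow the usual `[space_id, index_id, name, type, opts, parts]` layout; the index name, type and opts below are made up):

```lua
-- Right after the index creation is committed:
box.space._index:get{512, 1}
-- [512, 1, 'secondary', 'tree', {unique = true}, {{1, 'unsigned'}}]
box.space._index_local:get{512, 1}
-- [512, 1, {build = 'in progress'}]

-- After the build succeeds, the local row disappears:
box.space._index_local:get{512, 1}
-- nil

-- After the trivial alter of the parts, only _index changes:
box.space._index:get{512, 1}
-- [512, 1, 'secondary', 'tree', {unique = true}, {{2, 'number'}}]
```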
DDL validation and automatic alteration

Imagine the above is done. This means that insertion and deletion operations on the `_index` space would also have to touch the `_index_local` space. Including potentially existing old code from users who could be touching the `_index` space directly.
Users must not do that, yes (touch system spaces directly), but it is easy to think of a use case - a backup tool. If one exists out there, it might be scanning each space in Tarantool, including `_index`, not even looking at which of them are system or not, saving the tuples somewhere, and then inserting them again on a new instance to restore from the backup.

If the new build way required manually updating `_index_local`, it would break such simple but useful apps.

Luckily, we can make `_index_local` update automatically, right from an `_index:on_replace()` trigger. It is not hard to do.

Given such drastic changes in the index build way, it makes sense to list all the new DDL validations and hooks:
- A nontrivial alter like `number -> integer` is no longer allowed.
- When an index is inserted into `_index`, the trigger must make a replace into `_index_local` to create its build state, which is going to be `in progress`.
- When an index is deleted from `_index`, the trigger must also delete it from `_index_local`.
- An insertion into `_index_local` is only allowed if all of the following is true: the index exists in `_index`, its `state.build` is `in progress`, and its `state.error` is `nil`.
- A deletion from `_index_local` is only allowed if there is no more entry for it in `_index`, or the index is successfully built.
- An update of an `_index_local` tuple can only be done in the following cases: changing `state.build` from `in progress` to `fail` - and the index must be actually failed (again, this will be visible in `struct index`); or changing `state.error` from `nil` to an error object - and the `build` must be `fail` then.

After the nontrivial alter deprecation and removal, the checks above will be even simpler than the current state of alter.cc.
All those changes to the `_index_local` space and all the validations are automatic and happen in `_index:on_replace()` triggers.

It means that if there is any code doing `_index:update/delete/replace/...`, it would keep working. With nuances, but simple things will remain functional.
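A minimal Lua sketch of that bookkeeping, assuming only the simplest cases (the real implementation would live in C next to alter.cc; the trigger body here is purely illustrative):

```lua
-- Keep _index_local in sync from an _index trigger (illustrative only).
box.space._index:on_replace(function(old, new)
    if old == nil and new ~= nil then
        -- A new index appeared: register its build state as 'in progress'.
        box.space._index_local:replace({new[1], new[2],
                                        {build = 'in progress'}})
    elseif old ~= nil and new == nil then
        -- The index was dropped: drop its local build state too.
        box.space._index_local:delete({old[1], old[2]})
    end
    -- Trivial alters (old ~= nil and new ~= nil) don't touch _index_local.
end)
```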
Access

Right now, if an index is in the `struct space::index` array, then it is usable. And its vtab points at Vinyl's LSM or Memtx hash or tree functions. The vtab only depends on the index type.

Assuming an index is added to `struct space::index` instantly, even before it is complete, this can't be left like that. Otherwise incomplete indexes would be readable, which is not acceptable. And yet the DML transactions and their rollbacks must be allowed to happen on such indexes, to a limited extent (such as ignoring updates happening on not yet indexed tuples, or rolling back changes done on already indexed tuples).

To support that, there is going to be a new type-agnostic index vtab. It replaces `struct index::vtab` while the build is in progress. It saves the original type-specific vtab somewhere and proxies to it all the updates that are allowed to be done. All the DQL (get, select, pairs) is going to be rejected at the level of this new vtab.

When the build is complete, the proxy-vtab is dropped, and the index becomes fully usable - if the build was successful. Otherwise, if the build has failed, the vtab rejects all DQL with an error.
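A conceptual Lua model of the proxying (the real vtab is a C struct of function pointers; the method set below is illustrative):

```lua
-- Wrap a "vtab" (here just a table of methods) so that DQL is rejected
-- while the build is in progress, and the allowed updates are forwarded.
local function make_build_vtab(real_vtab)
    local rejected = {get = true, select = true, pairs = true}
    local proxy = {}
    for name, method in pairs(real_vtab) do
        if rejected[name] then
            proxy[name] = function()
                error('the index is not built yet')
            end
        else
            proxy[name] = method -- forward the allowed updates as-is
        end
    end
    return proxy
end
```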
The DML is at the engine level, so there is a section about it further in the document.
Replication
Luckily, the approach above will just work in a replicaset without any changes needed on top. It wasn't so simple in the previous iterations of the document, but the currently proposed way is already replication-friendly.

Indeed, assume a master added a new tuple to the `_index` space. It gets committed immediately, and then the build works in the background. The master-local `_index_local` space is updated via `_index:on_replace()`.

The `_index` tuple gets replicated and reaches the replicas. They will also execute their own `_index:on_replace()` triggers and update their `_index_local` space. From this moment on, it all works the same as on the master.

Lua and other APIs
The whole idea of going through all those vast changes was that the Lua API of index creation and alteration won't change. Not even just the methods of the index Lua object, but ideally even its "low-level" API - the `_index` space operations. How it was working before should keep working in the new versions, at least in the simple cases, and yet not have the problems described in the beginning of the doc.

It means that indexes created via `space:create_index(...)` would internally start using the new way of index creation. The `:create_index()` method would now launch the index creation by committing it to the `_index` space, and then block the current fiber to wait until the index is built successfully. From the user's PoV it would look the same as before, on the success path.
To support non-blocking index build there is going to be a new option: `space:create_index(name, {wait = true/false, other_opts...})` - `wait`. It is a flag which defines what `create_index()` should do - wait for the index to get created and built (`wait = true`, the default), or start its build and leave (`wait = false`).
If `wait = true` and the index build fails, then `space:create_index()` will also drop the index, so it doesn't remain in a failed state, clogging the `_index` space and hanging there forever. If `wait = false`, then the failed index remains hanging in the `_index` and `_index_local` spaces until it or the entire space is dropped.

The option only regulates how `space:create_index()` works. It is not propagated to the `_index` space and isn't stored anywhere.

The changes which would become visible to the users in any non-trivial setup are the following:
- If `space:create_index()` fails before reaching the end of the waiting, or the instance gets restarted, then the failed index will stay in the `_index` space. It might even succeed in building, if the instance was restarted and after the restart the index finished the build successfully. In other words, `space:create_index()` is not transactional anymore.
- `_index:insert()` would get executed immediately, but won't actually produce a fully working index.
- `space:create_index()` on success does not guarantee anymore that with synchronous replication the index is created on a quorum of replicas. Previously it was ensured, but now the index creation success only means local success with eventual consistency on the replicas.
- `index:alter()` is no longer possible for any non-trivial changes.

The absolutely non-problematic API changes will happen in the Lua index object - it will get a new field `state`, which will show the build status and an error if the build has failed.

Also the index will get a new method `:wait(timeout)`. It will block the current fiber until the index build is finished. If the build ended with an error, then the method will throw that error. Otherwise it returns true. On timeout the function returns false.
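Put together, the non-blocking mode could look like this (a sketch using the proposed `wait` option, `state` field and `:wait()` method; the space and index names are made up):

```lua
-- Start the build without blocking the current fiber.
local index = box.space.users:create_index('by_age', {
    parts = {{2, 'unsigned'}},
    wait = false,
})

-- Do some other work, then wait up to 60 seconds for the build to end.
-- :wait() throws if the build failed and returns false on timeout.
if not index:wait(60) then
    print('still building, state: ' .. tostring(index.state))
end
```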
Internals

Here are some implementation details that are worth mentioning before the implementation starts.
Temporary spaces
Some spaces are fully temporary. I.e., not only are they replica-local, they aren't even stored in the WAL. They are visible in `_space`, but absent in the xlogs, and they aren't saved into the snapshots.

Their indexes are visible in `_index`, but are also temporary. I.e., not persisted.

The new space `_index_local` must take that into account. All its updates related to temporary indexes must also skip the WAL and snapshots.

Worker fibers
It is suggested to create one fiber for each background index build, and store it directly in `struct index` (maybe not directly, but in some new member of `struct index` which describes the entire build state). This would be easier to control in the code, when the entire index state is in one place, easily visible directly in each index object. The fiber would be forcefully cancelled and joined when the index gets dropped. And the state would delete itself automatically if the build completes to the end (success or fail, doesn't matter). The fiber belongs to the system - it is visible, but not cancellable or joinable by the users.

Having one fiber per index would also simplify storing the build context and making `_index_local` transactions. If there was a single fiber for all indexes, it would have to store some sort of a context for each index anyway, attach/detach txns/read-views/whatever, and this would look clumsy.

At the end of its work the fiber will try to update the `_index_local` tuple with the actual state, or drop it. If the txn fails for any reason (WAL, conflict), then the fiber will retry every few seconds.

It is still an open question though. In reality it might happen that having a fiber per index would be harder to control - to shut them all down at once, for example, when the instance is being shut down. This might change during the implementation.
Generic build algorithm
The current code of the index build is engine-specific. See `memtx_space_build_index()` and `vinyl_space_build_index()`. They are mostly identical really, by the main logic. Just a few tweaks are done in each of them to support the Memtx and Vinyl transaction managers, isolation, and the various recovery stages.

The recovery-stage tweaks seem to be replaceable by `enum box_recovery_state`, which could unify the build process, and on recovery the code would fall back to some new methods like `engine->recover_index()` or something.

The transactional tweaks seem to come mainly from the necessity to build the index in a "read view", in a single long transaction. They try to protect the new index from rollbacks of existing transactions by doing `wal_sync()`, by aborting concurrent txns on the same space, etc., but it doesn't really work, as seen in the tickets mentioned in the beginning of the document.

The worst part is that it can't even possibly work okay when synchronous replication is enabled. Because for that, the old way of index building has to wait until all the limbo transactions are confirmed, to be sure their tuples won't be rolled back. However, while the new index is waiting for the limbo confirmation, the limbo might get new transactions! This is the problem of Achilles, who can never catch up with the tortoise. It could be resolved by freezing the limbo entirely while the index is waiting, or by trying to commit the index enabling via the limbo too. But that all doesn't sound very reliable and requires the index build process to be aware of where the transactions spend time before the commit (WAL, limbo, what next?). Besides, a limbo freeze would affect the master's behaviour for regular unrelated non-DDL txns.
Instead, the new way of index build suggests thinking of it as a "functional index whose function is true for all keys < the current build key", where the function's key gets updated gradually, moving forward. It is logically identical to creating a new space with the new index as the primary key and copying all tuples there from the old index, while having an `:on_replace()` trigger on the old space to catch updates and repeat them for the new space in the same transaction.

Here is the proposed algorithm:
1. The index is added to the `struct space::index` array on commit into `_index`. It becomes visible everywhere, but not available for DQL (at the `index_vtab` level).
2. An `on_replace` trigger is set for that space. The trigger must do the following: for statements not yet covered by the build, set an `on_commit` trigger for that statement; on a conflict with the new index, that `on_commit` trigger must abort the new index build.
3. The build key starts as `nil` (= the minimal possible key).
4. The primary index is iterated with `READ_CONFIRMED` isolation. On each iteration a portion of tuples is inserted into the new index, and the build key moves forward.

Pros: engine-agnostic, besides perhaps some assertions which might have to be added in the engines' code.
Cons: uses the `space::on_replace` trigger, which will make the user's DML on a Vinyl space always do reads.

A couple of details about the above-written procedure:
Why a `txn_stmt::on_commit` trigger only for the abort, and why do the conflict check right inside the transaction? Because in Vinyl the check might require reading, which can yield, and `on_commit` triggers can't yield.

We can then get duplicates where the new tuple is `{3, 2}` and the old one is `{5, 2}`. The old one's PK is greater, but it is still a bad duplicate, with which the build must be aborted.

However, this won't make the code entirely engine-agnostic anyway. To support this way of building, the engine must provide a way to "insert" into an index, not into the whole space. Most likely it can be done by implementing `index_vtab::replace` for Vinyl, or by adding a new method specifically for building. Luckily, it won't have to support anything but insertion. No deletes, updates, or anything alike.
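A conceptual Lua sketch of the build loop described in this section, assuming a hypothetical build-only insertion method `insert_key()` (the real loop is C code at the engine level; the chunk size and transaction options are illustrative):

```lua
local fiber = require('fiber')
local key_def = require('key_def')

-- Move the build key forward in chunks, so the "function is true for all
-- keys < current build key" invariant grows until it covers the space.
local function build_index(space, new_index)
    local pk = space.index[0]
    local pk_def = key_def.new(pk.parts)
    local build_key = nil -- nothing is indexed yet
    while true do
        box.begin({txn_isolation = 'read-confirmed'})
        local batch = pk:select(build_key, {
            iterator = build_key and 'GT' or 'GE',
            limit = 1000,
        })
        for _, tuple in ipairs(batch) do
            new_index:insert_key(tuple) -- hypothetical build-only insert
        end
        box.commit()
        if #batch == 0 then
            break -- the whole primary index is processed
        end
        build_key = pk_def:extract_key(batch[#batch])
        fiber.sleep(0) -- let other fibers work between the chunks
    end
end
```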
Upgrade and schema

The script `upgrade.lua` becomes problematic, because to make an initial snapshot on 4.x it would have to create the initial schema from 1.7.1 and then execute all the updates of all the versions up to 4.x, including alterations. If so, then the file would have to be squashed to only support the initial schema of the latest 3.x as the first schema of 4.x.

Besides, there is a dummy schema created in `schema_init()` right now. It creates some system spaces with default settings (like all indexes being trees) which mismatch what is later applied from the snapshot, and that triggers an alter. So `schema_init()` will have to be rewritten to create all the spaces in their latest versions.

👌 Answered questions
What to do with `_vindex`? Should it work the same as now? Returning the build state from it wouldn't change the old connectors, which won't read that state anyway. So as long as an index is returned from `_vindex`, no matter its state, the old connectors will keep seeing this index as active. I propose not to touch `_vindex` at all then.

How would a hash index work? Firstly, it only exists in Memtx. Secondly, if the primary index is not a hash, then the build of a secondary hash index works the same as for a tree index. But if the hash index is the primary one, then it is problematic - the build code can't yield, because in a hash index there is no order. Iteration of the primary index has to be done with no yields, because otherwise some insertion might happen during a yield, and there would be no way to tell whether it happened on already processed tuples or on not yet processed ones. The primary index build key makes no sense, basically.
It works the same way in the current way of index building, and it would remain like this. Just that this yield-less build would now happen after the commit into `_index`, not before.

Alternatives
♻️ Solution: alter support and better backward compatibility
The v6 version of the document proposed a solution which would keep the nontrivial alter support and wouldn't require any new spaces. The key idea was that an index can have 2 states: actual and target. The actual one is the replica-local state, the one usable for DQL. The target one is the global state set by the master. The replicas try to reach it by building the target version and eventually replacing the actual one with the target one.
The replica-local state was stored in a new field in the `_index` space. The field was storing the replica-local state and was itself also replica-local. I.e., the statements touching it would have the `GROUP_LOCAL` replication group.

The solution wasn't accepted, because it was overly complicated in its attempts to keep the nontrivial alter alive, which apparently isn't even needed.
♻️ Solution: lazy index with auto-complete
The currently chosen solution with "lazy indexes" requires the user not only to create the lazy index, but also to manually make it non-lazy when the build is complete. Why not make the index update the build mode automatically when it gets ready? Just make an internal transaction which changes the `opts.build` to `'now'` (or `wait = true`, or whatever else the naming in the primary solution is atm).

There are a few reasons why not, among them that it would mean the instances automatically altering the `_index` entry on multiple nodes. That doesn't look safe.

♻️ Solution: nothing
The issue in the ticket isn't really a bug. It is an inconvenience, which has a workaround explained right in the intro - copy the space manually, write to both spaces, then replace the old one. The only problem is that the user has to support that in their code.

Let's repeat the solution here for clarity. When a user wants a new index, an existing one altered in a non-trivial way, or a space format changed, they do this:

- Create a new space with the same format and the needed indexes.
- Set an `on_replace` trigger on the old space, which does the same work on the new space.
- Copy the data from the old space in multiple small transactions.
- Drop the old space and switch to the new one.

Pros: don't need to do anything, it already works.

Cons: the user has to implement and maintain all of that in their own code.
♻️ Solution: replica-local index
The problem of index creation/alter hits the replication hard. One approach could be to attack the replication shortcomings then. That is, drop the replication from the process.

Let's imagine that the replicas and the master could build the same indexes independently, fully locally. And when finished, the master would "enable" this index in a single small DDL transaction.
The index creation would then be a 2-step process: 1 - create a local index on all replicas; 2 - turn the local index into a global one on the master.

This needs 2 features which aren't available yet, but aren't hard to add: replica-local indexes on arbitrary spaces, and an alter that turns a local index into a global one.
Replica-local DDL is not unusual for Tarantool. There is right now a space type `temporary` (not to be confused with `data-temporary`). It can be created on read-only replicas, can have its own indexes, is visible in `_space` and its indexes in `_index`, but it is not replicated, and its data isn't stored in the WAL.

Replica-local persistent data is also not a new thing. Tarantool does have "local" spaces. They have replicaset-global meta (`_space` and `_index` rows) and their data is persisted, but not replicated. They can only be created by the master, but can take DML on any instance, and it is not replicated.

The proposal is to introduce replica-local indexes. They can be created by any replica, even a read-only one, on absolutely any space. Such an index is persisted in `_index` and is not replicated.

Creation of the index will not affect replication at all, and won't block the limbo, because replica-local transactions are not synchronous by definition.
To create a new global index, the user would then go and create a replica-local index on each instance.

Then, to make it global, the user would do `index:alter{is_global = true}` on the master instance. Locally it works instantly. When this txn comes to the replicas, each will try to find a replica-local index in this space with all the same meta besides the index ID. If found, it also works instantly, by changing the index ID to the global one (the ID is primary, so it would mean moving the local index's data to the new global index with the global ID, and dropping the empty local index). If not found, a new index is created as usual.
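A hypothetical sketch of the two-step flow (the `is_local` option name is invented here for illustration; only the `is_global` alter is mentioned by the proposal):

```lua
-- Step 1, on every instance in the replicaset (including read-only ones):
box.space.users:create_index('by_age', {
    parts = {{3, 'unsigned'}},
    is_local = true, -- invented option name
})

-- Step 2, on the master only: promote the local index to a global one.
box.space.users.index.by_age:alter({is_global = true})
```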
The solution not only allows creating/altering indexes in the cluster bypassing the replication, but also allows the user to purposefully create replica-local indexes without ever making them global. That could be handy to reduce memory usage on the master and speed up the master's DML. The master would only store the unique indexes and handle DML, and the replicas would store the other indexes + serve DQL.

The con is that the user has to visit each replica to create the replica-local indexes in the first step.

Pros: introduces a new feature - replica-local indexes, which can be used not only for replicaset-wide index building.

Cons: needs 2 steps, one of them to be done on each instance in the replicaset. Including new instances, where this index won't appear automatically.
♻️ Solution: space clone
This one is discarded, because it tries to re-invent `space:upgrade()`, which is already available in EE.

The proposal attacks the issue from another angle - if a space alteration is long, then in another DB users would typically make a second space with all the same meta + the alterations, then copy the data from the old space to the new one + repeat the changes to the already copied keys. We could do the same, but automate it.
That is, Tarantool would allow cloning a space with any of its indexes and metadata altered. Once the cloning is done, the user could do the final "drop + rename" themselves.
If designed carefully, this could be an interesting tool for more than just new index creation.
In Lua code it could look like this:
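A purely hypothetical sketch - neither the `clone` method nor its options were ever specified:

```lua
-- Hypothetical API: clone a space with altered metadata, then switch over.
local clone = box.space.users:clone('users_v2', {
    format = {
        {name = 'id', type = 'unsigned'},
        {name = 'age', type = 'unsigned'},
    },
})
-- Once the cloning is done, the user does the final drop + rename:
box.space.users:drop()
clone:rename('users')
```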
More interesting outcomes:
Pros:
Cons:
♻️ Solution: force async
Why not make index creation and space format alter `TXN_FORCE_ASYNC` txns? I.e., they wouldn't block the limbo. Right before their completion on the master we could lock the limbo, wait until all its txns are confirmed, and then commit this one.

Seemingly it could be safe - if a master at the time of this DDL commit had all the data of this space replicated and confirmed, it means all the replicas must be able to apply this format/index too.
There is a waiting time until the limbo is locked and gets emptied, but it only depends on the replication and the replicas' WAL speed. Not on the space size.
It wouldn't work right away, because the appliers on the replicas would still be executing those txns for a very long time, but this can be solved by executing them in their own fiber, not in the applier's main fiber. Perhaps this would bring other complications, and yet the solution looked worthy of consideration.
However, it was rejected. Firstly, because replicas might have different data due to `before_replace` triggers. Secondly, because it introduces a point of potential split-brain: any async txn with election=auto can cause a split-brain even with a proper quorum.

Pros
Cons
⚒️ Build: with engine-specific parts, intercepting every DML, catch-up via `txn_stmt::on_commit()` triggers.
❗️ Below is the unchanged initial algorithm of the index build, proposed in some older versions of the RFC.
Pros: doesn't use the `space::on_replace()` trigger, so the Vinyl space won't start doing reads on every DML just to fetch the old tuple for the trigger.

Cons: it catches up with the skipped statements via `txn_stmt::on_commit()` triggers, which won't be possible for Vinyl due to yields on reads (any on-commit code can't yield). This part has to be reworked if the solution is brought back for any reason. For example, actually do the DML before the commit; if a duplicate is found, abort the index build in a `txn_stmt::on_commit()` trigger, but don't abort the current transaction (that would affect the user's code).

Further down is the algorithm.
1. The index is added to the `struct space::index` array on commit into `_index`. It becomes visible everywhere, but not available for DQL (at the `index_vtab` level).
2. The build key starts as `null`. I.e., the minimal primary index key.
3. DML on the space is intercepted (via the `index_vtab` or some other way): statements with keys above the build key are skipped, the rest are applied to the new index too.
4. The primary index is iterated with `READ_CONFIRMED` isolation. On each iteration a portion of tuples is inserted into the new index, and the build key moves forward.

While sounding simple, it actually contains a lot of hidden icebergs in the engine waters. See the known ones below.
What happens if a txn statement greater than the build key was ignored, but then the index build process reached this key before the txn was committed? Three build-key-related mix situations are possible, depending on whether the txn gets committed or rolled back, and whether that happens before or after the build reaches its key.

Situation 1 is bad, because the build process took the old tuple, since it was reading with read-confirmed isolation, and the new one was ignored. Now the index is outdated in this key.

Situation 2 is OK, because the new tuple wasn't seen, and will never be seen by anybody anyway, because it was rolled back.

Situation 3 is OK as well. The index, when it reaches this key, will just take whatever was the last confirmed version.
Let's inspect situation 1 closer then. When blindly skipping statements > build key, the problem is that on commit we lose the new version of the key's tuple. Neither Memtx nor Vinyl notifies all indexes of a space when something gets committed into it. In Vinyl the situation is even worse than in Memtx, because if a statement was skipped for a certain `vy_lsm` and has no `txv` object, then on commit it won't be put into the LSM-tree's memory level. That only happens for created `txv`s.

The only solution on the surface seems to be to skip the statements which are > build key, but install for each of them an on-commit trigger, right on the specific `txn_stmt` objects. The trigger would check if the statement at the moment of commit is already <= build key, and if so, it would be applied to the new index too.

Logically that should resolve the key-mix problem. But implementation-wise it creates new problems. See about that in the next point.
Vinyl doesn't use the index vtab for DML and has deferred deletes.

In the case of Memtx, the on-commit trigger mentioned in the previous section would just provide the old and the new tuple, and they can be simply applied to the new index. Delete the old, insert the new.

In the case of Vinyl it is never simple.

The first issue is that when any DML happens on a Vinyl space, it goes through `vy_delete()`, `vy_update()`, `vy_replace()`, `vy_upsert()`. From the moment of entering those functions the code doesn't use the index vtab. It is a problem, because now for some LSM-trees of a space (the in-progress ones) it becomes necessary to filter out their DML - to apply changes <= build key, and set an on-commit trigger for the others. This filtering, if done at the index vtab level, would just do nothing in Vinyl, because Vinyl doesn't use `struct index_vtab::replace`.

In some form that also has to stay like this, because in Vinyl it is quite expensive to read the old tuple for each DML for things like updates. It is important that it is read once from the PK (if read at all), and then used for updating all the LSM-trees of that space without reading them (if they are not unique).

A "delete" statement in Vinyl doesn't ever read the old tuple, regardless of uniqueness. It remains unknown. A replace can also skip the old tuple lookup entirely (if no indexes are unique besides the primary one).
That means not only that Vinyl can't just define `struct index_vtab::replace` and be done with it, but also that its statement's on-commit trigger sometimes just won't have `old_tuple` specified for deletes or replaces. It would only have `request->key` for a delete and `stmt->new_tuple` for a replace and insert. The only safe case is the update, but that doesn't matter.

That in turn means that the index build has to have engine-specific code. So Vinyl's build-index-on-stmt-commit trigger could read one set of fields in `txn_stmt`, and in Memtx it would just rely on `old_tuple` and `new_tuple`. Although it would probably be simpler than the index build code is right now anyway. For Memtx the trigger will be "delete old + insert new". For Vinyl the trigger has to observe the `txn_stmt->row` to find out the statement type and re-apply it the same way as the other LSM-trees did. I.e., make a deferred delete, a blind replace, etc.
The exact way of implementation, even if turned down during the code review, could be redone relatively easily, if the main logic above is approved. Nonetheless, I could mention here some ideas to validate before the coding starts:

- `vy_delete()` deferred deletes can actually be skipped by the new index just fine. Such deletes affect all indexes implicitly, and would even affect the newly building index too.
- Every `for (uint32_t i = 0; i < space->index_count; i++)` must check if the index is being built. Such an index would have a generic build context object of some sort + a Vinyl-specific context. If such an index is met, then the DML is skipped (a `txv` is not created for it), and an on-commit trigger is set for the current `txn_stmt`.
- On commit, the trigger re-applies the `txn_stmt->row` statement.
- Ideally the catch-up insertion would happen without `struct txn` transactions at all - directly into the memory level of the LSM-tree. If that is not possible, then a temporary `vy_tx` transaction is created just for this one statement and immediately committed. Without any WAL entries, of course. If just a `vy_tx` is not enough, then an actual `struct txn` transaction can be fired.

And since this work of "statement catch-up" is pushed down to the engine level, Memtx would also have to define such a thing for its own indexes. Although it must be much simpler than for Vinyl.
The index build process would then activate this "catch up" and then start the build.